Generated with a machine learning model I've been tinkering with. Finally got one that works fairly well with Archspire, and early on in training no less - I'm training a new version now and will reupload most of Bleed the Future once it's finished, since there are definitely still vocals bleeding through. This new version is the result of an entirely new architecture. This one is fun, but since I'm not in academia or research and would never fit in there, I'll go ahead and just talk about it in a fucking YouTube description lol...
This one is just a single U-Net, interestingly enough - not a dense net - and this is at the 4th epoch (the 1st at the highest learning rate of 0.001). The main difference here is that frames in the spectrogram do NOT see each other outside of the frame transformer. All encoders use 3x1 kernels that convolve only features within the same frame, so they encode frequency only.
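Roughly, one of those frequency-only encoders looks something like this - just a loose PyTorch sketch of what I described above, not the actual code; the layer count, norm, activation and downsampling choices here are all guesses:

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Rough sketch of a frequency-only encoder block (details assumed).

    Input is assumed to be [batch, channels, freq_bins, frames]. The (3, 1)
    kernels mix frequency bins within a frame but never touch neighboring
    frames, so all inter-frame mixing is left to the frame transformer.
    """
    def __init__(self, in_channels, out_channels, downsample=True):
        super().__init__()
        stride = (2, 1) if downsample else (1, 1)  # downsample frequency only, never frames
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=(3, 1), padding=(1, 0))
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=(3, 1), stride=stride, padding=(1, 0))
        self.norm = nn.BatchNorm2d(out_channels)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):              # x: [B, C, F, T]
        x = self.act(self.conv1(x))
        x = self.act(self.norm(self.conv2(x)))
        return x                       # frame axis (T) untouched, frequency mixed/downsampled
```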
After 5 encoders, all inter-frame communication happens inside a (slightly modified) evolved transformer encoder architecture (post-norm instead of pre-norm, though that will probably change). So the only places where frames see other frames are inside the transformer block: the 1x3 convolution, the 1x9 separable convolution, and the multi-headed attention (where each head is in effect a frequency band). The frame transformer bottlenecks its input to a single channel using a 3x1 kernel, and that channel is then concatenated back onto its input - this is the only way inter-frame information gets back into the rest of the network, since all the other bottlenecks are 1x1 kernels.
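And the frame transformer itself, very roughly - again just a hedged PyTorch sketch of what's described above (the 3x1 single-channel bottleneck, the 1x3 conv, the 1x9 separable conv, attention over frames with each head acting as a frequency band, post-norm). The exact evolved-transformer layout, norms and sizes here are assumptions, not the real code:

```python
import torch
import torch.nn as nn

class FrameTransformerEncoder(nn.Module):
    """Hypothetical sketch of the frame transformer block.

    Input: [batch, channels, freq_bins, frames]. The block bottlenecks to a
    single channel with a (3, 1) kernel, mixes information ACROSS frames with
    a (1, 3) conv, a (1, 9) separable conv, and multi-head attention over the
    frame axis (each head covering one frequency band), then returns the
    single-channel result so the caller can concatenate it back onto the
    block's input. freq_bins must be divisible by num_heads.
    """
    def __init__(self, channels, freq_bins, num_heads=4):
        super().__init__()
        # (3, 1) bottleneck: frequency-only, compresses channels down to 1
        self.bottleneck = nn.Conv2d(channels, 1, kernel_size=(3, 1), padding=(1, 0))

        # inter-frame convolutions: kernels span the frame (time) axis only
        self.conv1x3 = nn.Conv2d(1, 1, kernel_size=(1, 3), padding=(0, 1))
        self.sep1x9 = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=(1, 9), padding=(0, 4)),  # depthwise part (single channel)
            nn.Conv2d(1, 1, kernel_size=1),                        # pointwise part
        )

        # attention over frames: embed dim = freq_bins, so each head is a band
        self.attn = nn.MultiheadAttention(embed_dim=freq_bins, num_heads=num_heads, batch_first=True)

        # post-norm: normalization applied AFTER each residual sum
        self.norm1 = nn.LayerNorm(freq_bins)
        self.norm2 = nn.LayerNorm(freq_bins)
        self.act = nn.GELU()

    def forward(self, x):
        # x: [B, C, F, T] -> single channel [B, 1, F, T]
        h = self.bottleneck(x)

        # conv sub-block (inter-frame), residual + post-norm
        c = self.act(self.conv1x3(h)) + self.sep1x9(h)
        h = (h + c).squeeze(1).transpose(1, 2)   # [B, T, F]: frames as sequence steps
        h = self.norm1(h)

        # attention sub-block: frames attend to frames, residual + post-norm
        a, _ = self.attn(h, h, h)
        h = self.norm2(h + a)

        # back to [B, 1, F, T]; the caller concatenates this onto its input
        return h.transpose(1, 2).unsqueeze(1)


# usage sketch: concatenate the single-channel output back onto the input
x = torch.randn(1, 64, 256, 128)                       # [B, C, F, T]
block = FrameTransformerEncoder(channels=64, freq_bins=256)
out = torch.cat([x, block(x)], dim=1)                  # -> [B, 65, 256, 128]
```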