Reuploading: I bought the WAV file from Bandcamp for higher quality, and this model is half an epoch further along with a lower validation error.
Have pretty much finalized my current architecture. The current version is learning fast and is still improving rapidly without the learning rate even being dropped yet. Quite excited to see how this version ends up, as it's technically only on the 4th epoch lol.
This architecture uses what I call a frame transformer. The encoder uses Nx1 kernel convolutions with a stride of Mx1 to encode and downsample the frames of a spectrogram without respect to time. The decoder then processes the encoded frames using the evolved frame transformer architecture, modified to use the relative positional encoding from Google's Music Transformer. It applies a transformer block before every upsampling of the frequency axis as well as before the output layer, and uses the skip connections from the encoding half of the u-net as memory, letting the upsampled representation query against its original representation for global information. A rough sketch of the idea is below.
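Here's a minimal PyTorch sketch of the two pieces described above. The class names and hyperparameters are just illustrative, and a stock nn.TransformerDecoderLayer stands in for the actual evolved transformer block; the Music Transformer relative positional encoding isn't reproduced here. What it does show is the Nx1 kernel / Mx1 stride frame convolutions and the skip-connection-as-memory cross-attention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Downsamples the frequency axis with Nx1 convolutions.

    Input is (batch, channels, freq, time); the (N, 1) kernel and
    (M, 1) stride mean every time frame is encoded independently of
    its neighbors, i.e. without respect to time.
    """
    def __init__(self, in_ch, out_ch, n=3, m=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=(n, 1),
                              stride=(m, 1), padding=(n // 2, 0))
        self.norm = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.leaky_relu(self.norm(self.conv(x)))

class FrameDecoder(nn.Module):
    """Upsamples the frequency axis, then runs a transformer block that
    uses the matching encoder skip connection as memory.

    nn.TransformerDecoderLayer is a placeholder for the real block.
    """
    def __init__(self, in_ch, out_ch, embed_dim, num_heads=4):
        super().__init__()
        self.upsample = nn.ConvTranspose2d(in_ch, out_ch,
                                           kernel_size=(2, 1), stride=(2, 1))
        self.attn = nn.TransformerDecoderLayer(d_model=embed_dim,
                                               nhead=num_heads,
                                               batch_first=True)

    def forward(self, x, skip):
        x = self.upsample(x)  # (B, C, F, T); F now matches the skip
        b, c, f, t = x.shape
        # Flatten each time frame into a token of length C*F so attention
        # runs across frames; the skip connection supplies key/value memory,
        # so the upsampled frames query their original representation.
        q = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        kv = skip.permute(0, 3, 1, 2).reshape(b, t, -1)
        q = self.attn(q, kv)
        return q.reshape(b, t, c, f).permute(0, 2, 3, 1)

if __name__ == "__main__":
    spec = torch.randn(1, 2, 1024, 256)        # (batch, stereo, freq, frames)
    enc = FrameEncoder(2, 8)
    h = enc(spec)                              # -> (1, 8, 512, 256)
    dec = FrameDecoder(8, 2, embed_dim=2 * 1024)
    out = dec(h, spec)                         # -> (1, 2, 1024, 256)
```

Treating each spectrogram frame as a token is the design choice that matters here: attention then compares whole frames across time for global information, while the convolutions stay strictly within a frame.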