Have a new model in progress. This one has only seen the full dataset twice, and even then, given the augmentations, that isn't entirely true (technically, with my augmentations there are 13 million songs in my dataset, which just isn't going to be feasible for me to train on consumer hardware lol). I am still allowing this version to continue training; I will likely stop once it's reached 10 full epochs, but that will really be determined by whether or not it's still learning at that point.
This model is a pretty big departure yet again. The new architecture uses a residual U-Net as a backbone; after each encoder and decoder there are what I call frame primer encoders and decoders. These now extract multiple channels from the convolutional feature maps and carry out multihead attention on each extracted feature map in parallel, using a 5D tensor and batched matrix multiplication. The frame primer decoders utilize the primer decoder architecture; however, instead of the memory being the output from the full encoder run, it is the output from the frame primer encoders concatenated with their input, which allows the model to query for global information from before information is lost via downsampling. It also makes use of rotary positional embeddings now.
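Here's a rough, framework-agnostic NumPy sketch of what I mean by the 5D-tensor attention: several feature maps extracted from the conv features, each with its own projection, attended over in parallel with rotary embeddings on the queries and keys. All the shapes, the per-map projection weights, and the names here are illustrative assumptions, not the actual implementation:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def rope(x, base=10000.0):
    """Rotary positional embedding over (..., T, d), d even: each feature
    pair is rotated by a position-dependent angle."""
    T, d = x.shape[-2], x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)       # (half,)
    ang = np.arange(T)[:, None] * freqs[None, :]    # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def frame_primer_attention(x, wq, wk, wv, n_heads):
    """x: (B, N, T, E) -- N extracted feature maps, T frames, E features/frame.
    wq/wk/wv: (N, E, E) -- a separate projection per extracted map.
    All N maps (and all heads) are attended over at once via 5D tensors."""
    B, N, T, E = x.shape
    d = E // n_heads
    # per-map projections as one batched matmul each
    q = np.einsum('bnte,nef->bntf', x, wq)
    k = np.einsum('bnte,nef->bntf', x, wk)
    v = np.einsum('bnte,nef->bntf', x, wv)
    # split heads -> 5D tensors of shape (B, N, heads, T, d)
    split = lambda t: t.reshape(B, N, T, n_heads, d).transpose(0, 1, 3, 2, 4)
    q, k, v = split(q), split(k), split(v)
    # rotary embeddings on queries and keys (the frame axis is the sequence axis)
    q, k = rope(q), rope(k)
    # scaled dot-product attention, batched over B, N, and heads simultaneously
    attn = softmax(q @ np.swapaxes(k, -2, -1) / np.sqrt(d))
    out = attn @ v                                  # (B, N, heads, T, d)
    return out.transpose(0, 1, 3, 2, 4).reshape(B, N, T, E)

# quick shape check with dummy data
x = np.random.randn(2, 4, 16, 32)   # batch 2, 4 maps, 16 frames, 32 features
wq, wk, wv = (np.random.randn(4, 32, 32) for _ in range(3))
print(frame_primer_attention(x, wq, wk, wv, n_heads=4).shape)  # (2, 4, 16, 32)
```

The point of the 5D layout is that the batch, map, and head axes all collapse into the leading broadcast dimensions of the matmul, so every extracted feature map is processed in one kernel launch rather than in a Python loop.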