Have a new model in progress. The source here was a low-quality version of this song that had already been uploaded to YouTube, so yeah, bit of double YouTube compression going on lol. This model has only seen the full dataset twice, and even that isn't entirely true given augmentations (technically with my augmentations there are 13 million songs in my dataset, which just isn't going to be feasible to train on with consumer hardware lol). I am still allowing this version to continue training; I will likely stop once it's reached 10 full epochs, but that will really be determined by whether or not it's still learning at that point.
This model is a pretty big departure yet again. The new architecture uses a residual U-Net as a backbone; after each encoder and decoder there are what I call frame primer encoders and decoders. These now extract multiple channels from the convolutional feature maps and carry out multihead attention on each extracted feature map in parallel, using a 5D tensor and batched matrix multiplication. The frame primer decoders utilize the primer decoder architecture, however instead of the memory being the output from the full encoder run, it is the output from the frame primer encoders concatenated with their input, which allows the model to query for global information from before information is lost to downsampling. It also makes use of rotary positional embedding now.
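To make the parallel-attention idea concrete, here's a minimal NumPy sketch of the batched 5D attention step. Everything here is an assumption for illustration only: the `parallel_frame_attention` name, the channel extraction (just averaging channel groups instead of a learned projection), and the shapes are mine, and the real frame primer blocks would use learned Q/K/V projections plus the rotary embeddings mentioned above, which are omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parallel_frame_attention(feat, num_maps=4, num_heads=2):
    """Toy multihead self-attention run in parallel over several feature
    maps extracted from a conv feature map, batched via a 5D tensor.

    feat: (batch, channels, bins, frames) convolutional feature map.
    Returns: (batch, num_maps, bins, frames) attended feature maps.
    """
    b, c, bins, frames = feat.shape
    # toy "channel extraction": average channel groups into num_maps maps
    # (a real model would use a learned projection here)
    maps = feat.reshape(b, num_maps, c // num_maps, bins, frames).mean(axis=2)
    # treat each frame as a token whose feature vector is its spectrum
    x = maps.transpose(0, 1, 3, 2)                  # (b, maps, frames, bins)
    head_dim = bins // num_heads
    # split features across heads -> 5D tensor (b, maps, heads, frames, head_dim)
    q = k = v = (x.reshape(b, num_maps, frames, num_heads, head_dim)
                  .transpose(0, 1, 3, 2, 4))
    # np.matmul broadcasts over the (b, maps, heads) leading dims, so every
    # extracted map's attention runs in one batched matrix multiplication
    scores = q @ k.transpose(0, 1, 2, 4, 3) / np.sqrt(head_dim)
    out = softmax(scores) @ v                       # (b, maps, heads, frames, head_dim)
    out = out.transpose(0, 1, 3, 2, 4).reshape(b, num_maps, frames, bins)
    return out.transpose(0, 1, 3, 2)                # back to (b, maps, bins, frames)

feat = np.random.randn(2, 8, 16, 32)
print(parallel_frame_attention(feat).shape)         # (2, 4, 16, 32)
```

The point of the 5D layout is that no Python loop over extracted maps is needed: the leading (batch, maps, heads) dimensions are all handled by broadcasting inside a single batched matmul.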