I wanted to upload a better-known song for reference. This model has only gone through its dataset twice, so it's only on its second full epoch. It still has quite a bit of training left, but I wanted to post some videos to show its accuracy at this point, and I think the quality is sufficient for a teaser. I will let the model train for a few more days and post updated videos then. This was made by feeding the audio of the music video into my neural network; I will probably find a higher quality version if there is interest.
This model is a pretty big departure yet again. The new architecture uses a residual U-Net as a backbone. After each encoder and decoder there are what I call frame primer encoders and decoders. These now extract multiple channels from the convolutional feature maps and carry out multi-head attention on each extracted feature map in parallel using a 5D tensor and batched matrix multiplication. The frame primer decoders use the Primer decoder architecture; however, instead of the memory being the output from the full decoder run, it is the output from the frame primer encoders concatenated with its input, which lets the model query for global information from before the loss of information caused by downsampling. The model also uses rotary positional embedding now.
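For anyone curious what the parallel attention looks like in practice, here is a rough NumPy sketch of the idea, not the actual model code. It assumes feature maps shaped (batch, channels, frequency bins, frames), a 1x1 convolution for extracting channels, and each frame treated as one token. All names (`frame_attention`, `w_extract`, `memory`, and so on) are made up for illustration, and how the real decoder memory is built is not shown.

```python
import numpy as np

def rotary(x):
    """Rotary positional embedding over the last two axes (frames, head_dim)."""
    t, d = x.shape[-2], x.shape[-1]
    half = d // 2  # head_dim is assumed to be even
    inv_freq = 1.0 / (10000.0 ** (np.arange(half) / half))
    angles = np.arange(t)[:, None] * inv_freq[None, :]  # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by an angle that depends on the frame index.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def frame_attention(x, w_extract, w_q, w_k, w_v, w_o, num_heads, memory=None):
    """x: (B, C, F, T). Returns (B, M, F, T), where M is the number of extracted maps."""
    # Assumed 1x1 convolution: mix C input channels down to M extracted maps.
    maps = np.einsum("mc,bcft->bmft", w_extract, x)   # (B, M, F, T)
    tokens = maps.transpose(0, 1, 3, 2)               # (B, M, T, F), one token per frame
    mem = tokens if memory is None else memory        # hypothetical decoder memory
    b, m, t, f = tokens.shape
    d = f // num_heads

    def split_heads(z):
        # (B, M, T, F) -> (B, M, H, T, D): the 5D tensor
        return z.reshape(b, m, z.shape[2], num_heads, d).transpose(0, 1, 3, 2, 4)

    q = rotary(split_heads(tokens @ w_q))
    k = rotary(split_heads(mem @ w_k))
    v = split_heads(mem @ w_v)
    # One batched matmul covers every batch item, extracted map, and head at once.
    scores = q @ k.transpose(0, 1, 2, 4, 3) / np.sqrt(d)  # (B, M, H, T, T)
    out = softmax(scores) @ v                              # (B, M, H, T, D)
    out = out.transpose(0, 1, 3, 2, 4).reshape(b, m, t, f) @ w_o
    return out.transpose(0, 1, 3, 2)                       # back to (B, M, F, T)

# Toy usage with random weights: 8 channels in, 3 extracted maps, 4 heads.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 16, 10))
w_extract = rng.standard_normal((3, 8))
w_q, w_k, w_v, w_o = (rng.standard_normal((16, 16)) * 0.1 for _ in range(4))
out = frame_attention(x, w_extract, w_q, w_k, w_v, w_o, num_heads=4)
```

Passing a `memory` array of shape (B, M, T', F) turns the same function into cross-attention, which is roughly the role the frame primer encoder output plays for the frame primer decoders.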