Flub - Flub - 01 Last Breath (Instrumental)
Feel free to make requests.
New version of the model; there are definitely some places for improvement, but it's still training (yet another version built from the ground up). This is only three epochs through the pre-augmented dataset; the fully combined dataset is over 14 million songs now, so a true full epoch likely won't be possible. There are 768 true pairs of instrumental and instruments+vocals tracks, and there are 10,372 or so instrumental songs that are randomly paired with 1,379 vocal tracks for a total of 14,302,988 unique combinations of songs alone. And this is before they are diced into 40-second slices (six per combination, going by those totals) for a total of around 85,817,928 unique training items before any further augmentation.
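To make the combinatorics concrete, here is a minimal sketch of how those numbers multiply out; the names and the sampling function are hypothetical, not the actual dataset code:

```python
# Hypothetical sketch of the random-pairing scheme described above;
# names and structure are my own, not the project's actual code.
import random

TRUE_PAIRS = 768          # real instrumental / instruments+vocals pairs
INSTRUMENTALS = 10_372    # instrumental-only songs
VOCAL_TRACKS = 1_379      # vocal tracks randomly mixed over the instrumentals
SLICES_PER_SONG = 6       # 40-second windows per song (85,817,928 / 14,302,988)

combos = INSTRUMENTALS * VOCAL_TRACKS    # 14,302,988 unique song pairings
items = combos * SLICES_PER_SONG         # 85,817,928 training slices

def sample_training_item():
    """Draw one (instrumental, vocal, slice) index the way the combinatorics imply."""
    instrumental = random.randrange(INSTRUMENTALS)
    vocal = random.randrange(VOCAL_TRACKS)
    slice_idx = random.randrange(SLICES_PER_SONG)
    # The mix would be instrumental + vocal audio; the target is the instrumental.
    return instrumental, vocal, slice_idx

print(f"{combos:,} song combinations, {items:,} pre-augmentation slices")
```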
The next step for this project will be to develop an app that people can use themselves to convert their favorite albums. The current plan is to have both a mobile and a desktop variant; however, the mobile variant would require cloud computing and thus would likely need a very limited number of users or a small fee. The desktop variant would be 100% free, though it requires a pretty beefy GPU.
This architecture is what I call a frame primer: a custom neural network architecture I came up with, drawing inspiration from multiple research papers, and trained on a custom dataset. It is a single residual u-net that downsamples only the frequency dimension while preserving the temporal dimension. The residual u-net consists of 6 frame encoders, the first of which does not downsample, and 5 frame decoders; these use a stride of 2x1 and a kernel size of 3x3.

Each frame encoder is followed by a frame primer encoder module. The frame primer modules extract a specified number of channels from the convolutional feature maps; each extracted channel is divided into a specified number of bands, and attention is then calculated across frames within each band in parallel. In a sense, this is "multihead multi-dconv-band attention" rather than a single-channel multi-dconv-head attention that calculates attention between frames for one channel. The outputs of the frame primer modules are concatenated with their inputs and sent to the following encoder or decoder for further processing in the u-net.

Each decoder is followed by a frame primer decoder, which uses the primer decoder architecture; however, instead of 'memory' I refer to the second attention path as skip attention, since it attends over the skip connections in the u-net. This allows the upsampled representation to query the pre-downsampled representation for global information. Rough sketches of these pieces follow.
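First, a minimal PyTorch sketch of a frame encoder under my reading of the description: a residual convolution block with a 3x3 kernel and a 2x1 stride, so the frequency dimension is halved while the temporal dimension is preserved. Layer choices and names here are assumptions, not the actual implementation:

```python
# Hedged sketch of a frame encoder: a residual conv block that downsamples
# frequency (stride 2) while preserving time (stride 1). Exact layers and
# names are my own guesses, not the actual frame primer code.
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    def __init__(self, in_ch, out_ch, downsample=True):
        super().__init__()
        stride = (2, 1) if downsample else (1, 1)  # (freq, time): halve freq only
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection on the identity path so the residual shapes match
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):                 # x: (batch, ch, freq, time)
        return self.act(self.body(x) + self.skip(x))

x = torch.randn(1, 2, 1024, 256)          # stereo spectrogram: 1024 bins, 256 frames
enc = FrameEncoder(2, 32)
print(enc(x).shape)                       # torch.Size([1, 32, 512, 256]), time preserved
```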
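Next, my reading of the "multihead multi-dconv-band attention" step as a sketch: a few channels are extracted from the feature map, each channel's frequency axis is split into bands, and attention runs across frames within every (channel, band) pair in parallel. The Primer-style depthwise convolutions on the query/key/value projections are omitted for brevity, and all names are assumptions:

```python
# Hedged sketch of multi-band frame attention: extract channels, split each
# channel's frequency axis into bands, and attend across frames within every
# band in parallel. This is my reading of the description, not the real module.
import torch
import torch.nn as nn

class MultiBandFrameAttention(nn.Module):
    def __init__(self, in_channels, attn_channels=4, num_bands=8, num_freq=512):
        super().__init__()
        assert num_freq % num_bands == 0
        self.extract = nn.Conv2d(in_channels, attn_channels, kernel_size=1)
        self.attn_channels, self.num_bands = attn_channels, num_bands
        band_dim = num_freq // num_bands  # one band's frequency bins per frame
        # shared attention over frames; (channel, band) pairs act like heads
        self.attn = nn.MultiheadAttention(band_dim, num_heads=1, batch_first=True)

    def forward(self, x):                 # x: (batch, ch, freq, time)
        b, _, f, t = x.shape
        h = self.extract(x)               # (b, attn_ch, f, t)
        # fold (channel, band) into the batch dim so bands attend in parallel
        h = h.reshape(b, self.attn_channels, self.num_bands, f // self.num_bands, t)
        h = h.permute(0, 1, 2, 4, 3).reshape(-1, t, f // self.num_bands)
        out, _ = self.attn(h, h, h)       # attention between frames, per band
        out = out.reshape(b, self.attn_channels, self.num_bands, t, f // self.num_bands)
        out = out.permute(0, 1, 2, 4, 3).reshape(b, self.attn_channels, f, t)
        return torch.cat([x, out], dim=1)  # concatenate with input for the u-net

x = torch.randn(1, 32, 512, 256)
mod = MultiBandFrameAttention(32, attn_channels=4, num_bands=8, num_freq=512)
print(mod(x).shape)                       # torch.Size([1, 36, 512, 256])
```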
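Finally, a sketch of the skip attention path in the frame primer decoder, under the same caveats: the upsampled representation provides the queries and the pre-downsampled skip connection provides the keys and values, in place of the 'memory' path of the original primer decoder:

```python
# Hedged sketch of skip attention: the upsampled decoder features query the
# pre-downsampled encoder features from the skip connection for global
# information. Names and shapes are assumptions, not the actual implementation.
import torch
import torch.nn as nn

class SkipAttention(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=1, batch_first=True)

    def forward(self, upsampled, skip):
        # both: (batch, freq, time) for one extracted channel; each frame's
        # frequency column is a token embedding, frames are the sequence
        q = upsampled.transpose(1, 2)     # (batch, time, freq)
        kv = skip.transpose(1, 2)
        out, _ = self.attn(q, kv, kv)     # decoder frames query encoder frames
        return out.transpose(1, 2)        # back to (batch, freq, time)

up = torch.randn(1, 512, 256)             # upsampled decoder features
sk = torch.randn(1, 512, 256)             # pre-downsampled encoder features
print(SkipAttention(512)(up, sk).shape)   # torch.Size([1, 512, 256])
```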