The Devils of Loudun - Escaping Eternity - 01 The Scourge of Beasts (Instrumental, wip)
Back with a new model after moving! Tons of changes in this model compared to previous versions, and the loss is by far the lowest yet at just the 9th epoch. Excited to train this further.
The main changes in this version are the addition of skip connections for attention and a new pooling mechanism I coded for downsampling. The core architecture is a residual u-net. Each encoder has either a modified CBAM or CMAM attention, CMAM being the convolutional multihead attention I've talked about in previous descriptions. The CBAM in this model uses a sort of 'variance pooling', in that it calculates the variance across features for both the channel gate and the spatial gate. CMAM uses multiheaded channel attention that currently relies on separable convolutions for the QKV and output projections; I'm planning on trying the NLPool2d module I've made in place of the separable convolutions to see what happens. Multihead spatial attention is just four 1x1 convs (in the code I use linear layers for the 1x1 convs as they are slightly faster) followed by your typical multihead attention flow of taking softmax((Q·Kᵀ)/√d)·V.
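Roughly, that spatial attention path could look like the sketch below. This is a minimal reconstruction rather than the actual code, and the class/parameter names (MultiheadSpatialAttention, num_heads, etc.) are placeholders I'm using for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiheadSpatialAttention(nn.Module):
    """Sketch: four '1x1 convs' as linear layers (Q, K, V, out projections),
    then standard scaled dot-product attention over spatial positions."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = channels // num_heads
        self.q_proj = nn.Linear(channels, channels)
        self.k_proj = nn.Linear(channels, channels)
        self.v_proj = nn.Linear(channels, channels)
        self.out_proj = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Flatten spatial dims so each pixel is a token: (B, H*W, C)
        tokens = x.flatten(2).transpose(1, 2)
        q, k, v = self.q_proj(tokens), self.k_proj(tokens), self.v_proj(tokens)

        # Split channels into heads: (B, heads, H*W, head_dim)
        def split(t):
            return t.view(b, h * w, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)

        # softmax(Q Kᵀ / √d) V
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, h * w, c)
        out = self.out_proj(out)
        # Back to image layout (B, C, H, W)
        return out.transpose(1, 2).view(b, c, h, w)
```

A linear layer applied to the channel dimension of each flattened pixel computes exactly what a 1x1 conv does, just without the conv bookkeeping.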
The encoders use a standard wide residual unit and a weird pooling mechanism I made that I'm calling non-linear pooling, NLPool2d being the module. Basically, the image tensor is unfolded into shape B, C*Kh*Kw, L, where the kernel size is KhxKw and L is the number of blocks. This is transposed and reshaped to B, L, C, Kh*Kw and then fed through a feedforward network: Kh*Kw input neurons → widening_factor * Kh * Kw neurons, followed by a nonlinearity, and then widening_factor * Kh * Kw → 1. This uses a residual connection and includes a Kh*Kw → 1 layer for resizing the identity path. The result is then reshaped to the output B, C, nh, nw, where nh and nw are the downsampled height and width.
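A minimal sketch of that idea (my reconstruction - the real NLPool2d may differ, and the GELU nonlinearity and the default kernel/stride/widening_factor values here are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NLPool2d(nn.Module):
    """Sketch of non-linear pooling: each Kh x Kw patch is reduced to one value
    by a small feedforward network instead of a max or average."""
    def __init__(self, kernel_size=2, stride=2, widening_factor=4):
        super().__init__()
        self.kernel_size = kernel_size
        self.stride = stride
        k = kernel_size * kernel_size
        # FFN applied per patch: Kh*Kw -> widening_factor*Kh*Kw -> 1
        self.ffn = nn.Sequential(
            nn.Linear(k, widening_factor * k),
            nn.GELU(),  # nonlinearity is an assumption
            nn.Linear(widening_factor * k, 1),
        )
        # Kh*Kw -> 1 projection to resize the identity for the residual add
        self.identity_proj = nn.Linear(k, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        nh = (h - self.kernel_size) // self.stride + 1
        nw = (w - self.kernel_size) // self.stride + 1
        # Unfold to (B, C*Kh*Kw, L) where L = nh*nw patches
        patches = F.unfold(x, kernel_size=self.kernel_size, stride=self.stride)
        # Rearrange to (B, L, C, Kh*Kw) so the FFN acts on each patch
        patches = patches.view(b, c, self.kernel_size ** 2, -1).permute(0, 3, 1, 2)
        # Residual: FFN output plus a linear resize of the identity patch
        pooled = self.ffn(patches) + self.identity_proj(patches)  # (B, L, C, 1)
        # Back to image layout (B, C, nh, nw)
        return pooled.squeeze(-1).permute(0, 2, 1).view(b, c, nh, nw)
```

Compared to max or average pooling, every output pixel is a learned function of its whole Kh*Kw patch.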
After pooling, every encoder makes use of either CBAM attention or CMAM attention (CBAM is used at the earlier levels because of CMAM's memory consumption).
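For reference, the 'variance pooling' mentioned above can be as simple as the two descriptors below. This is a sketch under my own assumptions: the channel gate takes variance over spatial positions, the spatial gate takes variance across channels, and the function names are mine:

```python
import torch

def channel_variance_pool(x: torch.Tensor) -> torch.Tensor:
    # Per-channel variance over spatial positions: (B, C, H, W) -> (B, C, 1, 1).
    return x.var(dim=(2, 3), keepdim=True)

def spatial_variance_pool(x: torch.Tensor) -> torch.Tensor:
    # Per-position variance across channels: (B, C, H, W) -> (B, 1, H, W).
    return x.var(dim=1, keepdim=True)
```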
The decoders make use of a residual unit as well as slightly modified CBAM/CMAM modules that were adapted to take a 'memory' input, as with transformer decoders in natural language processing - here the memory is the skip connection from the corresponding encoder's attention module. For CBAM, this skip connection is just concatenated with the input to the module: the modified channel gate, for instance, takes the min, max, avg, and variance of each channel of the input, does the same with the skip connection if it is present, concatenates all of the outputs, and sends them through 1x1 convolutions to produce the channel attention values.
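As a sketch, the modified channel gate might look like this. Again, it's a reconstruction rather than the real code; the reduction factor, the sigmoid gating, and the zero-filled statistics used when no memory is passed (so the 1x1 convs keep a fixed input width) are simplifications I'm assuming:

```python
import torch
import torch.nn as nn

class MemoryChannelGate(nn.Module):
    """Sketch of the modified CBAM channel gate: min/max/avg/variance stats per
    channel for the input and the encoder skip ('memory'), concatenated and fed
    through 1x1 convs to produce per-channel attention values."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # 4 descriptors from the input + 4 from the memory = 8*C inputs.
        self.mlp = nn.Sequential(
            nn.Conv2d(8 * channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )

    @staticmethod
    def _stats(x: torch.Tensor) -> torch.Tensor:
        # Per-channel min, max, avg and variance over spatial positions,
        # stacked along the channel dim: (B, 4*C, 1, 1).
        flat = x.flatten(2)
        return torch.cat([
            flat.amin(dim=2, keepdim=True).unsqueeze(-1),
            flat.amax(dim=2, keepdim=True).unsqueeze(-1),
            flat.mean(dim=2, keepdim=True).unsqueeze(-1),
            flat.var(dim=2, keepdim=True).unsqueeze(-1),
        ], dim=1)

    def forward(self, x: torch.Tensor, memory: torch.Tensor = None) -> torch.Tensor:
        stats = self._stats(x)
        mem_stats = self._stats(memory) if memory is not None else torch.zeros_like(stats)
        scale = torch.sigmoid(self.mlp(torch.cat([stats, mem_stats], dim=1)))
        return x * scale
```

In the decoder the memory tensor would be the attention output from the matching encoder level, so in practice it is always present.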