Archspire - Bleed the Future - 08 A.U.M. (Instrumental, wip)

Subscribers: 2,350
Published on ● Video Link: https://www.youtube.com/watch?v=jMVcX9RQCbg
Duration: 3:04
1,464 views

Some vocals bleed through in a few places, but less than usual; the model is still training, and I'll probably have a more finalized version in a day or two.

Might do a video on coding this, though the architecture is somewhat complex and uses concepts from both convolutional neural networks and transformers.

For anyone who cares, my model architecture has been uploaded here: https://github.com/carperbr/vocal-remover-frame-transformer (all code for this architecture is in lib/frame_transformer.py; a normal vocal remover is in nets.py and layers.py for comparison)

Have been training a new version of my model. Like the last video, it uses a weird U-Net that acts only on the frequency axis: all encoder and decoder 2D convolutions use 3x1 kernels, so features are convolved only within the same frame and temporal resolution is preserved. This appears to have a drastic effect on learning. There are no 3x3 convolutions anywhere in this architecture; interframe communication is handled entirely by transformers, specifically a modified evolved transformer encoder and decoder.
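Roughly, a frequency-only conv block looks like this (a minimal sketch with made-up names, not the actual code from the repo):

```python
import torch
import torch.nn as nn

class FrameConv(nn.Module):
    """Sketch: a 2D conv that only mixes the frequency axis.
    Input is (B, C, H, W) with H = frequency bins, W = frames; the 3x1
    kernel never looks across frames, so temporal resolution is kept."""
    def __init__(self, in_channels, out_channels, downsample=False):
        super().__init__()
        stride = (2, 1) if downsample else (1, 1)  # halve frequency only
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=(3, 1), stride=stride,
                              padding=(1, 0))
        self.norm = nn.BatchNorm2d(out_channels)
        self.act = nn.LeakyReLU(inplace=True)

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

x = torch.randn(1, 2, 1024, 256)                   # (B, C, freq bins, frames)
print(FrameConv(2, 32, downsample=True)(x).shape)  # (1, 32, 512, 256)
```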

On the encoding side of the U-Net, evolved transformer encoders take the input, bottleneck it to one channel using a 1x1 convolution, and then treat the frequency bins as the embedding. The tensor is rearranged to B,W,H to reflect this and sent through the evolved transformer encoder, transposing W and H where needed for the 1D convolutions (H acts as the channel dimension there). The only places interframe communication occurs are the wide convolutions of the evolved transformer architecture and the multihead attention module, which in this project is called MultiheadFrameAttention.
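A minimal sketch of that bottleneck-and-rearrange step (hypothetical module name; the real version lives in lib/frame_transformer.py):

```python
import torch
import torch.nn as nn

class FrameEmbedding(nn.Module):
    """Sketch: squeeze the channel dim to 1 with a 1x1 conv, then treat
    each frame's frequency bins as that frame's embedding vector."""
    def __init__(self, channels):
        super().__init__()
        self.bottleneck = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                 # x: (B, C, H, W)
        e = self.bottleneck(x)            # (B, 1, H, W)
        e = e.squeeze(1).transpose(1, 2)  # (B, W, H): one embedding per frame
        return e                          # ready for attention across frames

x = torch.randn(1, 32, 512, 256)
print(FrameEmbedding(32)(x).shape)  # (1, 256, 512)
```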

There are 4 sets of transformer encoders at 4 different frequency resolutions, and 4 sets of decoders at the same resolutions on the decoding path. In that sense this U-Net has extra skip connections: the evolved transformer decoder takes the corresponding skip connection as its memory input, bottlenecked to B,1,H,W and used in multihead attention as normal.
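Sketched with stock PyTorch modules (the repo uses its own MultiheadFrameAttention, so treat the shapes here as the point, not the exact code):

```python
import torch
import torch.nn as nn

bins = 512                                   # frequency bins at this level
skip = torch.randn(1, 64, bins, 256)         # encoder skip: (B, C, H, W)
memory = nn.Conv2d(64, 1, 1)(skip)           # bottleneck -> (B, 1, H, W)
memory = memory.squeeze(1).transpose(1, 2)   # -> (B, W, H), frames as tokens

query = torch.randn(1, 256, bins)            # decoder frames as queries
attn = nn.MultiheadAttention(bins, num_heads=8, batch_first=True)
out, _ = attn(query, memory, memory)         # cross-attend over skip frames
print(out.shape)                             # (1, 256, 512)
```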

Because the convolutional part of the network expands channels as it downscales frequency, both the input and the memory need a bottleneck to project them into an HxW representation. Each resolution stacks two transformer encoders or decoders, each with its own bottleneck, which in effect lets them iteratively pull information from all channels as needed. More layers could draw even more information from the 2D convolutional portion of the architecture, but who knows. Within a stack, the layers are connected in a DenseNet fashion: each layer's output is concatenated with its input and sent through the next layer's bottleneck, as in the sketch below.
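Something like this, using a stock TransformerEncoderLayer as a stand-in for the evolved transformer encoder (hypothetical sketch, not the repo code):

```python
import torch
import torch.nn as nn

class DenseTransformerStack(nn.Module):
    """Sketch of the dense connectivity: each layer has its own 1x1
    bottleneck, and each layer's output is concatenated back onto the
    channel axis before the next layer's bottleneck."""
    def __init__(self, channels, num_layers=2, bins=512, heads=8):
        super().__init__()
        self.bottlenecks = nn.ModuleList([
            nn.Conv2d(channels + i, 1, kernel_size=1)
            for i in range(num_layers)])
        self.encoders = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=bins, nhead=heads,
                                       batch_first=True)
            for _ in range(num_layers)])

    def forward(self, x):                                  # x: (B, C, H, W)
        for bottleneck, encoder in zip(self.bottlenecks, self.encoders):
            e = bottleneck(x).squeeze(1).transpose(1, 2)   # (B, W, H)
            e = encoder(e)                                 # frame attention
            e = e.transpose(1, 2).unsqueeze(1)             # (B, 1, H, W)
            x = torch.cat([x, e], dim=1)                   # dense concat
        return x

x = torch.randn(1, 32, 512, 128)
print(DenseTransformerStack(32)(x).shape)  # (1, 34, 512, 128)
```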

Positional encoding is weird in this model. It uses a distance encoding rather than a standard positional encoding: after Q*K^T, a precomputed matrix of pairwise frame distances (built in the module's constructor) is applied through a collection of learned per-head distance weights. This lets each head focus on general regions, a more big-picture way of honing in on where to pay attention. I am going to try adding relative positional embeddings next as a kind of fine-tuning; you could think of the distance encoding plus distance weights as a low-frequency form of positional attention and relative positional embeddings as a higher-frequency form.
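In rough PyTorch, the distance bias works like this (hypothetical names, simplified relative to what's in the repo):

```python
import torch
import torch.nn as nn

class DistanceBiasedAttention(nn.Module):
    """Sketch: a precomputed |i - j| frame-distance matrix is mapped
    through per-head learned weights and added to the attention logits
    after Q @ K^T."""
    def __init__(self, bins, heads, max_frames):
        super().__init__()
        self.heads = heads
        self.scale = (bins // heads) ** -0.5
        self.qkv = nn.Linear(bins, bins * 3)
        self.out = nn.Linear(bins, bins)
        idx = torch.arange(max_frames)
        self.register_buffer('dist', (idx[None, :] - idx[:, None]).abs())
        # one learned weight per (head, distance) pair
        self.dist_weights = nn.Parameter(torch.zeros(heads, max_frames))

    def forward(self, x):                                  # x: (B, W, H)
        b, w, h = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, w, self.heads, -1).transpose(1, 2)   # (B, heads, W, d)
        k = k.view(b, w, self.heads, -1).transpose(1, 2)
        v = v.view(b, w, self.heads, -1).transpose(1, 2)
        logits = q @ k.transpose(-2, -1) * self.scale      # (B, heads, W, W)
        bias = self.dist_weights[:, self.dist[:w, :w]]     # (heads, W, W)
        attn = (logits + bias.unsqueeze(0)).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, w, h)
        return self.out(out)

x = torch.randn(1, 128, 512)
print(DistanceBiasedAttention(512, 8, 256)(x).shape)  # (1, 128, 512)
```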




Other Videos By Benjamin Carper


2022-08-04 Within The Ruins - World Undone (Instrumental, work in progress)
2022-08-04 Within The Ruins - Resurgence (Instrumental, work in progress)
2022-08-04 AC DC - Back In Black (Instrumental, work in progress)
2022-05-04 Equilibrium - Sagas - 03 Blut Im Auge (Instrumental)
2022-05-04 The Zenith Passage - Solipsist - 02 Holographic Principle II - Convergence (Instrumental)
2022-04-29 Sonata in B Minor (intro section, v2; neo-Baroque metal)
2022-04-26 The Zenith Passage - Algorithmic Salvation (Instrumental, v3)
2022-04-26 The Zenith Passage - Synaptic Depravation (Instrumental, v2)
2022-03-27 Between the Buried and Me - The Parallax II Future Sequence - 02 Astral Body (Instrumental, wip)
2022-03-27 Archspire - Bleed the Future - 08 A.U.M. (Vocals only, wip)
2022-03-27 Archspire - Bleed the Future - 08 A.U.M. (Instrumental, wip)
2022-03-15 Archspire - Bleed the Future - 05 Drain of Incarnation (Instrumental, wip v2)
2022-03-12 First Fragment - Gloire Éternelle - 05 De Chair Et De Haine (v2, AI Instrumental WIP)
2022-03-12 First Fragment - Gloire Éternelle - 02 Solus (AI Instrumental WIP)
2022-02-19 The Devils of Loudun - Escaping Eternity - 01 The Scourge of Beasts (Instrumental, wip)
2022-01-06 First Fragment - Gloire Éternelle - 09 In'el (AI Instrumental)
2022-01-05 Archspire - Bleed the Future - 05 Drain of Incarnation (Instrumental, work in progress)
2021-11-23 First Fragment - Gloire Éternelle (AI Instrumental, work in progress)
2021-11-16 Archspire - Bleed the Future - 06 Acrid Canon (Instrumental WIP, v2)
2021-11-13 Archspire - Bleed the Future - 07 Reverie on the Onyx (Instrumental WIP)
2021-11-06 First Fragment - Dasein - 08 Gula (Instrumental WIP)