New model, finally. Still a work in progress; some vocals definitely still come through. Next I'll be training a version that's twice as deep as this one. Preliminary tests with even bigger models show they definitely learn, they just take longer.

I've done a rewrite of the architecture yet again. This model uses a deep residual U-Net, specifically one that's 126 convolutional layers deep. The bridge of the U-Net uses what I'm calling multihead attention residual units. Basically, it's just a wide residual unit where, before the residual add, there's a convolutional multihead attention module. That module itself consists of two multihead attention modules: one computes attention with respect to channels and the other with respect to features. The channel attention uses 9x9 shared-kernel convolutions to project all channels into a shared space and contextualize each feature; its output projection, though, is just a 9x9 separable convolution. This feeds directly into the feature multihead attention, which uses 1x1 convolutions to project features into a shared space. Both attention modules also use embeddings for positional encoding. Obviously inspired by transformers, but I'm not sure you could call this a transformer...

Seems to work pretty well, and it allows for an enormous neural network compared to my previous versions, one that trains faster and achieves higher quality.
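To make the bridge unit concrete, here's a minimal PyTorch sketch of how I'd wire it up from the description above. The class names, head count, channel count, fixed spatial size for the positional embeddings, and the normalization/activation choices are illustrative assumptions, not the trained configuration; in particular, I'm reading "shared kernel" as a single 9x9 convolution applied to every channel independently.

```python
import torch
import torch.nn as nn


class ChannelMultiheadAttention(nn.Module):
    """Attention over channels: each channel's feature map is one token."""

    def __init__(self, channels, height, width, num_heads=4):
        super().__init__()
        self.num_heads = num_heads  # height*width must divide evenly by this
        # One 9x9 kernel shared by every channel projects them into a common
        # space (assumption: "shared kernel" = one conv applied channelwise).
        self.shared_q = nn.Conv2d(1, 1, 9, padding=4)
        self.shared_k = nn.Conv2d(1, 1, 9, padding=4)
        self.shared_v = nn.Conv2d(1, 1, 9, padding=4)
        # Learned positional embedding (shape is an assumption).
        self.pos = nn.Parameter(torch.zeros(1, channels, height, width))
        # Output projection is a 9x9 separable conv: depthwise then pointwise.
        self.out = nn.Sequential(
            nn.Conv2d(channels, channels, 9, padding=4, groups=channels),
            nn.Conv2d(channels, channels, 1),
        )

    def _shared(self, conv, x):
        # Fold channels into the batch so one 1-in/1-out kernel sees them all.
        b, c, h, w = x.shape
        return conv(x.reshape(b * c, 1, h, w)).reshape(b, c, h * w)

    def forward(self, x):
        b, c, h, w = x.shape
        x = x + self.pos
        # Tokens are the c channels; each embeds its flattened feature map,
        # split across heads.
        q, k, v = (self._shared(m, x).reshape(b, c, self.num_heads, -1)
                   .transpose(1, 2)
                   for m in (self.shared_q, self.shared_k, self.shared_v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5,
                             dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return self.out(out)


class FeatureMultiheadAttention(nn.Module):
    """Attention over spatial positions: each position is one token."""

    def __init__(self, channels, height, width, num_heads=4):
        super().__init__()
        self.num_heads = num_heads  # channels must divide evenly by this
        # 1x1 convolutions project the features into a shared space.
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.pos = nn.Parameter(torch.zeros(1, channels, height, width))
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        x = x + self.pos
        # Tokens are the h*w positions; each embeds its channel vector.
        q, k, v = (m(x).reshape(b, self.num_heads, c // self.num_heads, h * w)
                   .transpose(-2, -1) for m in (self.q, self.k, self.v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5,
                             dim=-1)
        out = (attn @ v).transpose(-2, -1).reshape(b, c, h, w)
        return self.out(out)


class MultiheadAttentionResidualUnit(nn.Module):
    """Wide residual unit with the attention stack applied before the add."""

    def __init__(self, channels, height, width):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Channel attention feeds directly into feature attention.
        self.attn = nn.Sequential(
            ChannelMultiheadAttention(channels, height, width),
            FeatureMultiheadAttention(channels, height, width),
        )

    def forward(self, x):
        return x + self.attn(self.body(x))


# e.g. a bridge unit on 64-channel, 16x16 feature maps:
unit = MultiheadAttentionResidualUnit(channels=64, height=16, width=16)
print(unit(torch.randn(2, 64, 16, 16)).shape)  # torch.Size([2, 64, 16, 16])
```

Since the feature attention is quadratic in the number of spatial positions, the bridge, where the feature maps are smallest, is the cheapest place to put it.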