Using Transformers to mimic anyone's voice! - VALL-E (Part 1)
Edit: I realized I made a mistake in explaining the Encodec structure (the Quantizer is actually part of the Encoder, hence VALL-E doesn't need to learn the Quantizer and the codebooks). The corrected explanation, as well as the rest of the presentation, can be found in Part 2 here: https://www.youtube.com/watch?v=JZvF1UsCWC8
VALL-E can generate speech for any (English) text in a speaker's voice from just a 3-second audio sample. We will dissect the technology behind it, how it works, and also discuss whether the Transformer architecture is suitable for audio generation.
Special discussion with Tim Scarfe too! Thanks for coming! Support his podcast, Machine Learning Street Talk, for more discussions on ML and AI advances: https://www.youtube.com/c/MachineLearningStreetTalk
Paper (demo page): https://valle-demo.github.io/
Related Papers (Using Neural Encoders and Decoders for Audio Encoding/Decoding - Neural Audio Codecs):
Encodec: https://arxiv.org/abs/2210.13438
SoundStream (first architecture to use Residual Vector Quantization (RVQ)): https://arxiv.org/abs/2107.03312
VQ-VAE (More elaboration on Vector Quantization): https://arxiv.org/pdf/1711.00937.pdf
Processing in time domain:
WaveNet: https://www.deepmind.com/blog/wavenet-a-generative-model-for-raw-audio
Wav2Vec: https://arxiv.org/pdf/1904.05862.pdf
~~~~
0:00 Introduction
3:54 Why it works
7:27 How to represent sound
20:30 Comparison between normal systems and VALL-E
22:07 Large Data
26:12 Data Representation
31:03 Fixed bias helps to speed up learning!
34:52 Discussion on Encodec
1:10:58 Is tokenisation in VALL-E good?
1:16:57 Can Transformers be used for any domain?
1:19:53 Various losses in Encodec
1:24:22 Is the Encodec doing part-whole hierarchy?
1:28:48 How to adapt VALL-E to take in text prompts to condition on speaker information?
1:31:25 Do language models understand?
1:37:11 Mel Spectrogram
1:49:06 Why is Mel Spectrogram still used in modern architectures?
1:52:14 Bias in Structure vs Loss Function
~~~~
AI and ML enthusiast. Likes to think about the essence behind AI breakthroughs and explain them in a simple and relatable way. Also, I am an avid game creator.
Discord: https://discord.gg/fXCZCPYs
LinkedIn: https://www.linkedin.com/in/chong-min-tan-94652288/
Online AI blog: https://delvingintotech.wordpress.com/
Twitter: https://twitter.com/johntanchongmin
Try out my games here: https://simmer.io/@chongmin