End-to-End Adversarial Text-to-Speech (Paper Explained)

Channel: Yannic Kilcher

Subscribers: 301,000

Published on June 10, 2020 2:14:52 PM

Video Link: https://www.youtube.com/watch?v=WTB2p4bqtXU

Duration: 40:49

12,167 views

416

Text-to-speech engines are usually multi-stage pipelines that transform the signal into many intermediate representations and require supervision at each step. When trying to train TTS end-to-end, the alignment problem arises: Which text corresponds to which piece of sound? This paper uses an alignment module to tackle this problem and produces astonishingly good sound.
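To make the alignment idea more concrete, here is a minimal NumPy sketch of one way an aligner could map token features to audio-rate frames: each token gets a length, the cumulative lengths give token centres, and every output frame takes a softmax-weighted (Gaussian kernel) mix of token features. The function name `gaussian_aligner`, the parameter `sigma2`, and the fixed lengths are assumptions for illustration; in the actual model the lengths would be predicted by a network, and the details may differ from the paper.

```python
import numpy as np

def gaussian_aligner(token_feats, token_lengths, num_frames, sigma2=10.0):
    """Interpolate per-token features onto a frame grid (illustrative sketch).

    token_feats:   (N, D) array, one feature vector per token.
    token_lengths: (N,) array of positive lengths, in frames (assumed given here).
    Returns (frames, weights) with shapes (num_frames, D) and (num_frames, N).
    """
    ends = np.cumsum(token_lengths)           # end position of each token
    centres = ends - 0.5 * token_lengths      # centre position of each token
    t = np.arange(num_frames, dtype=float)[:, None]
    # Gaussian-kernel logits: frames close to a token centre weight it highly.
    logits = -((t - centres[None, :]) ** 2) / sigma2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True) # softmax over tokens
    return weights @ token_feats, weights

# Hypothetical example: 3 tokens with one-hot features and lengths 3, 5, 2.
frames, weights = gaussian_aligner(np.eye(3), np.array([3.0, 5.0, 2.0]), num_frames=10)
```

Because the centres come from a cumulative sum of positive lengths, they are increasing, so the dominant token per frame can only move forward in time. Every step is differentiable with respect to the lengths, which is what would let such a module be trained end-to-end.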

OUTLINE:
0:00 - Intro & Overview
1:55 - Problems with Text-to-Speech
3:55 - Adversarial Training
5:20 - End-to-End Training
7:20 - Discriminator Architecture
10:40 - Generator Architecture
12:20 - The Alignment Problem
14:40 - Aligner Architecture
24:00 - Spectrogram Prediction Loss
32:30 - Dynamic Time Warping
38:30 - Conclusion

Paper: https://arxiv.org/abs/2006.03575
Website: https://deepmind.com/research/publications/End-to-End-Adversarial-Text-to-Speech

Abstract:
Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs. Our proposed generator is feed-forward and thus efficient for both training and inference, using a differentiable monotonic interpolation scheme to predict the duration of each input token. It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses constraining the generated audio to roughly match the ground truth in terms of its total duration and mel-spectrogram. To allow the model to capture temporal variation in the generated audio, we employ soft dynamic time warping in the spectrogram-based prediction loss. The resulting model achieves a mean opinion score exceeding 4 on a 5 point scale, which is comparable to the state-of-the-art models relying on multi-stage training and additional supervision.
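As a rough illustration of the soft dynamic time warping mentioned in the abstract, the sketch below implements the generic soft-DTW recursion, in which the hard minimum of classic DTW is replaced by a smooth soft-minimum with temperature `gamma`. This is the textbook form, not necessarily the exact variant used in the paper; the L1 frame distance, the function names, and the default `gamma` are assumptions for illustration.

```python
import numpy as np

def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    z = -np.asarray(values, dtype=float) / gamma
    m = z.max()
    return -gamma * (m + np.log(np.exp(z - m).sum()))

def soft_dtw(x, y, gamma=0.01):
    """Soft-DTW cost between sequences x (n, d) and y (m, d).

    Uses an L1 distance between frames (an illustrative choice).
    As gamma approaches 0 this approaches the classic DTW cost.
    """
    n, m = len(x), len(y)
    cost = np.abs(x[:, None, :] - y[None, :, :]).sum(axis=-1)  # (n, m) pairwise L1
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Best (soft) way to reach cell (i, j): from above, left, or diagonal.
            r[i, j] = cost[i - 1, j - 1] + soft_min(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma
            )
    return r[n, m]

# Hypothetical example: the second sequence holds one frame for longer.
a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0]])
warped_cost = soft_dtw(a, b)
```

In this example the two sequences differ only in timing, so the warped cost stays close to zero, whereas a frame-by-frame comparison would not even be defined for sequences of different length. That tolerance to timing differences is the reason to use such a loss when generated and ground-truth audio need not be aligned exactly.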

Authors: Jeff Donahue, Sander Dieleman, Mikołaj Bińkowski, Erich Elsen, Karen Simonyan

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher

Other Videos By Yannic Kilcher

2020-06-20	Big Self-Supervised Models are Strong Semi-Supervised Learners (Paper Explained)
2020-06-19	On the Measure of Intelligence by François Chollet - Part 2: Human Priors (Paper Explained)
2020-06-18	Image GPT: Generative Pretraining from Pixels (Paper Explained)
2020-06-17	BYOL: Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (Paper Explained)
2020-06-16	TUNIT: Rethinking the Truly Unsupervised Image-to-Image Translation (Paper Explained)
2020-06-15	A bio-inspired bistable recurrent cell allows for long-lasting memory (Paper Explained)
2020-06-14	SynFlow: Pruning neural networks without any data by iteratively conserving synaptic flow
2020-06-13	Deep Differential System Stability - Learning advanced computations from examples (Paper Explained)
2020-06-12	VirTex: Learning Visual Representations from Textual Annotations (Paper Explained)
2020-06-11	Linformer: Self-Attention with Linear Complexity (Paper Explained)
2020-06-10	End-to-End Adversarial Text-to-Speech (Paper Explained)
2020-06-09	TransCoder: Unsupervised Translation of Programming Languages (Paper Explained)
2020-06-08	JOIN ME for the NeurIPS 2020 Flatland Multi-Agent RL Challenge!
2020-06-07	BLEURT: Learning Robust Metrics for Text Generation (Paper Explained)
2020-06-06	Synthetic Petri Dish: A Novel Surrogate Model for Rapid Architecture Search (Paper Explained)
2020-06-05	CornerNet: Detecting Objects as Paired Keypoints (Paper Explained)
2020-06-04	Movement Pruning: Adaptive Sparsity by Fine-Tuning (Paper Explained)
2020-06-03	Learning To Classify Images Without Labels (Paper Explained)
2020-06-02	On the Measure of Intelligence by François Chollet - Part 1: Foundations (Paper Explained)
2020-06-01	Dynamics-Aware Unsupervised Discovery of Skills (Paper Explained)
2020-05-31	Synthesizer: Rethinking Self-Attention in Transformer Models (Paper Explained)

Tags:

deep learning, machine learning, arxiv, explained, neural networks, artificial intelligence, paper, tts, text-to-speech, aligner, convolutions, spectrogram, mel, alignment, phonemes, deepmind, deep mind, dynamic time warping, gaussian kernel, adversarial, gan, discriminator, tokens, sound wave, speech
