Synthesizer: Rethinking Self-Attention in Transformer Models (Paper Explained)

Subscribers: 284,000
Video Link: https://www.youtube.com/watch?v=q7QP_lfqnQM



Duration: 48:21
Views: 15,899


Do we really need dot-product attention? Dot-product attention is the centerpiece of modern Transformers: every token computes query-key interactions with every other token. This paper removes these quadratic interaction terms and instead synthesizes the attention weights directly, yielding a new model, the Synthesizer. As it turns out, you can do pretty well like that!
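
To make the idea concrete, here is a minimal sketch (PyTorch-style Python) of the "dense" synthetic attention discussed in the video: each token predicts its own row of attention logits from its representation alone, so no query-key dot products are computed. Class and argument names like DenseSynthesizer, d_model and max_len are my own illustration, not the paper's code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizer(nn.Module):
    # Rough single-head sketch of Dense Synthesizer attention.
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.f1 = nn.Linear(d_model, d_model)      # token-wise hidden projection
        self.f2 = nn.Linear(d_model, max_len)      # each token -> a row of attention logits
        self.value = nn.Linear(d_model, d_model)   # value projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), with seq_len <= max_len
        seq_len = x.size(1)
        logits = self.f2(F.relu(self.f1(x)))[:, :, :seq_len]  # (batch, seq_len, seq_len)
        attn = F.softmax(logits, dim=-1)           # synthetic attention weights
        return attn @ self.value(x)                # route values, no token-token interaction

A call like DenseSynthesizer(64, 128)(torch.randn(2, 20, 64)) returns a (2, 20, 64) tensor, just as a standard self-attention layer would.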

OUTLINE:
0:00 - Intro & High Level Overview
1:00 - Abstract
2:30 - Attention Mechanism as Information Routing
5:45 - Dot Product Attention
8:05 - Dense Synthetic Attention
15:00 - Random Synthetic Attention
17:15 - Comparison to Feed-Forward Layers
22:00 - Factorization & Mixtures
23:10 - Number of Parameters
25:35 - Machine Translation & Language Modeling Experiments
36:15 - Summarization & Dialogue Generation Experiments
37:15 - GLUE & SuperGLUE Experiments
42:00 - Weight Sizes & Number of Head Ablations
47:05 - Conclusion

Paper: https://arxiv.org/abs/2005.00743
My Video on Transformers (Attention Is All You Need): https://youtu.be/iDulhoQ2pro
My Video on BERT: https://youtu.be/-9evrZnBorM

Abstract:
The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is not that important after all. To this end, we propose Synthesizer, a model that learns synthetic attention weights without token-token interactions. Our experimental results show that Synthesizer is competitive against vanilla Transformer models across a range of tasks, including MT (EnDe, EnFr), language modeling (LM1B), abstractive summarization (CNN/Dailymail), dialogue generation (PersonaChat) and Multi-task language understanding (GLUE, SuperGLUE).
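
As a companion to point (1) of the abstract, here is a similarly hedged sketch of the "random" variant: the attention logits are a single learned (or even frozen) max_len x max_len parameter matrix, shared by every input. Again, the names are illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomSynthesizer(nn.Module):
    # Rough single-head sketch of Random Synthesizer attention.
    def __init__(self, d_model: int, max_len: int, trainable: bool = True):
        super().__init__()
        # One global matrix of attention logits, independent of the input;
        # it can be trained or left fixed at its random initialization.
        self.logits = nn.Parameter(torch.randn(max_len, max_len), requires_grad=trainable)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), with seq_len <= max_len
        seq_len = x.size(1)
        attn = F.softmax(self.logits[:seq_len, :seq_len], dim=-1)  # same weights for all inputs
        return attn @ self.value(x)  # 2D attention matrix broadcasts over the batch dimension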

Authors: Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher




Other Videos By Yannic Kilcher


2020-06-10 End-to-End Adversarial Text-to-Speech (Paper Explained)
2020-06-09 TransCoder: Unsupervised Translation of Programming Languages (Paper Explained)
2020-06-08 JOIN ME for the NeurIPS 2020 Flatland Multi-Agent RL Challenge!
2020-06-07 BLEURT: Learning Robust Metrics for Text Generation (Paper Explained)
2020-06-06 Synthetic Petri Dish: A Novel Surrogate Model for Rapid Architecture Search (Paper Explained)
2020-06-05 CornerNet: Detecting Objects as Paired Keypoints (Paper Explained)
2020-06-04 Movement Pruning: Adaptive Sparsity by Fine-Tuning (Paper Explained)
2020-06-03 Learning To Classify Images Without Labels (Paper Explained)
2020-06-02 On the Measure of Intelligence by François Chollet - Part 1: Foundations (Paper Explained)
2020-06-01 Dynamics-Aware Unsupervised Discovery of Skills (Paper Explained)
2020-05-31 Synthesizer: Rethinking Self-Attention in Transformer Models (Paper Explained)
2020-05-30 [Code] How to use Facebook's DETR object detection algorithm in Python (Full Tutorial)
2020-05-29 GPT-3: Language Models are Few-Shot Learners (Paper Explained)
2020-05-28 DETR: End-to-End Object Detection with Transformers (Paper Explained)
2020-05-27 mixup: Beyond Empirical Risk Minimization (Paper Explained)
2020-05-26 A critical analysis of self-supervision, or what we can learn from a single image (Paper Explained)
2020-05-25 Deep image reconstruction from human brain activity (Paper Explained)
2020-05-24 Regularizing Trajectory Optimization with Denoising Autoencoders (Paper Explained)
2020-05-23 [News] The NeurIPS Broader Impact Statement
2020-05-22 When BERT Plays the Lottery, All Tickets Are Winning (Paper Explained)
2020-05-21 [News] OpenAI Model Generates Python Code



Tags:
deep learning
machine learning
arxiv
explained
neural networks
ai
artificial intelligence
paper
nlp
natural language processing
machine translation
google
attention mechanism
attention
transformer
seq2seq
bert
memory
lsh
locality sensitive hashing
reversible
revertible
flow
long sequence