Synthesizer: Rethinking Self-Attention in Transformer Models (Paper Explained)

Subscribers: 284,000
Video Link: https://www.youtube.com/watch?v=q7QP_lfqnQM



Duration: 48:21
Views: 15,899


Do we really need dot-product attention? Dot-product attention is the centerpiece of modern Transformers: every token computes query-key interactions with every other token. This paper removes these quadratic interaction terms and instead synthesizes the attention weights directly, yielding a new model, the Synthesizer. As it turns out, you can do pretty well like that!
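
To make the idea concrete, here is a minimal sketch (PyTorch-style Python) of the "dense" synthetic attention discussed in the video: each token predicts its own row of attention logits from its representation alone, so no query-key dot products are computed. Class and argument names like DenseSynthesizer, d_model and max_len are my own illustration, not the paper's code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizer(nn.Module):
    # Rough single-head sketch of Dense Synthesizer attention.
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.f1 = nn.Linear(d_model, d_model)      # token-wise hidden projection
        self.f2 = nn.Linear(d_model, max_len)      # each token -> a row of attention logits
        self.value = nn.Linear(d_model, d_model)   # value projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), with seq_len <= max_len
        seq_len = x.size(1)
        logits = self.f2(F.relu(self.f1(x)))[:, :, :seq_len]  # (batch, seq_len, seq_len)
        attn = F.softmax(logits, dim=-1)           # synthetic attention weights
        return attn @ self.value(x)                # route values, no token-token interaction

A call like DenseSynthesizer(64, 128)(torch.randn(2, 20, 64)) returns a (2, 20, 64) tensor, just as a standard self-attention layer would.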

OUTLINE:
0:00 - Intro & High Level Overview
1:00 - Abstract
2:30 - Attention Mechanism as Information Routing
5:45 - Dot Product Attention
8:05 - Dense Synthetic Attention
15:00 - Random Synthetic Attention
17:15 - Comparison to Feed-Forward Layers
22:00 - Factorization & Mixtures
23:10 - Number of Parameters
25:35 - Machine Translation & Language Modeling Experiments
36:15 - Summarization & Dialogue Generation Experiments
37:15 - GLUE & SuperGLUE Experiments
42:00 - Weight Sizes & Number of Head Ablations
47:05 - Conclusion

Paper: https://arxiv.org/abs/2005.00743
My Video on Transformers (Attention Is All You Need): https://youtu.be/iDulhoQ2pro
My Video on BERT: https://youtu.be/-9evrZnBorM

Abstract:
The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism on the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is not that important after all. To this end, we propose Synthesizer, a model that learns synthetic attention weights without token-token interactions. Our experimental results show that Synthesizer is competitive against vanilla Transformer models across a range of tasks, including MT (EnDe, EnFr), language modeling (LM1B), abstractive summarization (CNN/Dailymail), dialogue generation (PersonaChat) and Multi-task language understanding (GLUE, SuperGLUE).
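
As a companion to point (1) of the abstract, here is a similarly hedged sketch of the "random" variant: the attention logits are a single learned (or even frozen) max_len x max_len parameter matrix, shared by every input. Again, the names are illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomSynthesizer(nn.Module):
    # Rough single-head sketch of Random Synthesizer attention.
    def __init__(self, d_model: int, max_len: int, trainable: bool = True):
        super().__init__()
        # One global matrix of attention logits, independent of the input;
        # it can be trained or left fixed at its random initialization.
        self.logits = nn.Parameter(torch.randn(max_len, max_len), requires_grad=trainable)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), with seq_len <= max_len
        seq_len = x.size(1)
        attn = F.softmax(self.logits[:seq_len, :seq_len], dim=-1)  # same weights for all inputs
        return attn @ self.value(x)  # 2D attention matrix broadcasts over the batch dimension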

Authors: Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher




Other Videos By Yannic Kilcher


2020-06-10 End-to-End Adversarial Text-to-Speech (Paper Explained)
2020-06-09 TransCoder: Unsupervised Translation of Programming Languages (Paper Explained)
2020-06-08 JOIN ME for the NeurIPS 2020 Flatland Multi-Agent RL Challenge!
2020-06-07 BLEURT: Learning Robust Metrics for Text Generation (Paper Explained)
2020-06-06 Synthetic Petri Dish: A Novel Surrogate Model for Rapid Architecture Search (Paper Explained)
2020-06-05 CornerNet: Detecting Objects as Paired Keypoints (Paper Explained)
2020-06-04 Movement Pruning: Adaptive Sparsity by Fine-Tuning (Paper Explained)
2020-06-03 Learning To Classify Images Without Labels (Paper Explained)
2020-06-02 On the Measure of Intelligence by François Chollet - Part 1: Foundations (Paper Explained)
2020-06-01 Dynamics-Aware Unsupervised Discovery of Skills (Paper Explained)
2020-05-31 Synthesizer: Rethinking Self-Attention in Transformer Models (Paper Explained)
2020-05-30 [Code] How to use Facebook's DETR object detection algorithm in Python (Full Tutorial)
2020-05-29 GPT-3: Language Models are Few-Shot Learners (Paper Explained)
2020-05-28 DETR: End-to-End Object Detection with Transformers (Paper Explained)
2020-05-27 mixup: Beyond Empirical Risk Minimization (Paper Explained)
2020-05-26 A critical analysis of self-supervision, or what we can learn from a single image (Paper Explained)
2020-05-25 Deep image reconstruction from human brain activity (Paper Explained)
2020-05-24 Regularizing Trajectory Optimization with Denoising Autoencoders (Paper Explained)
2020-05-23 [News] The NeurIPS Broader Impact Statement
2020-05-22 When BERT Plays the Lottery, All Tickets Are Winning (Paper Explained)
2020-05-21 [News] OpenAI Model Generates Python Code



Tags:
deep learning
machine learning
arxiv
explained
neural networks
ai
artificial intelligence
paper
nlp
natural language processing
machine translation
google
attention mechanism
attention
transformer
seq2seq
bert
memory
lsh
locality sensitive hashing
reversible
revertible
flow
long sequence