Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (Paper Explained)

Subscribers: 284,000
Published on: 2020-07-04
Video Link: https://www.youtube.com/watch?v=hAooAOFRsYc
Duration: 48:06
Views: 23,602
Likes: 814


#ai #attention #transformer #deeplearning

Transformers are famous for two things: their superior performance and their insane compute and memory requirements. This paper reformulates the attention mechanism in terms of kernel functions and obtains a linear formulation, which reduces these requirements. Surprisingly, this formulation also surfaces an interesting connection between autoregressive transformers and RNNs.
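
As a rough illustration of that reformulation (a minimal NumPy sketch, not the authors' implementation - their code is linked below), here is non-causal linear attention with the phi(x) = elu(x) + 1 feature map the paper proposes. Computing phi(K)^T V first, thanks to associativity, makes the whole computation linear in the sequence length:

import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1 keeps the similarity scores positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Q, K: (N, d_k), V: (N, d_v)
    # attention = phi(Q) (phi(K)^T V) / (phi(Q) sum_j phi(k_j))
    Qp, Kp = elu_feature_map(Q), elu_feature_map(K)
    KV = Kp.T @ V                     # (d_k, d_v), cost O(N * d_k * d_v)
    Z = Qp @ Kp.sum(axis=0)           # (N,) normalizers
    return (Qp @ KV) / Z[:, None]     # (N, d_v)

Q, K, V = (np.random.randn(1024, 64) for _ in range(3))
out = linear_attention(Q, K, V)       # (1024, 64); the N x N attention matrix is never built

Because the N x N softmax matrix is never materialized, memory use also drops from quadratic to linear in the sequence length.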

OUTLINE:
0:00 - Intro & Overview
1:35 - Softmax Attention & Transformers
8:40 - Quadratic Complexity of Softmax Attention
9:40 - Generalized Attention Mechanism
13:45 - Kernels
20:40 - Linear Attention
25:20 - Experiments
28:30 - Intuition on Linear Attention
33:55 - Connecting Autoregressive Transformers and RNNs
41:30 - Caveats with the RNN connection
46:00 - More Results & Conclusion

Paper: https://arxiv.org/abs/2006.16236
Website: https://linear-transformers.com/
Code: https://github.com/idiap/fast-transformers

My Video on Attention: https://youtu.be/iDulhoQ2pro
My Video on BERT: https://youtu.be/-9evrZnBorM

Abstract:
Transformers achieve remarkable performance in several tasks but, due to their quadratic complexity with respect to the input's length, they are prohibitively slow for very long sequences. To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from O(N²) to O(N), where N is the sequence length. We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks. Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.
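
The "iterative implementation" mentioned in the abstract is where the RNN connection comes from: with a causal mask, linear attention only needs two running sums over the prefix, and these behave exactly like an RNN hidden state. A hedged sketch (illustrative names, same elu(x) + 1 feature map as above):

import numpy as np

def elu_feature_map(x):
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_step(q, k, v, S, z):
    # S accumulates sum_i phi(k_i) v_i^T, z accumulates sum_i phi(k_i);
    # together they act as the hidden state of the equivalent RNN.
    kp = elu_feature_map(k)
    S = S + np.outer(kp, v)           # (d_k, d_v)
    z = z + kp                        # (d_k,)
    qp = elu_feature_map(q)
    return (qp @ S) / (qp @ z), S, z  # output for this position, O(1) per token

d_k, d_v = 64, 64
S, z = np.zeros((d_k, d_v)), np.zeros(d_k)
for t in range(16):                   # autoregressive loop: constant cost per new token
    q, k, v = np.random.randn(d_k), np.random.randn(d_k), np.random.randn(d_v)
    out, S, z = causal_step(q, k, v, S, z)

This constant per-token cost, instead of re-attending over the entire history at every step, is where the claimed speedups for autoregressive prediction of very long sequences come from.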

Authors: Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher




Other Videos By Yannic Kilcher


2020-07-16 [Classic] Word2Vec: Distributed Representations of Words and Phrases and their Compositionality
2020-07-14 [Classic] Deep Residual Learning for Image Recognition (Paper Explained)
2020-07-12 I'M TAKING A BREAK... (Channel Update July 2020)
2020-07-11 Deep Ensembles: A Loss Landscape Perspective (Paper Explained)
2020-07-10 Gradient Origin Networks (Paper Explained w/ Live Coding)
2020-07-09 NVAE: A Deep Hierarchical Variational Autoencoder (Paper Explained)
2020-07-08 Addendum for Supermasks in Superposition: A Closer Look (Paper Explained)
2020-07-07 SupSup: Supermasks in Superposition (Paper Explained)
2020-07-06 [Live Machine Learning Research] Plain Self-Ensembles (I actually DISCOVER SOMETHING) - Part 1
2020-07-05 SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization (Paper Explained)
2020-07-04 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (Paper Explained)
2020-07-03 On the Measure of Intelligence by François Chollet - Part 4: The ARC Challenge (Paper Explained)
2020-07-02 BERTology Meets Biology: Interpreting Attention in Protein Language Models (Paper Explained)
2020-07-01 GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (Paper Explained)
2020-06-30 Object-Centric Learning with Slot Attention (Paper Explained)
2020-06-29 Set Distribution Networks: a Generative Model for Sets of Images (Paper Explained)
2020-06-28 Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection (Paper Explained)
2020-06-27 Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures (Paper Explained)
2020-06-26 On the Measure of Intelligence by François Chollet - Part 3: The Math (Paper Explained)
2020-06-25 Discovering Symbolic Models from Deep Learning with Inductive Biases (Paper Explained)
2020-06-24 How I Read a Paper: Facebook's DETR (Video Tutorial)



Tags:
deep learning
machine learning
arxiv
explained
neural networks
ai
artificial intelligence
paper
nlp
natural language processing
attention
attention mechanism
linear
linear transformer
linformer
reformer
idiap
epfl
queries
keys
softmax
kernel
routing
inner product
rnn
recurrent neural network
transformer
bert
autoregressive
dimensions
topic modeling
language model