Rethinking Attention with Performers (Paper Explained)

Channel: Yannic Kilcher
Subscribers: 300,000

Published on October 26, 2020 4:57:35 PM ● Video Link: https://www.youtube.com/watch?v=xJrKIPwVwGM

Duration: 54:39

52,464 views

1,733

#ai #research #attention

Transformers have huge memory and compute requirements because they construct an Attention matrix, which grows quadratically in the size of the input. The Performer is a model that uses random positive orthogonal features to construct an unbiased estimator of the Attention matrix and obtains an arbitrarily good approximation in linear time! The method generalizes beyond attention and opens the door to the next generation of deep learning architectures.
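To make the linear-time idea concrete, here is a minimal NumPy sketch of attention computed through positive random features. It is not the code released with the paper: the function names (`positive_random_features`, `linear_attention`), the sizes, and the input scaling are illustrative assumptions, and it uses plain Gaussian projections rather than the orthogonal ones the Performer adds to lower the variance.

```python
import numpy as np

def positive_random_features(x, w):
    """Map rows of x (n, d) to strictly positive features (n, m).

    w holds m random projection vectors of shape (m, d). Plain Gaussian
    rows are assumed here; orthogonalizing them is omitted for brevity.
    """
    m = w.shape[0]
    half_sq_norm = 0.5 * np.sum(x * x, axis=-1, keepdims=True)
    return np.exp(x @ w.T - half_sq_norm) / np.sqrt(m)

def softmax_attention(q, k, v):
    """Exact attention. Builds the full (n, n) matrix, so cost is quadratic in n."""
    d = q.shape[-1]
    scores = np.exp(q @ k.T / np.sqrt(d))
    return (scores / scores.sum(axis=-1, keepdims=True)) @ v

def linear_attention(q, k, v, w):
    """Approximate attention without ever forming the (n, n) matrix."""
    d = q.shape[-1]
    scale = d ** -0.25  # splits the usual 1/sqrt(d) evenly between q and k
    q_feat = positive_random_features(q * scale, w)  # (n, m)
    k_feat = positive_random_features(k * scale, w)  # (n, m)
    kv = k_feat.T @ v                                # (m, d_v), linear in n
    normalizer = q_feat @ k_feat.sum(axis=0)         # (n,), approximate row sums
    return (q_feat @ kv) / normalizer[:, None]

# Hypothetical sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
n, d, m = 64, 16, 4096
q = 0.5 * rng.standard_normal((n, d))
k = 0.5 * rng.standard_normal((n, d))
v = rng.standard_normal((n, d))
w = rng.standard_normal((m, d))

exact = softmax_attention(q, k, v)
approx = linear_attention(q, k, v, w)
max_abs_error = float(np.max(np.abs(exact - approx)))
print("max abs error:", max_abs_error)
```

The key step is the order of multiplication: `k_feat.T @ v` is computed first, so the largest intermediate is (m, d_v) rather than (n, n). Because every feature is positive, each output row stays a convex combination of the rows of `v`, which is what keeps the normalizer away from zero.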

OUTLINE:
0:00 - Intro & Outline
6:15 - Quadratic Bottleneck in Attention Mechanisms
10:00 - Decomposing the Attention Matrix
15:30 - Approximating the Softmax Kernel
24:45 - Different Choices, Different Kernels
28:00 - Why the Naive Approach does not work!
31:30 - Better Approximation via Positive Features
36:55 - Positive Features are Infinitely Better
40:10 - Orthogonal Features are Even Better
43:25 - Experiments
49:20 - Broader Impact Statement
50:00 - Causal Attention via Prefix Sums
52:10 - Code
53:50 - Final Remarks & Conclusion

Paper: https://arxiv.org/abs/2009.14794
Code: https://github.com/google-research/google-research/tree/master/performer
Blog: https://ai.googleblog.com/2020/10/rethinking-attention-with-performers.html

Kernels on ML Street Talk: https://www.youtube.com/watch?v=y_RjsDHl5Y4
My Video on Linformer: https://www.youtube.com/watch?v=-_2AF9Lhweo
My Video on Reformer: https://www.youtube.com/watch?v=i4H0kjxrias
My Video on Attention: https://www.youtube.com/watch?v=iDulhoQ2pro

Abstract:
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
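For readers who want the estimator behind "unbiased estimation of the attention matrix" spelled out, the following is a sketch of the standard positive-random-feature identity for the softmax kernel. The symbols (x, y, ω, φ, m, L, Q', K', D) are notation chosen here and may not match the paper's exactly.

```latex
% Softmax kernel as an expectation over Gaussian projections, \omega \sim \mathcal{N}(0, I_d):
\exp(x^\top y)
  = \mathbb{E}_{\omega}\!\left[
      \exp\!\left(\omega^\top x - \tfrac{\|x\|^2}{2}\right)
      \exp\!\left(\omega^\top y - \tfrac{\|y\|^2}{2}\right)
    \right]

% Monte Carlo feature map with m samples \omega_1, \dots, \omega_m:
\phi(x) = \frac{1}{\sqrt{m}} \exp\!\left(-\tfrac{\|x\|^2}{2}\right)
          \left( \exp(\omega_1^\top x), \dots, \exp(\omega_m^\top x) \right),
\qquad
\mathbb{E}\!\left[\phi(x)^\top \phi(y)\right] = \exp(x^\top y)

% Attention over L tokens, with Q' = \phi(Q / d^{1/4}) and K' = \phi(K / d^{1/4}) applied row-wise:
\mathrm{Att}(Q, K, V) \approx D^{-1}\, Q' \left( K'^\top V \right),
\qquad
D = \mathrm{diag}\!\left( Q' \left( K'^\top \mathbf{1}_L \right) \right)
```

The identity follows from the Gaussian moment generating function, E[exp(ω^T z)] = exp(||z||^2 / 2), applied to z = x + y. Evaluating K'^T V before multiplying by Q' costs O(L m d) rather than the O(L^2 d) needed to form the full attention matrix.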

Authors: Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Other Videos By Yannic Kilcher

2020-12-26	Extracting Training Data from Large Language Models (Paper Explained)
2020-12-24	MEMES IS ALL YOU NEED - Deep Learning Meme Review - Episode 2 (Part 1 of 2)
2020-12-16	ReBeL - Combining Deep Reinforcement Learning and Search for Imperfect-Information Games (Explained)
2020-12-13	2M All-In into $5 Pot! WWYD? Daniel Negreanu's No-Limit Hold'em Challenge! (Poker Hand Analysis)
2020-12-01	DeepMind's AlphaFold 2 Explained! AI Breakthrough in Protein Folding! What we know (& what we don't)
2020-11-29	Predictive Coding Approximates Backprop along Arbitrary Computation Graphs (Paper Explained)
2020-11-22	Fourier Neural Operator for Parametric Partial Differential Equations (Paper Explained)
2020-11-15	[News] Soccer AI FAILS and mixes up ball and referee's bald head.
2020-11-10	Underspecification Presents Challenges for Credibility in Modern Machine Learning (Paper Explained)
2020-11-02	Language Models are Open Knowledge Graphs (Paper Explained)
2020-10-26	Rethinking Attention with Performers (Paper Explained)
2020-10-17	LambdaNetworks: Modeling long-range Interactions without Attention (Paper Explained)
2020-10-11	Descending through a Crowded Valley -- Benchmarking Deep Learning Optimizers (Paper Explained)
2020-10-04	An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Paper Explained)
2020-10-03	Training more effective learned optimizers, and using them to train themselves (Paper Explained)
2020-09-18	The Hardware Lottery (Paper Explained)
2020-09-13	Assessing Game Balance with AlphaZero: Exploring Alternative Rule Sets in Chess (Paper Explained)
2020-09-07	Learning to summarize from human feedback (Paper Explained)
2020-09-02	Self-classifying MNIST Digits (Paper Explained)
2020-08-28	Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation (Paper Explained)
2020-08-26	Radioactive data: tracing through training (Paper Explained)

Tags: deep learning, machine learning, arxiv, explained, neural networks, artificial intelligence, paper, nlp, natural language processing, natural language understanding, data science, transformer, attention, attention mechanism, transformers, attention is all you need, gpus, tpu, linformer, reformer, explanation, imagenet64, kernels, gaussian kernel, softmax, softmax kernel, approximation, random features, random positive features, random fourier features, google, favor, machine translation
