Big Bird: Transformers for Longer Sequences (Paper Explained)

Video Link: https://www.youtube.com/watch?v=WVPE62Gk3EM
Published: 2020-08-02
Duration: 34:30
Views: 21,559
Likes: 780
Subscribers: 284,000


#ai #nlp #attention

The quadratic resource requirements of the attention mechanism are the main roadblock to scaling up transformers to long sequences. This paper replaces the full quadratic attention mechanism with a combination of random attention, window attention, and global attention. Not only does this allow the processing of longer sequences, translating to state-of-the-art experimental results, but the paper also shows that BigBird comes with theoretical guarantees of universal approximation and Turing completeness.
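
For intuition, here is a minimal sketch (my own illustration, not the authors' code) of how the three patterns can be combined into a single boolean attention mask; the parameter names (seq_len, window, num_global, num_random) are illustrative, not the paper's:

import numpy as np

def bigbird_mask(seq_len, window=3, num_global=2, num_random=3, seed=0):
    # Boolean mask: mask[i, j] == True means query i may attend to key j.
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Window attention: each token attends to its local neighborhood.
    for i in range(seq_len):
        mask[i, max(0, i - window):min(seq_len, i + window + 1)] = True

    # Global attention: the first num_global tokens (e.g. CLS) attend to
    # everything and are attended to by everything.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # Random attention: each token attends to a few extra random tokens.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask

# Each row allows only about (2*window + 1) + num_global + num_random keys,
# so the total number of attended pairs grows linearly with seq_len.
print(bigbird_mask(16).sum(axis=1))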

OUTLINE:
0:00 - Intro & Overview
1:50 - Quadratic Memory in Full Attention
4:55 - Architecture Overview
6:35 - Random Attention
10:10 - Window Attention
13:45 - Global Attention
15:40 - Architecture Summary
17:10 - Theoretical Result
22:00 - Experimental Parameters
25:35 - Structured Block Computations
29:30 - Recap
31:50 - Experimental Results
34:05 - Conclusion

Paper: https://arxiv.org/abs/2007.14062

My Video on Attention: https://youtu.be/iDulhoQ2pro
My Video on BERT: https://youtu.be/-9evrZnBorM
My Video on Longformer: https://youtu.be/_8KNb5iqblE
... and its memory requirements: https://youtu.be/gJR28onlqzs

Abstract:
Transformer-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS) that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
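
To make the quadratic-to-linear claim concrete, here is a back-of-the-envelope sketch (my own, with made-up pattern sizes) counting how many attention scores each scheme has to store per sequence; the real model uses block-structured patterns (see 25:35 in the video) rather than per-token bookkeeping:

def full_attention_entries(n):
    return n * n                      # one score per query-key pair: O(n^2)

def sparse_attention_entries(n, window=3, num_global=2, num_random=3):
    per_query = (2 * window + 1) + num_global + num_random
    return n * per_query              # a constant number of scores per query: O(n)

for n in (512, 4096):                 # BERT-style length vs. a ~8x longer context
    print(n, full_attention_entries(n), sparse_attention_entries(n))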

Authors: Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n




Other Videos By Yannic Kilcher


2020-08-28 Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation (Paper Explained)
2020-08-26 Radioactive data: tracing through training (Paper Explained)
2020-08-23 Fast reinforcement learning with generalized policy updates (Paper Explained)
2020-08-20 What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study (Paper Explained)
2020-08-18 [Rant] REVIEWER #2: How Peer Review is FAILING in Machine Learning
2020-08-14 REALM: Retrieval-Augmented Language Model Pre-Training (Paper Explained)
2020-08-12 Meta-Learning through Hebbian Plasticity in Random Networks (Paper Explained)
2020-08-09 Hopfield Networks is All You Need (Paper Explained)
2020-08-06 I TRAINED AN AI TO SOLVE 2+2 (w/ Live Coding)
2020-08-04 PCGRL: Procedural Content Generation via Reinforcement Learning (Paper Explained)
2020-08-02 Big Bird: Transformers for Longer Sequences (Paper Explained)
2020-07-29 Self-training with Noisy Student improves ImageNet classification (Paper Explained)
2020-07-26 [Classic] Playing Atari with Deep Reinforcement Learning (Paper Explained)
2020-07-23 [Classic] ImageNet Classification with Deep Convolutional Neural Networks (Paper Explained)
2020-07-21 Neural Architecture Search without Training (Paper Explained)
2020-07-19 [Classic] Generative Adversarial Networks (Paper Explained)
2020-07-16 [Classic] Word2Vec: Distributed Representations of Words and Phrases and their Compositionality
2020-07-14 [Classic] Deep Residual Learning for Image Recognition (Paper Explained)
2020-07-12 I'M TAKING A BREAK... (Channel Update July 2020)
2020-07-11 Deep Ensembles: A Loss Landscape Perspective (Paper Explained)
2020-07-10 Gradient Origin Networks (Paper Explained w/ Live Coding)



Tags:
deep learning
machine learning
arxiv
explained
neural networks
ai
artificial intelligence
paper
google
google research
bigbird
big bird
bert
attention
attention is all you need
longformer
random attention
quadratic attention
attention mechanism
qa
natural questions
hotpot qa
genomics
nlp
natural language processing
transformer
transformers
fully connected
sparse attention
graph
star graph
turing complete
universal approximation
window attention
convolution