Linformer: Self-Attention with Linear Complexity (Paper Explained)

Subscribers: 284,000
Video Link: https://www.youtube.com/watch?v=-_2AF9Lhweo



Duration: 50:24
27,566 views


Transformers are notoriously resource-intensive because their self-attention mechanism requires memory and compute that grow quadratically with the length of the input sequence. The Linformer model gets around this by exploiting the fact that the actual information in the attention matrix is often of low rank and can therefore be approximated.
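The core trick discussed in the video can be sketched in a few lines: project the keys and values along the sequence axis from length n down to a fixed k, so the attention matrix is n×k instead of n×n. Below is a minimal single-head NumPy sketch; the projection matrices E and F are random here (the paper learns them), and the Q/K/V inputs stand in for the usual learned linear maps:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Linformer-style attention: project the sequence dimension of K and V
    from n down to k, so the score matrix is (n, k) rather than (n, n)."""
    K_proj = E @ K                        # (k, d): keys projected along sequence axis
    V_proj = F @ V                        # (k, d): values projected along sequence axis
    d = Q.shape[-1]
    scores = Q @ K_proj.T / np.sqrt(d)   # (n, k) instead of (n, n)
    return softmax(scores, axis=-1) @ V_proj  # (n, d), linear in n

n, d, k = 512, 64, 32
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
# fixed random projections for illustration; the paper learns E and F
E = rng.standard_normal((k, n)) / np.sqrt(n)
F = rng.standard_normal((k, n)) / np.sqrt(n)
out = linformer_attention(Q, K, V, E, F)
print(out.shape)  # (512, 64)
```

Note that memory and compute now scale with n·k rather than n², which is the whole point: k is a fixed constant chosen independently of sequence length.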

OUTLINE:
0:00 - Intro & Overview
1:40 - The Complexity of Self-Attention
4:50 - Embedding Dimension & Multiple Heads
8:45 - Formal Attention
10:30 - Empirical Investigation into RoBERTa
20:00 - Theorem: Self-Attention is Low Rank
28:10 - Linear Self-Attention Method
36:15 - Theorem: Linear Self-Attention
44:10 - Language Modeling
46:40 - NLP Benchmarks
47:50 - Compute Time & Memory Gains
48:20 - Broader Impact Statement
49:55 - Conclusion

Paper: https://arxiv.org/abs/2006.04768

Abstract:
Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences, as the standard self-attention mechanism of the Transformer uses O(n²) time and space with respect to sequence length. In this paper, we demonstrate that the self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from O(n²) to O(n) in both time and space. The resulting linear transformer, the Linformer, performs on par with standard Transformer models, while being much more memory- and time-efficient.

Authors: Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher




Other Videos By Yannic Kilcher


2020-06-21 SIREN: Implicit Neural Representations with Periodic Activation Functions (Paper Explained)
2020-06-20 Big Self-Supervised Models are Strong Semi-Supervised Learners (Paper Explained)
2020-06-19 On the Measure of Intelligence by François Chollet - Part 2: Human Priors (Paper Explained)
2020-06-18 Image GPT: Generative Pretraining from Pixels (Paper Explained)
2020-06-17 BYOL: Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (Paper Explained)
2020-06-16 TUNIT: Rethinking the Truly Unsupervised Image-to-Image Translation (Paper Explained)
2020-06-15 A bio-inspired bistable recurrent cell allows for long-lasting memory (Paper Explained)
2020-06-14 SynFlow: Pruning neural networks without any data by iteratively conserving synaptic flow
2020-06-13 Deep Differential System Stability - Learning advanced computations from examples (Paper Explained)
2020-06-12 VirTex: Learning Visual Representations from Textual Annotations (Paper Explained)
2020-06-11 Linformer: Self-Attention with Linear Complexity (Paper Explained)
2020-06-10 End-to-End Adversarial Text-to-Speech (Paper Explained)
2020-06-09 TransCoder: Unsupervised Translation of Programming Languages (Paper Explained)
2020-06-08 JOIN ME for the NeurIPS 2020 Flatland Multi-Agent RL Challenge!
2020-06-07 BLEURT: Learning Robust Metrics for Text Generation (Paper Explained)
2020-06-06 Synthetic Petri Dish: A Novel Surrogate Model for Rapid Architecture Search (Paper Explained)
2020-06-05 CornerNet: Detecting Objects as Paired Keypoints (Paper Explained)
2020-06-04 Movement Pruning: Adaptive Sparsity by Fine-Tuning (Paper Explained)
2020-06-03 Learning To Classify Images Without Labels (Paper Explained)
2020-06-02 On the Measure of Intelligence by François Chollet - Part 1: Foundations (Paper Explained)
2020-06-01 Dynamics-Aware Unsupervised Discovery of Skills (Paper Explained)



Tags:
deep learning
machine learning
arxiv
explained
neural networks
ai
artificial intelligence
paper
facebook
linear
quadratic
transformer
attention
self-attention
multi-head attention
t2t
vaswani
bert
devlin
roberta
glue
language modeling
perplexity
dot product
johnson
lindenstrauss
random projection