Linear Transformers Are Secretly Fast Weight Memory Systems (Machine Learning Paper Explained)

Video Link: https://www.youtube.com/watch?v=RSSVWpBak6s
Published: 2021-02-26
Duration: 51:38
Views: 17,118
Likes: 519
Subscribers: 284,000

#fastweights #deeplearning #transformers

Transformers dominate deep learning, but their quadratic memory and compute requirements make them expensive to train and hard to use. Many papers have attempted to linearize the core module, the attention mechanism, using kernel feature maps; the Performer is one example. However, such methods are either unsatisfactory in practice or have other downsides, such as a reliance on random features. This paper establishes an intrinsic connection between linearized (kernel) attention and the much older Fast Weight Memory Systems, popularized in part by Jürgen Schmidhuber in the 1990s. It shows the fundamental capacity limitations of these algorithms and proposes new update rules and a new kernel to address them. The resulting model compares favorably to the Performer on key synthetic experiments and real-world tasks.
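The equivalence the paper builds on can be sketched in a few lines: causal linear attention can be computed as a "fast weight" matrix W that accumulates outer products of values and kernelized keys, and is then read out with the kernelized query. A minimal NumPy sketch (the `elu_plus_one` feature map is just one common positive kernel choice, not necessarily the paper's):

```python
import numpy as np

def elu_plus_one(x):
    # A common positive feature map used for linear attention (illustrative choice).
    return np.where(x > 0, x + 1.0, np.exp(x))

def fast_weight_attention(Q, K, V, phi=elu_plus_one):
    """Causal linear attention computed as a fast weight memory.

    At each step t the memory W accumulates the outer product v_t phi(k_t)^T
    (a Hebbian-style additive write); the output is the normalized retrieval
    W phi(q_t). This is exactly linearized attention, in O(T) time and O(1)
    memory in the sequence length.
    """
    T, d_k = Q.shape
    d_v = V.shape[1]
    W = np.zeros((d_v, d_k))   # fast weight matrix (the "memory")
    z = np.zeros(d_k)          # accumulator for the attention normalizer
    out = np.zeros((T, d_v))
    for t in range(T):
        k, q, v = phi(K[t]), phi(Q[t]), V[t]
        W += np.outer(v, k)                 # write step
        z += k
        out[t] = (W @ q) / (z @ q + 1e-9)   # read step
    return out
```

Because the write is purely additive, old associations are never removed, which is precisely the capacity problem the paper then analyzes.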

OUTLINE:
0:00 - Intro & Overview
1:40 - Fast Weight Systems
7:00 - Distributed Storage of Symbolic Values
12:30 - Autoregressive Attention Mechanisms
18:50 - Connecting Fast Weights to Attention Mechanism
22:00 - Softmax as a Kernel Method (Performer)
25:45 - Linear Attention as Fast Weights
27:50 - Capacity Limitations of Linear Attention
29:45 - Synthetic Data Experimental Setup
31:50 - Improving the Update Rule
37:30 - Deterministic Parameter-Free Projection (DPFP) Kernel
46:15 - Experimental Results
50:50 - Conclusion & Comments

Paper: https://arxiv.org/abs/2102.11174
Code: https://github.com/ischlag/fast-weight-transformers
Machine Learning Street Talk on Kernels: https://youtu.be/y_RjsDHl5Y4

Abstract:
We show the formal equivalence of linearised self-attention mechanisms and fast weight memories from the early '90s. From this observation we infer a memory capacity limitation of recent linearised softmax attention variants. With finite memory, a desirable behaviour of fast weight memory models is to manipulate the contents of memory and dynamically interact with it. Inspired by previous work on fast weights, we propose to replace the update rule with an alternative rule yielding such behaviour. We also propose a new kernel function to linearise attention, balancing simplicity and effectiveness. We conduct experiments on synthetic retrieval problems as well as standard machine translation and language modelling tasks which demonstrate the benefits of our methods.
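The "alternative rule" the abstract refers to is a delta-rule-style write: instead of blindly adding a new association on top of whatever is already stored, the memory first retrieves the value currently associated with the key and writes only the correction. A minimal sketch of one such update, based on my reading of the paper (here `beta` is a fixed scalar write strength, whereas the paper generates it per step):

```python
import numpy as np

def delta_rule_update(W, k_feat, v, beta):
    """One delta-rule write to a fast weight memory W.

    v_old = W phi(k) is what the memory currently returns for this key;
    writing beta * (v - v_old) phi(k)^T corrects the stored entry rather
    than superimposing a new one, so finite capacity can be reused.
    """
    v_old = W @ k_feat                          # retrieve current association
    return W + beta * np.outer(v - v_old, k_feat)  # write only the difference
```

With a unit-norm key feature and `beta = 1`, a single update makes the memory return exactly the new value for that key; with `beta = 0` the memory is untouched, so `beta` interpolates between keeping and overwriting an association.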

Authors: Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/
BiliBili: https://space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n




Other Videos By Yannic Kilcher


2021-04-11 DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning
2021-04-07 PAIR AI Explorables | Is the problem in the data? Examples on Fairness, Diversity, and Bias.
2021-03-30 Machine Learning PhD Survival Guide 2021 | Advice on Topic Selection, Papers, Conferences & more!
2021-03-23 Is Google Translate Sexist? Gender Stereotypes in Statistical Machine Translation
2021-03-22 Perceiver: General Perception with Iterative Attention (Google DeepMind Research Paper Explained)
2021-03-16 Pretrained Transformers as Universal Computation Engines (Machine Learning Research Paper Explained)
2021-03-11 Yann LeCun - Self-Supervised Learning: The Dark Matter of Intelligence (FAIR Blog Post Explained)
2021-03-06 Apple or iPod??? Easy Fix for Adversarial Textual Attacks on OpenAI's CLIP Model! #Shorts
2021-03-05 Multimodal Neurons in Artificial Neural Networks (w/ OpenAI Microscope, Research Paper Explained)
2021-02-27 GLOM: How to represent part-whole hierarchies in a neural network (Geoff Hinton's Paper Explained)
2021-02-26 Linear Transformers Are Secretly Fast Weight Memory Systems (Machine Learning Paper Explained)
2021-02-25 DeBERTa: Decoding-enhanced BERT with Disentangled Attention (Machine Learning Paper Explained)
2021-02-19 Dreamer v2: Mastering Atari with Discrete World Models (Machine Learning Research Paper Explained)
2021-02-17 TransGAN: Two Transformers Can Make One Strong GAN (Machine Learning Research Paper Explained)
2021-02-14 NFNets: High-Performance Large-Scale Image Recognition Without Normalization (ML Paper Explained)
2021-02-11 Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention (AI Paper Explained)
2021-02-04 Deep Networks Are Kernel Machines (Paper Explained)
2021-02-02 Feedback Transformers: Addressing Some Limitations of Transformers with Feedback Memory (Explained)
2021-01-29 SingularityNET - A Decentralized, Open Market and Network for AIs (Whitepaper Explained)
2021-01-22 Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
2021-01-17 STOCHASTIC MEME DESCENT - Deep Learning Meme Review - Episode 2 (Part 2 of 2)



Tags:
deep learning
machine learning
arxiv
explained
neural networks
ai
artificial intelligence
fast weights
fast weights hinton
fast weights neural network
schmidhuber
jürgen schmidhuber
juergen schmidhuber
lstm transformer
performers
transformer performer
linear transformer
linear attention
linear attention transformer
autoregressive model
autoregressive transformer
transformer kernel
kernels transformer
favor performer
favor algorithm
deep learning tutorial