Feedback Transformers: Addressing Some Limitations of Transformers with Feedback Memory (Explained)

Channel: Yannic Kilcher

Subscribers: 300,000

Published on February 2, 2021 3:36:12 PM ● Video Link: https://www.youtube.com/watch?v=zdb8MM94A5c

Duration: 43:51

14,719 views

#ai #science #transformers

Autoregressive Transformers have taken over the world of language modeling (GPT-3). However, to train them, people use causal masking and sample parallelism, which means computation only happens in a feedforward manner. As a result, higher-layer information from earlier tokens, which would already be available, is not used in the lower layers of subsequent tokens, and this reduces the computational capabilities of the overall model. Feedback Transformers trade off training speed for access to these representations and demonstrate remarkable improvements in complex reasoning and long-range dependency tasks.
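
To make the causal masking mentioned above concrete, here is a minimal sketch of single-head causal self-attention in plain Python. The function names (`causal_mask`, `masked_softmax`, `causal_attention`) are made up for this illustration and there are no learned projections; it only shows that position t can attend to positions s <= t, which is what lets all positions of a layer be computed in parallel during training.

```python
import math

def causal_mask(n):
    """mask[t][s] is True when position t is allowed to attend to position s."""
    return [[s <= t for s in range(n)] for t in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over allowed positions only; masked positions get weight 0."""
    peak = max(sc for sc, ok in zip(scores, mask_row) if ok)
    exps = [math.exp(sc - peak) if ok else 0.0 for sc, ok in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Scaled dot-product attention where each position sees only itself and the past."""
    n, d = len(queries), len(queries[0])
    mask = causal_mask(n)
    outputs, weights = [], []
    for t in range(n):
        scores = [
            sum(q * k for q, k in zip(queries[t], keys[s])) / math.sqrt(d)
            for s in range(n)
        ]
        w = masked_softmax(scores, mask[t])
        weights.append(w)
        outputs.append(
            [sum(w[s] * values[s][j] for s in range(n)) for j in range(len(values[0]))]
        )
    return outputs, weights

if __name__ == "__main__":
    print(causal_mask(3))
    # [[True, False, False], [True, True, False], [True, True, True]]
```

Note that every row of the mask depends only on the layer below, never on higher layers of earlier positions. That is the restriction the video discusses.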

OUTLINE:
0:00 - Intro & Overview
1:55 - Problems of Autoregressive Processing
3:30 - Information Flow in Recurrent Neural Networks
7:15 - Information Flow in Transformers
9:10 - Solving Complex Computations with Neural Networks
16:45 - Causal Masking in Transformers
19:00 - Missing Higher Layer Information Flow
26:10 - Feedback Transformer Architecture
30:00 - Connection to Attention-RNNs
36:00 - Formal Definition
37:05 - Experimental Results
43:10 - Conclusion & Comments

Paper: https://arxiv.org/abs/2002.09402

My video on Attention: https://youtu.be/iDulhoQ2pro

ERRATA: Sometimes I say "Switch Transformer" instead of "Feedback Transformer". Forgive me :)

Abstract:
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks. Unlike recurrent neural networks, Transformers use attention to capture temporal relations while processing input tokens in parallel. While this parallelization makes them computationally efficient, it restricts the model from fully exploiting the sequential nature of the input. The representation at a given layer can only access representations from lower layers, rather than the higher level representations already available. In this work, we propose the Feedback Transformer architecture that exposes all previous representations to all future representations, meaning the lowest representation of the current timestep is formed from the highest-level abstract representation of the past. We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.

Authors: Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, Sainbayar Sukhbaatar
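
The idea in the abstract, that every layer of the current timestep can see a merged summary of all layers from past timesteps, can be sketched roughly as follows. This is a simplified toy and not the authors' implementation: the names `feedback_forward`, `attend`, and `layer_logits` are hypothetical, each "layer" is just attention over the memory plus a residual connection, and the first timestep (empty memory) simply passes its input through unchanged.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, memory):
    """Dot-product attention of one query vector over a list of memory vectors."""
    d = len(query)
    scores = [sum(q * m for q, m in zip(query, mem)) / math.sqrt(d) for mem in memory]
    weights = softmax(scores)
    return [sum(w * mem[j] for w, mem in zip(weights, memory)) for j in range(d)]

def feedback_forward(inputs, num_layers, layer_logits):
    """Process tokens strictly one at a time and return one memory vector per timestep.

    layer_logits holds one (assumed learnable) scalar per layer, including the
    input layer; their softmax weights merge all layer outputs into the memory.
    """
    assert len(layer_logits) == num_layers + 1
    mix = softmax(layer_logits)
    memory = []
    for x in inputs:                      # sequential: step t needs the memory of steps < t
        reps = [x]                        # layer-0 representation of this timestep
        for _ in range(num_layers):
            h = reps[-1]
            if memory:                    # every layer attends to the same merged memory
                context = attend(h, memory)
                h = [a + b for a, b in zip(h, context)]
            reps.append(h)
        merged = [
            sum(mix[l] * reps[l][j] for l in range(num_layers + 1))
            for j in range(len(x))
        ]
        memory.append(merged)             # includes the highest layer of this timestep
    return memory

if __name__ == "__main__":
    print(feedback_forward([[1.0, 0.0], [0.0, 1.0]], num_layers=2,
                           layer_logits=[0.0, 0.0, 0.0]))
```

Because the memory for step t must exist before step t + 1 can start, the loop over tokens cannot be run in parallel, which is the training-speed trade-off mentioned in the description.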

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/
BiliBili: https://space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Other Videos By Yannic Kilcher

2021-03-06	Apple or iPod??? Easy Fix for Adversarial Textual Attacks on OpenAI's CLIP Model! #Shorts
2021-03-05	Multimodal Neurons in Artificial Neural Networks (w/ OpenAI Microscope, Research Paper Explained)
2021-02-27	GLOM: How to represent part-whole hierarchies in a neural network (Geoff Hinton's Paper Explained)
2021-02-26	Linear Transformers Are Secretly Fast Weight Memory Systems (Machine Learning Paper Explained)
2021-02-25	DeBERTa: Decoding-enhanced BERT with Disentangled Attention (Machine Learning Paper Explained)
2021-02-19	Dreamer v2: Mastering Atari with Discrete World Models (Machine Learning Research Paper Explained)
2021-02-17	TransGAN: Two Transformers Can Make One Strong GAN (Machine Learning Research Paper Explained)
2021-02-14	NFNets: High-Performance Large-Scale Image Recognition Without Normalization (ML Paper Explained)
2021-02-11	Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention (AI Paper Explained)
2021-02-04	Deep Networks Are Kernel Machines (Paper Explained)
2021-02-02	Feedback Transformers: Addressing Some Limitations of Transformers with Feedback Memory (Explained)
2021-01-29	SingularityNET - A Decentralized, Open Market and Network for AIs (Whitepaper Explained)
2021-01-22	Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
2021-01-17	STOCHASTIC MEME DESCENT - Deep Learning Meme Review - Episode 2 (Part 2 of 2)
2021-01-12	OpenAI CLIP: ConnectingText and Images (Paper Explained)
2021-01-06	OpenAI DALL·E: Creating Images from Text (Blog Post Explained)
2020-12-26	Extracting Training Data from Large Language Models (Paper Explained)
2020-12-24	MEMES IS ALL YOU NEED - Deep Learning Meme Review - Episode 2 (Part 1 of 2)
2020-12-16	ReBeL - Combining Deep Reinforcement Learning and Search for Imperfect-Information Games (Explained)
2020-12-13	2M All-In into $5 Pot! WWYD? Daniel Negreanu's No-Limit Hold'em Challenge! (Poker Hand Analysis)
2020-12-01	DeepMind's AlphaFold 2 Explained! AI Breakthrough in Protein Folding! What we know (& what we don't)

Tags:
deep learning, machine learning, arxiv, explained, neural networks, artificial intelligence, paper, transformer, rnn, lstm, seq2seq, gpt3, gpt-3, nlp, natural language processing, language modelling, feedback transformers, memory, attention, attention mechanism, attention is all you need, facebook ai, fair, long range, complex, reasoning, bert, autoregressive, reinforcement learning, abstraction, representation, higher layers, attention matrix, recurrent neural networks
