Hopfield Networks is All You Need (Paper Explained)

Channel: Yannic Kilcher

Subscribers: 291,000

Published on: August 9, 2020 3:51:38 PM

Video Link: https://www.youtube.com/watch?v=nv6oFDp6rNQ

Duration: 1:05:16

79,901 views

2,295

#ai #transformer #attention

Hopfield Networks are one of the classic models of biological memory networks. This paper generalizes modern Hopfield Networks to continuous states and shows that the corresponding update rule is equal to the attention mechanism used in modern Transformers. It further analyzes a pre-trained BERT model through the lens of Hopfield Networks and uses a Hopfield Attention Layer to perform Immune Repertoire Classification.
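
To make the claimed equivalence concrete, here is a minimal NumPy sketch, not the authors' code. It assumes the continuous Hopfield update takes the form state_new = softmax(beta * X state) X, with the stored patterns as rows of X, and compares it to scaled dot-product attention in which the keys and values are both set to the stored patterns and beta = 1/sqrt(d). The learned query, key, and value projections of a real Transformer are omitted, and the function names (softmax, hopfield_update, attention) are illustrative choices.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

def hopfield_update(patterns, state, beta):
    """One continuous Hopfield update.

    patterns: (N, d) array, one stored pattern per row (assumed layout).
    state:    (d,) array, the current state / query pattern.
    beta:     inverse temperature scaling the similarity scores.
    """
    weights = softmax(beta * (patterns @ state))  # similarity to each stored pattern
    return weights @ patterns                     # weighted average of stored patterns

def attention(queries, keys, values):
    """Scaled dot-product attention without learned projections."""
    d_k = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
d = 8
patterns = rng.standard_normal((5, d))  # toy stored patterns
state = rng.standard_normal(d)          # toy query state

via_hopfield = hopfield_update(patterns, state, beta=1.0 / np.sqrt(d))
via_attention = attention(state[None, :], patterns, patterns)[0]
print(np.allclose(via_hopfield, via_attention))
```

Under these assumptions the two results agree up to floating-point rounding, so the final line should print True.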

OUTLINE:
0:00 - Intro & Overview
1:35 - Binary Hopfield Networks
5:55 - Continuous Hopfield Networks
8:15 - Update Rules & Energy Functions
13:30 - Connection to Transformers
14:35 - Hopfield Attention Layers
26:45 - Theoretical Analysis
48:10 - Investigating BERT
1:02:30 - Immune Repertoire Classification

Paper: https://arxiv.org/abs/2008.02217
Code: https://github.com/ml-jku/hopfield-layers
Immune Repertoire Classification Paper: https://arxiv.org/abs/2007.13505

My Video on Attention: https://youtu.be/iDulhoQ2pro
My Video on BERT: https://youtu.be/-9evrZnBorM

Abstract:
We show that the transformer attention mechanism is the update rule of a modern Hopfield network with continuous states. This new Hopfield network can store exponentially (with the dimension) many patterns, converges with one update, and has exponentially small retrieval errors. The number of stored patterns is traded off against convergence speed and retrieval error. The new Hopfield network has three types of energy minima (fixed points of the update): (1) global fixed point averaging over all patterns, (2) metastable states averaging over a subset of patterns, and (3) fixed points which store a single pattern. Transformer and BERT models operate in their first layers preferably in the global averaging regime, while they operate in higher layers in metastable states. The gradient in transformers is maximal for metastable states, is uniformly distributed for global averaging, and vanishes for a fixed point near a stored pattern. Using the Hopfield network interpretation, we analyzed learning of transformer and BERT models. Learning starts with attention heads that average and then most of them switch to metastable states. However, the majority of heads in the first layers still averages and can be replaced by averaging, e.g. our proposed Gaussian weighting. In contrast, heads in the last layers steadily learn and seem to use metastable states to collect information created in lower layers. These heads seem to be a promising target for improving transformers. Neural networks with Hopfield networks outperform other methods on immune repertoire classification, where the Hopfield net stores several hundreds of thousands of patterns. We provide a new PyTorch layer called "Hopfield", which allows to equip deep learning architectures with modern Hopfield networks as a new powerful concept comprising pooling, memory, and attention. GitHub: https://github.com/ml-jku/hopfield-layers
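
The three kinds of fixed points named in the abstract can be illustrated with a toy example. The sketch below is an assumption-laden illustration rather than the paper's experiment: it uses the update state_new = softmax(beta * X state) X with four hand-picked patterns, two of which are nearly identical, and three arbitrary values of the inverse temperature beta. The function name hopfield_retrieve and all numeric values are hypothetical choices for demonstration.

```python
import numpy as np

def hopfield_retrieve(patterns, query, beta, steps=1):
    """Apply the softmax-based Hopfield update `steps` times.

    patterns: (N, d) array, one stored pattern per row (assumed layout).
    query:    (d,) array, the initial state.
    beta:     inverse temperature; larger values sharpen the softmax.
    """
    state = np.asarray(query, dtype=float)
    for _ in range(steps):
        logits = beta * (patterns @ state)
        logits = logits - logits.max()      # shift for numerical stability
        weights = np.exp(logits)
        weights = weights / weights.sum()
        state = weights @ patterns
    return state

# Toy memory: patterns 0 and 1 are almost identical, 2 and 3 are well separated.
patterns = np.array([
    [1.0,  0.05, 0.0, 0.0],
    [1.0, -0.05, 0.0, 0.0],
    [0.0,  0.0,  1.0, 0.0],
    [0.0,  0.0,  0.0, 1.0],
])
query = patterns[0]

# Small beta: near-uniform weights, the state lands near the mean of all patterns.
global_avg = hopfield_retrieve(patterns, query, beta=1e-3)

# Medium beta: the two similar patterns cannot be told apart, so the state
# settles on their average (a metastable state over a subset of patterns).
metastable = hopfield_retrieve(patterns, query, beta=32.0, steps=5)

# Large beta: a single update returns the individual stored pattern.
single = hopfield_retrieve(patterns, query, beta=1e4)

print(global_avg, metastable, single, sep="\n")
```

In this toy setting, how well separated the patterns are relative to beta decides which regime the update falls into, which mirrors the trade-off the abstract describes between averaging, metastable states, and single-pattern retrieval.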

Authors: Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, Victor Greiff, David Kreil, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Other Videos By Yannic Kilcher

2020-09-13	Assessing Game Balance with AlphaZero: Exploring Alternative Rule Sets in Chess (Paper Explained)
2020-09-07	Learning to summarize from human feedback (Paper Explained)
2020-09-02	Self-classifying MNIST Digits (Paper Explained)
2020-08-28	Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation (Paper Explained)
2020-08-26	Radioactive data: tracing through training (Paper Explained)
2020-08-23	Fast reinforcement learning with generalized policy updates (Paper Explained)
2020-08-20	What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study (Paper Explained)
2020-08-18	[Rant] REVIEWER #2: How Peer Review is FAILING in Machine Learning
2020-08-14	REALM: Retrieval-Augmented Language Model Pre-Training (Paper Explained)
2020-08-12	Meta-Learning through Hebbian Plasticity in Random Networks (Paper Explained)
2020-08-09	Hopfield Networks is All You Need (Paper Explained)
2020-08-06	I TRAINED AN AI TO SOLVE 2+2 (w/ Live Coding)
2020-08-04	PCGRL: Procedural Content Generation via Reinforcement Learning (Paper Explained)
2020-08-02	Big Bird: Transformers for Longer Sequences (Paper Explained)
2020-07-29	Self-training with Noisy Student improves ImageNet classification (Paper Explained)
2020-07-26	[Classic] Playing Atari with Deep Reinforcement Learning (Paper Explained)
2020-07-23	[Classic] ImageNet Classification with Deep Convolutional Neural Networks (Paper Explained)
2020-07-21	Neural Architecture Search without Training (Paper Explained)
2020-07-19	[Classic] Generative Adversarial Networks (Paper Explained)
2020-07-16	[Classic] Word2Vec: Distributed Representations of Words and Phrases and their Compositionality
2020-07-14	[Classic] Deep Residual Learning for Image Recognition (Paper Explained)

Tags:

deep learning

machine learning

arxiv

explained

neural networks

artificial intelligence

paper

schmidhuber

hochreiter

lstm

gru

rnn

hopfield

attention

attention is all you need

transformer

bert

query

key

value

routing

pattern

retrieval

store

error

exponential

binary

continuous

hopfield network

lse

energy function

update rule

metastable

separation
