SupSup: Supermasks in Superposition (Paper Explained)

Subscribers: 291,000
Video Link: https://www.youtube.com/watch?v=3jT1qJ8ETzk
Duration: 1:00:07
Views: 7,730


Supermasks are binary masks over a randomly initialized neural network, chosen so that the masked subnetwork performs well on a particular task. This paper tackles (sequential) lifelong learning by training one supermask per task while keeping the randomly initialized base network fixed. At inference time, the system derives the task ID of a data point by minimizing the output entropy, and can distinguish up to 2500 tasks automatically.
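For intuition, here is a minimal sketch of how one such supermask layer can be trained (assuming PyTorch; class and variable names are illustrative, not the authors' exact code). The randomly initialized weights are frozen and never updated; instead a real-valued score is learned per weight, the forward pass keeps only the top-scoring fraction of weights, and gradients flow straight through the binarization:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GetSubnet(torch.autograd.Function):
    @staticmethod
    def forward(ctx, scores, k):
        # Binarize: keep the top-k fraction of weights by score, zero the rest.
        out = torch.zeros_like(scores)
        _, idx = scores.flatten().sort()
        j = int((1 - k) * scores.numel())
        out.view(-1)[idx[j:]] = 1.0
        return out

    @staticmethod
    def backward(ctx, g):
        # Straight-through estimator: pass gradients to the scores unchanged.
        return g, None

class SupermaskLinear(nn.Module):
    def __init__(self, in_features, out_features, k=0.5):
        super().__init__()
        # Frozen random base weights, shared across all tasks.
        self.weight = nn.Parameter(torch.empty(out_features, in_features),
                                   requires_grad=False)
        nn.init.kaiming_normal_(self.weight)
        # Trainable scores, one per weight; these define the mask.
        self.scores = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.k = k

    def forward(self, x):
        mask = GetSubnet.apply(self.scores, self.k)
        return F.linear(x, self.weight * mask)

One such mask is trained per task with the ordinary task loss. Since the base weights are shared and only a binary mask must be stored per task, the per-task memory cost stays small.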

OUTLINE:
0:00 - Intro & Overview
1:20 - Catastrophic Forgetting
5:20 - Supermasks
9:35 - Lifelong Learning using Supermasks
11:15 - Inference Time Task Discrimination by Entropy
15:05 - Mask Superpositions
24:20 - Proof-of-Concept, Task Given at Inference
30:15 - Binary Maximum Entropy Search
32:00 - Task Not Given at Inference
37:15 - Task Not Given at Training
41:35 - Ablations
45:05 - Superfluous Neurons
51:10 - Task Selection by Detecting Outliers
57:40 - Encoding Masks in Hopfield Networks
59:40 - Conclusion

Paper: https://arxiv.org/abs/2006.14769
Code: https://github.com/RAIVNLab/supsup

My Video about Lottery Tickets: https://youtu.be/ZVVnvZdUMUk
My Video about Supermasks: https://youtu.be/jhCInVFE2sc

Abstract:
We present the Supermasks in Superposition (SupSup) model, capable of sequentially learning thousands of tasks without catastrophic forgetting. Our approach uses a randomly initialized, fixed base network and for each task finds a subnetwork (supermask) that achieves good performance. If task identity is given at test time, the correct subnetwork can be retrieved with minimal memory usage. If not provided, SupSup can infer the task using gradient-based optimization to find a linear superposition of learned supermasks which minimizes the output entropy. In practice we find that a single gradient step is often sufficient to identify the correct mask, even among 2500 tasks. We also showcase two promising extensions. First, SupSup models can be trained entirely without task identity information, as they may detect when they are uncertain about new data and allocate an additional supermask for the new training distribution. Finally, the entire, growing set of supermasks can be stored in a constant-sized reservoir by implicitly storing them as attractors in a fixed-sized Hopfield network.
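To make the "single gradient step" claim concrete, here is a hedged sketch of one-shot task inference for a single masked linear layer (again assuming PyTorch; infer_task and its arguments are illustrative, not the authors' code). The K learned masks are superimposed with coefficients alpha initialized to 1/K, the output entropy is differentiated once with respect to alpha, and the mask whose upweighting would most reduce the entropy is selected:

import torch

def infer_task(x, weight, masks):
    # x: (batch, in_features) data from an unknown task
    # weight: (out_features, in_features) frozen random weights
    # masks: list of K binary masks, each shaped like weight
    K = len(masks)
    alpha = torch.full((K,), 1.0 / K, requires_grad=True)
    # Superposition: one forward pass through a convex combination of masks.
    mixed = sum(a * m for a, m in zip(alpha, masks))
    logits = x @ (weight * mixed).t()
    p = torch.softmax(logits, dim=-1)
    entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=-1).mean()
    # One backward pass; the most negative gradient entry marks the mask
    # whose increased weight would most reduce the output entropy.
    (g,) = torch.autograd.grad(entropy, alpha)
    return int((-g).argmax())

This costs one forward and one backward pass regardless of K, which is why, per the abstract, a single gradient step is often enough even among 2500 tasks.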

Authors: Mitchell Wortsman, Vivek Ramanujan, Rosanne Liu, Aniruddha Kembhavi, Mohammad Rastegari, Jason Yosinski, Ali Farhadi

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher




Other Videos By Yannic Kilcher


2020-07-23  [Classic] ImageNet Classification with Deep Convolutional Neural Networks (Paper Explained)
2020-07-21  Neural Architecture Search without Training (Paper Explained)
2020-07-19  [Classic] Generative Adversarial Networks (Paper Explained)
2020-07-16  [Classic] Word2Vec: Distributed Representations of Words and Phrases and their Compositionality
2020-07-14  [Classic] Deep Residual Learning for Image Recognition (Paper Explained)
2020-07-12  I'M TAKING A BREAK... (Channel Update July 2020)
2020-07-11  Deep Ensembles: A Loss Landscape Perspective (Paper Explained)
2020-07-10  Gradient Origin Networks (Paper Explained w/ Live Coding)
2020-07-09  NVAE: A Deep Hierarchical Variational Autoencoder (Paper Explained)
2020-07-08  Addendum for Supermasks in Superposition: A Closer Look (Paper Explained)
2020-07-07  SupSup: Supermasks in Superposition (Paper Explained)
2020-07-06  [Live Machine Learning Research] Plain Self-Ensembles (I actually DISCOVER SOMETHING) - Part 1
2020-07-05  SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization (Paper Explained)
2020-07-04  Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (Paper Explained)
2020-07-03  On the Measure of Intelligence by François Chollet - Part 4: The ARC Challenge (Paper Explained)
2020-07-02  BERTology Meets Biology: Interpreting Attention in Protein Language Models (Paper Explained)
2020-07-01  GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (Paper Explained)
2020-06-30  Object-Centric Learning with Slot Attention (Paper Explained)
2020-06-29  Set Distribution Networks: a Generative Model for Sets of Images (Paper Explained)
2020-06-28  Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection (Paper Explained)
2020-06-27  Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures (Paper Explained)



Tags:
deep learning
machine learning
arxiv
explained
neural networks
ai
artificial intelligence
paper
supsup
supermasks
lottery ticket
lottery ticket hypothesis
gradient
entropy
surplus
superfluous neurons
lifelong learning
multitask learning
catastrophic forgetting
continuous learning
binary mask
random network
optimization
hopfield network
gradient descent
superposition