Pretrained Transformers as Universal Computation Engines (Machine Learning Research Paper Explained)

Published on 2021-03-16 ● Video Link: https://www.youtube.com/watch?v=Elxn8rS88bI



Duration: 34:02


#universalcomputation #pretrainedtransformers #finetuning

Large-scale pre-training followed by fine-tuning is a common recipe for success with transformer models in machine learning. However, such transfer learning usually pre-trains the model on the same modality as the final task, or on a very similar one. This paper demonstrates that transformers can be fine-tuned on completely different modalities, for example going from language to vision. Moreover, the authors show that this works while freezing the self-attention and feedforward layers and tuning less than 0.1% of all parameters. The paper further claims that language modeling is a superior pre-training task for such cross-domain transfer, and it supports its claims with various ablation studies.
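To get a feel for how small the tuned share can be once the self-attention and feedforward weights are frozen, here is a rough back-of-the-envelope sketch. Everything in it is an assumption made for illustration: the parameter names are hypothetical, the sizes (12 blocks, width 768, feedforward width 3072) are GPT-2-small-like values, and the count covers only the residual blocks plus a final LayerNorm. It leaves out embeddings and any task-specific input and output layers, so it does not reproduce the paper's own accounting.

```python
from math import prod


def gpt2_like_shapes(n_layer=12, d_model=768, d_ff=3072):
    """Parameter shapes for a GPT-2-style stack of residual blocks.

    The names and sizes are illustrative assumptions, not taken from the
    paper's code base.
    """
    shapes = {}
    for i in range(n_layer):
        p = f"block{i}."
        # LayerNorm before self-attention
        shapes[p + "ln_1.weight"] = (d_model,)
        shapes[p + "ln_1.bias"] = (d_model,)
        # Self-attention: fused query/key/value projection and output projection
        shapes[p + "attn.qkv.weight"] = (d_model, 3 * d_model)
        shapes[p + "attn.qkv.bias"] = (3 * d_model,)
        shapes[p + "attn.proj.weight"] = (d_model, d_model)
        shapes[p + "attn.proj.bias"] = (d_model,)
        # LayerNorm before the feedforward layer
        shapes[p + "ln_2.weight"] = (d_model,)
        shapes[p + "ln_2.bias"] = (d_model,)
        # Feedforward layer: expand to d_ff, then project back to d_model
        shapes[p + "mlp.fc.weight"] = (d_model, d_ff)
        shapes[p + "mlp.fc.bias"] = (d_ff,)
        shapes[p + "mlp.proj.weight"] = (d_ff, d_model)
        shapes[p + "mlp.proj.bias"] = (d_model,)
    # Final LayerNorm after the last block
    shapes["ln_f.weight"] = (d_model,)
    shapes["ln_f.bias"] = (d_model,)
    return shapes


def is_frozen(name):
    """Freeze self-attention and feedforward weights; LayerNorm stays trainable."""
    return ".attn." in name or ".mlp." in name


def trainable_fraction(shapes):
    """Return (trainable, total, trainable / total) parameter counts."""
    total = sum(prod(s) for s in shapes.values())
    trainable = sum(prod(s) for n, s in shapes.items() if not is_frozen(n))
    return trainable, total, trainable / total


if __name__ == "__main__":
    trainable, total, frac = trainable_fraction(gpt2_like_shapes())
    print(f"trainable: {trainable:,} of {total:,} ({frac:.4%})")
```

Under these assumptions, the LayerNorm parameters that remain trainable are a tiny slice of the block weights. Adding embeddings or input and output layers to the count would shift the ratio, which is why this is only a sanity check on the order of magnitude.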

OUTLINE:
0:00 - Intro & Overview
2:00 - Frozen Pretrained Transformers
4:50 - Evaluated Tasks
10:05 - The Importance of Training LayerNorm
17:10 - Modality Transfer
25:10 - Network Architecture Ablation
26:10 - Evaluation of the Attention Mask
27:20 - Are FPTs Overfitting or Underfitting?
28:20 - Model Size Ablation
28:50 - Is Initialization All You Need?
31:40 - Full Model Training Overfits
32:15 - Again the Importance of Training LayerNorm
33:10 - Conclusions & Comments

Paper: https://arxiv.org/abs/2103.05247
Code: https://github.com/kzl/universal-computation

Abstract:
We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning -- in particular, without finetuning of the self-attention and feedforward layers of the residual blocks. We consider such a model, which we call a Frozen Pretrained Transformer (FPT), and study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction. In contrast to prior works which investigate finetuning on the same modality as the pretraining dataset, we show that pretraining on natural language improves performance and compute efficiency on non-language downstream tasks. In particular, we find that such pretraining enables FPT to generalize in zero-shot to these modalities, matching the performance of a transformer fully trained on these tasks.

Authors: Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/
BiliBili: https://space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n







Tags:
deep learning
machine learning
arxiv
explained
neural networks
artificial intelligence
paper
what is deep learning
deep learning tutorial
introduction to deep learning
berkeley
google brain
facebook ai research
pretrained transformers
gpt-3
huggingface
language model
fine-tuning
finetuning
out of domain generalization
universal computation
can transformers solve xor
transformer mnist
transformer cifar10
fine tuning transformer
gpt-2
pretrained language model