GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (Paper Explained)

Published on: 2020-07-01
Video Link: https://www.youtube.com/watch?v=1VdEw_mGjFk



Duration: 1:13:04


Google builds a 600-billion-parameter transformer for massively multilingual, massive machine translation. Interestingly, the larger model scale does not come from increasing the depth of the transformer, but from increasing the width of its feed-forward layers via a Mixture-of-Experts, combined with hard routing that parallelizes computation across up to 2048 TPUs. A very detailed engineering paper!
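To make the "width via Mixture-of-Experts with hard routing" idea concrete, here is a minimal toy sketch of a top-2 gated MoE feed-forward layer with a per-expert capacity limit. This is an illustrative NumPy reconstruction, not the paper's implementation; all sizes and the dense loop over tokens are assumptions for readability (the real system dispatches tokens to experts on separate TPU cores).

```python
import numpy as np

def moe_layer(tokens, w_gate, experts, capacity):
    """Toy top-2 Mixture-of-Experts feed-forward layer (illustrative only).

    tokens:   (num_tokens, d_model) input vectors
    w_gate:   (d_model, num_experts) gating weights
    experts:  list of (w_in, w_out) weight pairs, one feed-forward net per expert
    capacity: max tokens each expert may accept (overflowing tokens are dropped)
    """
    logits = tokens @ w_gate
    # softmax over experts to get gate values per token
    gates = np.exp(logits - logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)

    output = np.zeros_like(tokens)
    load = np.zeros(len(experts), dtype=int)
    for t, gate in enumerate(gates):
        # hard routing: each token only visits its top-2 experts
        for e in np.argsort(gate)[::-1][:2]:
            if load[e] >= capacity:          # expert buffer full: skip
                continue
            load[e] += 1
            w_in, w_out = experts[e]
            h = np.maximum(tokens[t] @ w_in, 0.0)   # expert's ReLU feed-forward
            output[t] += gate[e] * (h @ w_out)      # weighted by the gate value
    return output
```

Because each token activates only two experts regardless of how many exist, adding experts grows the parameter count without growing per-token compute, which is the core of the scaling trick.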

OUTLINE:
0:00 - Intro & Overview
4:10 - Main Results
5:10 - Mixture-of-Experts
16:00 - Difference to Scaling Classic Transformers
18:50 - Backpropagation in Mixture-of-Experts
20:05 - MoE Routing Algorithm in GShard
38:20 - GShard Einsum Examples
47:40 - Massively Multilingual Translation
56:00 - Results
1:11:30 - Conclusion & Comments

ERRATA:
I said the computation of MoE scales linearly, but actually, it's sub(!)-linear.
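The sub-linearity can be seen with a back-of-the-envelope cost model: with fixed top-2 routing, parameters grow linearly with the number of experts while per-token FLOPs stay constant. The sizes below are assumed for illustration, not the paper's exact dimensions.

```python
def moe_cost(num_experts, d_model=1024, d_ff=8192, k=2):
    """Toy cost model for an MoE feed-forward layer (assumed sizes).

    Each expert holds two weight matrices (d_model x d_ff and d_ff x d_model);
    each token runs through only its top-k experts (2 FLOPs per multiply-add).
    """
    params = num_experts * 2 * d_model * d_ff        # linear in num_experts
    flops_per_token = k * 2 * (2 * d_model * d_ff)   # independent of num_experts
    return params, flops_per_token
```

So going from 128 to 2048 experts multiplies the parameter count by 16 while leaving per-token compute unchanged: compute grows sub-linearly in model size.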

Paper: https://arxiv.org/abs/2006.16668

Abstract:
Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

Authors:
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher




Other Videos By Yannic Kilcher


2020-07-11 Deep Ensembles: A Loss Landscape Perspective (Paper Explained)
2020-07-10 Gradient Origin Networks (Paper Explained w/ Live Coding)
2020-07-09 NVAE: A Deep Hierarchical Variational Autoencoder (Paper Explained)
2020-07-08 Addendum for Supermasks in Superposition: A Closer Look (Paper Explained)
2020-07-07 SupSup: Supermasks in Superposition (Paper Explained)
2020-07-06 [Live Machine Learning Research] Plain Self-Ensembles (I actually DISCOVER SOMETHING) - Part 1
2020-07-05 SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization (Paper Explained)
2020-07-04 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (Paper Explained)
2020-07-03 On the Measure of Intelligence by François Chollet - Part 4: The ARC Challenge (Paper Explained)
2020-07-02 BERTology Meets Biology: Interpreting Attention in Protein Language Models (Paper Explained)
2020-07-01 GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (Paper Explained)
2020-06-30 Object-Centric Learning with Slot Attention (Paper Explained)
2020-06-29 Set Distribution Networks: a Generative Model for Sets of Images (Paper Explained)
2020-06-28 Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection (Paper Explained)
2020-06-27 Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures (Paper Explained)
2020-06-26 On the Measure of Intelligence by François Chollet - Part 3: The Math (Paper Explained)
2020-06-25 Discovering Symbolic Models from Deep Learning with Inductive Biases (Paper Explained)
2020-06-24 How I Read a Paper: Facebook's DETR (Video Tutorial)
2020-06-23 RepNet: Counting Out Time - Class Agnostic Video Repetition Counting in the Wild (Paper Explained)
2020-06-22 [Drama] Yann LeCun against Twitter on Dataset Bias
2020-06-21 SIREN: Implicit Neural Representations with Periodic Activation Functions (Paper Explained)



Tags:
deep learning
machine learning
arxiv
explained
neural networks
ai
artificial intelligence
paper
nlp
billion
parameters
float32
attention mechanism
transformer
scale
gpt-3
google
gshard
xla
sharding
parallelism
mixture of experts
trillion
tpus
distributed
m4
multilingual translation
natural language processing