Turing-NLG, DeepSpeed and the ZeRO optimizer

Subscribers: 284,000
Video Link: https://www.youtube.com/watch?v=tC01FRB0M7w
Duration: 21:18
Views: 10,268
Likes: 298


Microsoft has trained a 17-billion-parameter language model that achieves state-of-the-art perplexity. This video takes a look at the ZeRO optimizer that enabled this breakthrough. ZeRO lets you combine model and data parallelism without taking a huge hit in training speed.

https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/
https://github.com/microsoft/DeepSpeed
https://arxiv.org/abs/1910.02054
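
As a rough illustration (not from the video), this is approximately what enabling ZeRO partitioning looks like with the DeepSpeed library; the tiny model, batch size and learning rate below are placeholder values, and the snippet only sketches the documented deepspeed.initialize flow:

import torch
import deepspeed

# Placeholder model; in practice this would be a large Transformer.
model = torch.nn.Linear(1024, 1024)

# ZeRO is switched on through the DeepSpeed config.
# Stage 1 partitions optimizer states across data-parallel workers,
# stage 2 additionally partitions gradients, stage 3 also partitions parameters.
ds_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Training then goes through the engine's own backward/step, which handle
# the partitioned optimizer states and gradient communication, roughly:
#   loss = model_engine(batch).sum()
#   model_engine.backward(loss)
#   model_engine.step()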

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher

Tags:
deep learning
machine learning
nlp
natural language processing
machine translation
arxiv
attention mechanism
attention
transformer
seq2seq
bert
long sequence
memory
gpt-2
Megatron
Microsoft
distributed
parallelism