Turing-NLG, DeepSpeed and the ZeRO optimizer
Microsoft has trained a 17-billion-parameter language model that achieves state-of-the-art perplexity. This video takes a look at the ZeRO optimizer that enabled this breakthrough. ZeRO lets you combine model and data parallelism without taking a big hit in training speed.
https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/
https://github.com/microsoft/DeepSpeed
https://arxiv.org/abs/1910.02054
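
For reference, a minimal sketch of what enabling ZeRO through DeepSpeed typically looks like. The toy model, hyperparameters, and the `data_loader` are placeholders (not the actual Turing-NLG setup), and in practice this would be launched across GPUs with the `deepspeed` launcher:

import torch
import deepspeed

# Stand-in for a large transformer; any torch.nn.Module works here.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 1,  # partition optimizer states across data-parallel workers
    },
}

# deepspeed.initialize wraps the model in an engine that manages the
# partitioned optimizer states and the data-parallel communication.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for batch, labels in data_loader:  # assumes an existing data_loader
    batch = batch.to(model_engine.device).half()    # fp16 is enabled above
    labels = labels.to(model_engine.device).half()
    loss = torch.nn.functional.mse_loss(model_engine(batch), labels)
    model_engine.backward(loss)  # DeepSpeed handles loss scaling and gradient reduction
    model_engine.step()          # updates the partitioned optimizer states
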
Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher