[T-Fixup] Improving Transformer Optimization Through Better Initialization | AISC

Video Link: https://www.youtube.com/watch?v=EpxilvBvAeQ



Duration: 34:47
696 views


Speaker(s): Gary Huang
Facilitator(s): Royal Sequiera, Nour Fahmy

Find the recording, slides, and more info at https://ai.science/e/t-fixup-improving-transformer-optimization-through-better-initialization--7SFFcJpCk07bPJ3tKdMP

Motivation / Abstract
The Transformer architecture has achieved considerable success recently; its key component is the attention layer, which enables the model to focus on important regions within an input sequence. Gradient optimization with attention layers can be notoriously difficult, requiring tricks such as learning rate warmup to prevent divergence. As Transformer models become larger and more expensive to train, recent research has focused on understanding and improving optimization in these architectures. In this work our contributions are two-fold: we first investigate and empirically validate the source of optimization problems in the encoder-decoder Transformer architecture; we then propose a new weight initialization scheme, with theoretical justification, that enables training without warmup or layer normalization. Empirical results on public machine translation benchmarks show that our approach achieves leading accuracy and makes it possible to train deep Transformer models with 200 layers in both encoder and decoder (over 1000 attention/MLP blocks) without difficulty.
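
For readers unfamiliar with the learning rate warmup mentioned in the abstract, here is a minimal sketch of the inverse-square-root schedule with linear warmup described in the original Transformer paper ("Attention Is All You Need"). The function name `warmup_lr` and the default values for `d_model` and `warmup_steps` are illustrative assumptions, not something taken from the talk or from the T-Fixup code.

```python
def warmup_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate at a given 1-indexed training step.

    Illustrative defaults (assumed, not from the talk): d_model=512,
    warmup_steps=4000. The rate rises linearly for `warmup_steps` steps,
    then decays proportionally to 1 / sqrt(step).
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    # The two terms inside min() are equal exactly at step == warmup_steps.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    # Print the rate at a few steps before, at, and after the warmup boundary.
    for step in (1, 1000, 4000, 16000):
        print(step, f"{warmup_lr(step):.6f}")
```

The rate peaks at `step == warmup_steps` and falls off afterwards. According to the abstract, the proposed initialization is meant to make this warmup phase unnecessary.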


------
#AISC hosts 3-5 live sessions like this on various AI research, engineering, and product topics every week! Visit https://ai.science for more details




Other Videos By LLMs Explained - Aggregate Intellect - AI.SCIENCE


2020-09-10 An overview of task-oriented dialog systems | AISC
2020-09-09 Targeted Machine Learning for Data Science | AISC
2020-09-08 Build next generation recommenders with NVIDIA Merlin | AISC
2020-09-02 Principal Neighbourhood Aggregation for Graph Nets | AISC
2020-09-01 DeepFakes & Explainable AI Applications in NLP, Biomedical & Malware Classification
2020-08-28 AI Ethics Then & Now: A Look Back on the Last Five Years | AISC
2020-08-27 Beyond Accuracy: Behavioral Testing of NLP Models with CheckList | AISC
2020-08-27 The Summary Loop: Learning to Write Abstractive Summaries Without Examples + Demo | AISC
2020-08-26 [MEM] Learning Permutation Invariant Representations using Memory Networks | AISC
2020-08-26 AI for Fun!
2020-08-25 [T-Fixup] Improving Transformer Optimization Through Better Initialization | AISC
2020-08-25 A review of ML for aerospace systems health management | AISC
2020-08-21 An Efficient Neighborhood-based Interaction Model for Recommendation on Heterogeneous Graph | AISC
2020-08-20 Overview of Synthetic Data and Simulations | AISC
2020-08-19 Discovering Symbolic Inductive Biases | AISC
2020-08-19 Product Ideation - Art of Finding the Right Problem to Work on! | AISC
2020-08-19 Pink Diamond - Data Driven Prediction of Venture Success | Workshop Capstone
2020-08-19 Review Nuggets - Mining Insight from Consumer Product Reviews | Workshop Capstone
2020-08-19 Fast Film - Emotionally Aware Movie Recommender | Workshop Capstone
2020-08-19 Acetock - Stock Prediction Tool for Amateur Investors | Workshop Capstone
2020-08-19 Saramsh - Patent Document Summarization using BART | Workshop Capstone