DeBERTa: Decoding-enhanced BERT with Disentangled Attention (Machine Learning Paper Explained)

Subscribers: 284,000
Video Link: https://www.youtube.com/watch?v=_c6A33Fg5Ns



Duration: 45:14
15,872 views


#deberta #bert #huggingface

DeBERTa by Microsoft is the next iteration of BERT-style self-attention transformer models, surpassing RoBERTa and setting a new state of the art on multiple NLP tasks. DeBERTa brings two key improvements: First, content and position information are treated separately in a new disentangled attention mechanism. Second, the model uses relative positional encodings throughout the body of the transformer and provides absolute positional encodings only at the very end. The resulting model is both more accurate on downstream tasks and needs fewer pretraining steps to reach good accuracy. Models are also available on Huggingface and on GitHub.
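
To make the disentangled attention concrete, below is a minimal sketch of the per-head attention scores in PyTorch. It only mirrors the paper's content-to-content, content-to-position, and position-to-content terms; the function and variable names (relative_bucket, disentangled_scores, the projection matrices) are illustrative assumptions, not the authors' implementation, and the double loop is written for clarity rather than speed.

import torch

def relative_bucket(i, j, k):
    # clipped relative distance delta(i, j) in [0, 2k), as defined in the paper
    d = i - j
    if d <= -k:
        return 0
    if d >= k:
        return 2 * k - 1
    return d + k

def disentangled_scores(hidden, rel_emb, Wq_c, Wk_c, Wq_r, Wk_r, k=512):
    # hidden: (n, d) content states; rel_emb: (2k, d) shared relative position embeddings
    n, d = hidden.shape
    Qc, Kc = hidden @ Wq_c, hidden @ Wk_c    # content projections
    Qr, Kr = rel_emb @ Wq_r, rel_emb @ Wk_r  # relative position projections
    scores = Qc @ Kc.T                       # content-to-content term
    for i in range(n):
        for j in range(n):
            c2p = Qc[i] @ Kr[relative_bucket(i, j, k)]  # content-to-position
            p2c = Kc[j] @ Qr[relative_bucket(j, i, k)]  # position-to-content
            scores[i, j] = scores[i, j] + c2p + p2c
    return scores / (3 * d) ** 0.5           # scale by sqrt(3d) before the softmax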

OUTLINE:
0:00 - Intro & Overview
2:15 - Position Encodings in Transformer's Attention Mechanism
9:55 - Disentangling Content & Position Information in Attention
21:35 - Disentangled Query & Key construction in the Attention Formula
25:50 - Efficient Relative Position Encodings
28:40 - Enhanced Mask Decoder using Absolute Position Encodings
35:30 - My Criticism of EMD
38:05 - Experimental Results
40:30 - Scaling up to 1.5 Billion Parameters
44:20 - Conclusion & Comments

Paper: https://arxiv.org/abs/2006.03654
Code: https://github.com/microsoft/DeBERTa
Huggingface models: https://huggingface.co/models?search=deberta
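
A quick, hedged usage sketch for the Huggingface checkpoints linked above, assuming the transformers library is installed and using the microsoft/deberta-base model id (any DeBERTa checkpoint from the search link works the same way):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base")
model = AutoModel.from_pretrained("microsoft/deberta-base")

inputs = tokenizer("DeBERTa disentangles content and position information.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch size, sequence length, hidden size)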

Abstract:
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8).
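
In equation form, the disentangled attention mentioned in the abstract is the same three-term decomposition as in the sketch above, stated exactly (notation following the paper: Q^c, K^c, V^c are content projections, Q^r, K^r are projections of the shared relative position embeddings, \delta(i,j) is the clipped relative distance, and d is the hidden dimension):

\tilde{A}_{i,j} = \underbrace{Q_i^c {K_j^c}^{\top}}_{\text{content-to-content}} + \underbrace{Q_i^c {K_{\delta(i,j)}^r}^{\top}}_{\text{content-to-position}} + \underbrace{K_j^c {Q_{\delta(j,i)}^r}^{\top}}_{\text{position-to-content}}, \qquad H_o = \operatorname{softmax}\!\left(\frac{\tilde{A}}{\sqrt{3d}}\right) V^c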

Authors: Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/
BiliBili: https://space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n




Other Videos By Yannic Kilcher


2021-04-07 PAIR AI Explorables | Is the problem in the data? Examples on Fairness, Diversity, and Bias.
2021-03-30 Machine Learning PhD Survival Guide 2021 | Advice on Topic Selection, Papers, Conferences & more!
2021-03-23 Is Google Translate Sexist? Gender Stereotypes in Statistical Machine Translation
2021-03-22 Perceiver: General Perception with Iterative Attention (Google DeepMind Research Paper Explained)
2021-03-16 Pretrained Transformers as Universal Computation Engines (Machine Learning Research Paper Explained)
2021-03-11 Yann LeCun - Self-Supervised Learning: The Dark Matter of Intelligence (FAIR Blog Post Explained)
2021-03-06 Apple or iPod??? Easy Fix for Adversarial Textual Attacks on OpenAI's CLIP Model! #Shorts
2021-03-05 Multimodal Neurons in Artificial Neural Networks (w/ OpenAI Microscope, Research Paper Explained)
2021-02-27 GLOM: How to represent part-whole hierarchies in a neural network (Geoff Hinton's Paper Explained)
2021-02-26 Linear Transformers Are Secretly Fast Weight Memory Systems (Machine Learning Paper Explained)
2021-02-25 DeBERTa: Decoding-enhanced BERT with Disentangled Attention (Machine Learning Paper Explained)
2021-02-19 Dreamer v2: Mastering Atari with Discrete World Models (Machine Learning Research Paper Explained)
2021-02-17 TransGAN: Two Transformers Can Make One Strong GAN (Machine Learning Research Paper Explained)
2021-02-14 NFNets: High-Performance Large-Scale Image Recognition Without Normalization (ML Paper Explained)
2021-02-11 Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention (AI Paper Explained)
2021-02-04 Deep Networks Are Kernel Machines (Paper Explained)
2021-02-02 Feedback Transformers: Addressing Some Limitations of Transformers with Feedback Memory (Explained)
2021-01-29 SingularityNET - A Decentralized, Open Market and Network for AIs (Whitepaper Explained)
2021-01-22 Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
2021-01-17 STOCHASTIC MEME DESCENT - Deep Learning Meme Review - Episode 2 (Part 2 of 2)
2021-01-12 OpenAI CLIP: Connecting Text and Images (Paper Explained)



Tags:
deep learning
machine learning
arxiv
explained
neural networks
ai
artificial intelligence
paper
deep learning tutorial
huggingface
huggingface transformers
microsoft
microsoft research
bert
roberta
deberta
nlp
natural language processing
glue
superglue
state of the art
transformers
attention
attention mechanism
disentanglement
disentangled representation
positional encodings
position embeddings
masked language modelling
pretraining
open source