Understanding Knowledge Distillation in Neural Sequence Generation

Subscribers: 344,000
Published on: 2020-01-17 ● Video Link: https://www.youtube.com/watch?v=5qmMvFRaEig
Duration: 1:18:06
5,275 views ● 79 likes


Sequence-level knowledge distillation (KD) -- learning a student model with targets decoded from a pre-trained teacher model -- has been widely used in sequence generation applications (e.g., model compression, non-autoregressive translation (NAT), and low-resource translation). However, the underlying reasons for this success have so far remained unclear. In this talk, we try to better understand KD in two scenarios: (1) learning a weaker student from a strong teacher model while keeping the same parallel data used to train the teacher; and (2) learning a student from a teacher of equal size, where the targets are generated from additional monolingual data.
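
The recipe described in the abstract is simple to sketch: decode the training sources with the pre-trained teacher, then train the student on those decoded outputs instead of the original references. Below is a minimal illustrative sketch in Python using the Hugging Face transformers library; the model name, beam size, and the distill_targets helper are assumptions for illustration, not the exact setup used in the talk.

# Minimal sketch of sequence-level knowledge distillation for translation.
# Model name and hyperparameters are illustrative assumptions only.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

teacher_name = "Helsinki-NLP/opus-mt-de-en"   # assumed pre-trained teacher
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_name).eval()

def distill_targets(source_sentences, beam_size=5, max_len=128):
    """Decode each source with the teacher (beam search) to produce the
    distilled targets that the student will be trained on."""
    with torch.no_grad():
        batch = tokenizer(source_sentences, return_tensors="pt", padding=True)
        outputs = teacher.generate(**batch, num_beams=beam_size, max_length=max_len)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# The student (e.g., a smaller or non-autoregressive model) is then trained
# with ordinary cross-entropy on (source, distilled target) pairs rather
# than (source, reference) pairs.
sources = ["Ein kleines Beispiel.", "Wissensdestillation ist nützlich."]
for src, hyp in zip(sources, distill_targets(sources)):
    print(src, "->", hyp)

In scenario (2) from the abstract, the same decoding step would be applied to additional monolingual source data rather than to the original parallel training sources.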

Talk slides: https://www.microsoft.com/en-us/research/uploads/prod/2020/01/Understanding-Knowledge-Distillation-in-Neural-Sequence-Generation.pdf

See more on this and other talks at Microsoft Research: https://www.microsoft.com/en-us/research/video/understanding-knowledge-distillation-in-neural-sequence-generation/




Other Videos By Microsoft Research


2020-02-21 Information Agents: Directions and Futures (2001)
2020-02-19 Democratizing data, thinking backwards and setting North Star goals with Dr. Donald Kossmann
2020-02-19 Behind the scenes on Team Explorer’s practice run at Microsoft for the DARPA SubT Urban Challenge
2020-02-12 Microsoft Scheduler and dawn of Intelligent PDAs with Dr. Pamela Bhattacharya | Podcast
2020-02-05 Responsible AI with Dr. Saleema Amershi | Podcast
2020-02-03 Perspectives on Cross-Validation
2020-01-30 Data Science Summer School 2019 - Replicating "An Empirical Analysis of Racial Differences in Po..."
2020-01-29 Going deep on deep learning with Dr. Jianfeng Gao | Podcast
2020-01-22 Innovating in India with Dr. Sriram Rajamani [Podcast]
2020-01-17 Underestimating the challenge of cognitive disabilities (and digital literacy)
2020-01-17 Understanding Knowledge Distillation in Neural Sequence Generation
2020-01-17 'F' to 'A' on the N.Y. Regents Science Exams: An Overview of the Aristo Project
2020-01-07 Private AI Bootcamp Keynote – Sreekanth Kannepalli
2020-01-07 Introduction to CKKS (Approximate Homomorphic Encryption)
2020-01-07 Private AI Bootcamp Competition: Team 3
2020-01-07 Conversations Based on Search Engine Result Pages
2020-01-07 The Ethical Algorithm
2020-01-07 Efficient Forward Architecture Search
2020-01-03 Fireside Chat with Anca Dragan
2020-01-02 Precision Health Engineering - Designing Health-Centered Mundane Technology to Increase Adherence
2019-12-30 Checkpointing the Un-checkpointable: the Split-Process Approach for MPI and Formal Verification



Tags:
Knowledge Distillation
Neural Sequence Generation
sequence generation applications
model compression
non-autoregressive translation
Akiko Eriguchi
Jiatao Gu
Microsoft Research