Movement Pruning: Adaptive Sparsity by Fine-Tuning (Paper Explained)

Subscribers:
284,000
Published on ● Video Link: https://www.youtube.com/watch?v=nxEr4VNgYOE



Duration: 30:11
4,514 views
191


Deep neural networks are large models and pruning has become an important part of ML product pipelines, making models small while keeping their performance high. However, the classic pruning method, Magnitude Pruning, is suboptimal in models that are obtained by transfer learning. This paper proposes a solution, called Movement Pruning and shows its superior performance.

OUTLINE:
0:00 - Intro & High-Level Overview
0:55 - Magnitude Pruning
4:25 - Transfer Learning
7:25 - The Problem with Magnitude Pruning in Transfer Learning
9:20 - Movement Pruning
22:20 - Experiments
24:20 - Improvements via Distillation
26:40 - Analysis of the Learned Weights

Paper: https://arxiv.org/abs/2005.07683
Code: https://github.com/huggingface/transformers/tree/master/examples/movement-pruning

Abstract:
Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning; however, it is less effective in the transfer learning regime that has become standard for state-of-the-art natural language processing applications. We propose the use of movement pruning, a simple, deterministic first-order weight pruning method that is more adaptive to pretrained model fine-tuning. We give mathematical foundations to the method and compare it to existing zeroth- and first-order pruning methods. Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes. When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.

Authors: Victor Sanh, Thomas Wolf, Alexander M. Rush

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher




Other Videos By Yannic Kilcher


2020-06-14SynFlow: Pruning neural networks without any data by iteratively conserving synaptic flow
2020-06-13Deep Differential System Stability - Learning advanced computations from examples (Paper Explained)
2020-06-12VirTex: Learning Visual Representations from Textual Annotations (Paper Explained)
2020-06-11Linformer: Self-Attention with Linear Complexity (Paper Explained)
2020-06-10End-to-End Adversarial Text-to-Speech (Paper Explained)
2020-06-09TransCoder: Unsupervised Translation of Programming Languages (Paper Explained)
2020-06-08JOIN ME for the NeurIPS 2020 Flatland Multi-Agent RL Challenge!
2020-06-07BLEURT: Learning Robust Metrics for Text Generation (Paper Explained)
2020-06-06Synthetic Petri Dish: A Novel Surrogate Model for Rapid Architecture Search (Paper Explained)
2020-06-05CornerNet: Detecting Objects as Paired Keypoints (Paper Explained)
2020-06-04Movement Pruning: Adaptive Sparsity by Fine-Tuning (Paper Explained)
2020-06-03Learning To Classify Images Without Labels (Paper Explained)
2020-06-02On the Measure of Intelligence by François Chollet - Part 1: Foundations (Paper Explained)
2020-06-01Dynamics-Aware Unsupervised Discovery of Skills (Paper Explained)
2020-05-31Synthesizer: Rethinking Self-Attention in Transformer Models (Paper Explained)
2020-05-30[Code] How to use Facebook's DETR object detection algorithm in Python (Full Tutorial)
2020-05-29GPT-3: Language Models are Few-Shot Learners (Paper Explained)
2020-05-28DETR: End-to-End Object Detection with Transformers (Paper Explained)
2020-05-27mixup: Beyond Empirical Risk Minimization (Paper Explained)
2020-05-26A critical analysis of self-supervision, or what we can learn from a single image (Paper Explained)
2020-05-25Deep image reconstruction from human brain activity (Paper Explained)



Tags:
deep learning
machine learning
arxiv
explained
neural networks
ai
artificial intelligence
paper
prune
pruning
transfer learning
weights
magnitude
gradient
moving
small
importance
huggingface
nlp
natural language processing
squad
mnli
bert
transformer
attention
cnn
distillation
teacher
sparse
sparsity
question answering
mobile
edge
tune
fine-tune