NFNets: High-Performance Large-Scale Image Recognition Without Normalization (ML Paper Explained)

Subscribers: 284,000
Published: 2021-02-14
Video Link: https://www.youtube.com/watch?v=rNkHjZtH0RQ
Duration: 34:26
Views: 36,900

#nfnets #deepmind #machinelearning

Batch Normalization is a core component of modern deep learning. It enables training at higher batch sizes, prevents mean shift, provides implicit regularization, and allows networks to reach higher accuracies than they could without it. However, BatchNorm also has disadvantages, such as its dependence on batch size and its computational overhead, especially in distributed settings. Normalizer-Free Networks, developed at DeepMind, are a class of CNNs that achieve state-of-the-art classification accuracy on ImageNet without batch normalization. They do so by combining adaptive gradient clipping (AGC) with a number of general architecture improvements. The resulting networks train faster, are more accurate, and transfer better to downstream tasks. Code is provided in JAX.
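The core idea of AGC is simple: instead of clipping gradients by a fixed global norm, each parameter unit's gradient is rescaled whenever the ratio of its gradient norm to its weight norm exceeds a threshold λ. The sketch below is an illustrative NumPy version (function and parameter names are my own, not from the paper's code; the official JAX implementation is in the linked repo):

```python
import numpy as np

def unitwise_norm(x):
    # Norm per output unit (row) for 2-D+ weights; scalar norm for
    # 1-D parameters such as biases and gains.
    if x.ndim <= 1:
        return np.linalg.norm(x)
    return np.linalg.norm(x.reshape(x.shape[0], -1), axis=1, keepdims=True)

def adaptive_grad_clip(grad, weight, clip=0.01, eps=1e-3):
    """Rescale grad unit-wise so that ||G_i|| / max(||W_i||, eps) <= clip."""
    w_norm = np.maximum(unitwise_norm(weight), eps)  # eps guards zero-init weights
    g_norm = unitwise_norm(grad)
    max_norm = clip * w_norm
    # Rescale only the units whose gradient norm exceeds the threshold.
    scale = np.where(g_norm > max_norm, max_norm / np.maximum(g_norm, 1e-6), 1.0)
    return grad * scale
```

Because the threshold scales with the weight norm, AGC adapts per layer and per unit, which is what lets NFNets train stably at large batch sizes where fixed clipping fails.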

OUTLINE:
0:00 - Intro & Overview
2:40 - What's the problem with BatchNorm?
11:00 - Paper contribution Overview
13:30 - Beneficial properties of BatchNorm
15:30 - Previous work: NF-ResNets
18:15 - Adaptive Gradient Clipping
21:40 - AGC and large batch size
23:30 - AGC induces implicit dependence between training samples
28:30 - Are BatchNorm's problems solved?
30:00 - Network architecture improvements
31:10 - Comparison to EfficientNet
33:00 - Conclusion & Comments

Paper: https://arxiv.org/abs/2102.06171
Code: https://github.com/deepmind/deepmind-research/tree/master/nfnets

My Video on BatchNorm: https://www.youtube.com/watch?v=OioFONrSETc
My Video on ResNets: https://www.youtube.com/watch?v=GWt6Fu05voI

ERRATA (from Lucas Beyer): "I believe you missed the main concern with "batch cheating". It's for losses that act on the full batch, as opposed to on each sample individually.
For example, triplet in FaceNet or n-pairs in CLIP. BN allows for "shortcut" solution to loss. See also BatchReNorm paper."

Abstract:
Batch normalization is a key component of most image classification models, but it has many undesirable properties stemming from its dependence on the batch size and interactions between examples. Although recent work has succeeded in training deep ResNets without normalization layers, these models do not match the test accuracies of the best batch-normalized networks, and are often unstable for large learning rates or strong data augmentations. In this work, we develop an adaptive gradient clipping technique which overcomes these instabilities, and design a significantly improved class of Normalizer-Free ResNets. Our smaller models match the test accuracy of an EfficientNet-B7 on ImageNet while being up to 8.7x faster to train, and our largest models attain a new state-of-the-art top-1 accuracy of 86.5%. In addition, Normalizer-Free models attain significantly better performance than their batch-normalized counterparts when finetuning on ImageNet after large-scale pre-training on a dataset of 300 million labeled images, with our best models obtaining an accuracy of 89.2%. Our code is available at https://github.com/deepmind/deepmind-research/tree/master/nfnets

Authors: Andrew Brock, Soham De, Samuel L. Smith, Karen Simonyan

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/
BiliBili: https://space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n




Other Videos By Yannic Kilcher


2021-03-22 Perceiver: General Perception with Iterative Attention (Google DeepMind Research Paper Explained)
2021-03-16 Pretrained Transformers as Universal Computation Engines (Machine Learning Research Paper Explained)
2021-03-11 Yann LeCun - Self-Supervised Learning: The Dark Matter of Intelligence (FAIR Blog Post Explained)
2021-03-06 Apple or iPod??? Easy Fix for Adversarial Textual Attacks on OpenAI's CLIP Model! #Shorts
2021-03-05 Multimodal Neurons in Artificial Neural Networks (w/ OpenAI Microscope, Research Paper Explained)
2021-02-27 GLOM: How to represent part-whole hierarchies in a neural network (Geoff Hinton's Paper Explained)
2021-02-26 Linear Transformers Are Secretly Fast Weight Memory Systems (Machine Learning Paper Explained)
2021-02-25 DeBERTa: Decoding-enhanced BERT with Disentangled Attention (Machine Learning Paper Explained)
2021-02-19 Dreamer v2: Mastering Atari with Discrete World Models (Machine Learning Research Paper Explained)
2021-02-17 TransGAN: Two Transformers Can Make One Strong GAN (Machine Learning Research Paper Explained)
2021-02-14 NFNets: High-Performance Large-Scale Image Recognition Without Normalization (ML Paper Explained)
2021-02-11 Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention (AI Paper Explained)
2021-02-04 Deep Networks Are Kernel Machines (Paper Explained)
2021-02-02 Feedback Transformers: Addressing Some Limitations of Transformers with Feedback Memory (Explained)
2021-01-29 SingularityNET - A Decentralized, Open Market and Network for AIs (Whitepaper Explained)
2021-01-22 Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
2021-01-17 STOCHASTIC MEME DESCENT - Deep Learning Meme Review - Episode 2 (Part 2 of 2)
2021-01-12 OpenAI CLIP: Connecting Text and Images (Paper Explained)
2021-01-06 OpenAI DALL·E: Creating Images from Text (Blog Post Explained)
2020-12-26 Extracting Training Data from Large Language Models (Paper Explained)
2020-12-24 MEMES IS ALL YOU NEED - Deep Learning Meme Review - Episode 2 (Part 1 of 2)



Tags:
deep learning
machine learning
arxiv
explained
neural networks
ai
artificial intelligence
paper
deep learning tutorial
machine learning tutorial
machine learning explained
batch normalization
jax
layer normalization
gradient clipping
weight standardization
normalizer-free
nfnets
nfnet
nfresnet
deepmind
deep mind
best neural network
imagenet
best imagenet model
distributed training
mean shift
batch norm
batchnorm
nfnets code
deep learning code
ml code