Why do large batch sized trainings perform poorly in SGD? - Generalization Gap Explained | AISC

Published on 2019-04-18 ● Video Link: https://www.youtube.com/watch?v=crag6bMM-0k



Duration: 5:15
1,609 views


5-min ML Paper Challenge
Presenter: https://www.linkedin.com/in/xiyangchen/

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
https://arxiv.org/abs/1609.04836

The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say 32-512 data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.
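As a rough illustration of the batch-size knob the abstract refers to, here is a minimal PyTorch-style sketch (not from the paper or the talk): the only thing that changes between the small-batch and large-batch regimes is how many samples are used to estimate each gradient step. The model, synthetic data, and hyperparameters below are assumptions for illustration only.

# Minimal sketch, assuming PyTorch; model/data/hyperparameters are illustrative only.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train(batch_size, steps=200, lr=0.1):
    torch.manual_seed(0)
    X = torch.randn(4096, 20)                      # synthetic inputs (assumption)
    y = (X.sum(dim=1, keepdim=True) > 0).float()   # synthetic binary labels
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)

    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()

    step = 0
    while step < steps:
        for xb, yb in loader:
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)  # gradient estimated from this mini-batch only
            loss.backward()
            opt.step()
            step += 1
            if step >= steps:
                break
    return loss.item()

# Small-batch regime (e.g. 32-512 samples per step) vs. a large-batch run (here the full dataset).
print("small-batch final loss:", train(batch_size=64))
print("large-batch final loss:", train(batch_size=4096))

This toy script only shows where the batch size enters the training loop; the paper's observation is that the large-batch runs can reach comparable training loss yet converge to sharper minimizers and generalize worse.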




Other Videos By LLMs Explained - Aggregate Intellect - AI.SCIENCE


2019-05-02 A Framework for Developing Deep Learning Classification Models
2019-05-02 Revolutionizing Diet and Health with CNN's and the Microbiome
2019-05-02 Efficient implementation of a neural network on hardware using compression techniques
2019-05-02 Supercharging AI with high performance distributed computing
2019-05-02 Combining Satellite Imagery and machine learning to predict poverty
2019-05-02 Revolutionary Deep Learning Method to Denoise EEG Brainwaves
2019-04-25 [LISA] Linguistically-Informed Self-Attention for Semantic Role Labeling | AISC
2019-04-23 How goodness metrics lead to undesired recommendations
2019-04-22 Deep Neural Networks for YouTube Recommendation | AISC Foundational
2019-04-18 [Phoenics] A Bayesian Optimizer for Chemistry | AISC Author Speaking
2019-04-18 Why do large batch sized trainings perform poorly in SGD? - Generalization Gap Explained | AISC
2019-04-16 Structured Neural Summarization | AISC Lunch & Learn
2019-04-11 Deep InfoMax: Learning deep representations by mutual information estimation and maximization | AISC
2019-04-08 ACT: Adaptive Computation Time for Recurrent Neural Networks | AISC
2019-04-04 [FFJORD] Free-form Continuous Dynamics for Scalable Reversible Generative Models (Part 1) | AISC
2019-04-01 [DOM-Q-NET] Grounded RL on Structured Language | AISC Author Speaking
2019-03-31 5-min [machine learning] paper challenge | AISC
2019-03-28 [Variational Autoencoder] Auto-Encoding Variational Bayes | AISC Foundational
2019-03-25 [GQN] Neural Scene Representation and Rendering | AISC
2019-03-21 Towards Interpretable Deep Neural Networks by Leveraging Adversarial Examples | AISC
2019-03-18 Understanding the Origins of Bias in Word Embeddings



Tags:
deep learning
machine learning
SGD
large batch training
generalization gap
stochastic gradient descent