Deep Ensembles: A Loss Landscape Perspective (Paper Explained)

Subscribers: 291,000
Published on: 2020-07-11
Video Link: https://www.youtube.com/watch?v=5IRlUVrEVL8
Duration: 46:32
Views: 20,558


#ai #research #optimization

Deep ensembles work surprisingly well for improving the generalization capabilities of deep neural networks. They even outperform Bayesian neural networks, which are, in theory, doing the same thing. This paper investigates why deep ensembles are especially suited to capturing the non-convex loss landscape of neural networks.
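
As a rough illustration of what a deep ensemble is in code (a minimal sketch, not the paper's implementation): train several copies of the same architecture from different random initializations and average their softmax outputs at test time. The model, toy data, and hyperparameters below are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model():
    # Small MLP classifier; the paper uses image classifiers, this is just a stand-in.
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

def train_one(model, x, y, steps=200, lr=1e-2):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    return model

# Toy data standing in for a real training set.
x_train, y_train = torch.randn(512, 32), torch.randint(0, 10, (512,))
x_test = torch.randn(128, 32)

# Each ensemble member starts from its own random initialization (the key
# ingredient the paper studies); no bootstrapping of the data is needed.
ensemble = [train_one(make_model(), x_train, y_train) for _ in range(5)]

with torch.no_grad():
    # Average the members' predicted probabilities, then take the argmax.
    probs = torch.stack([F.softmax(m(x_test), dim=-1) for m in ensemble]).mean(0)
    ensemble_prediction = probs.argmax(dim=-1)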

OUTLINE:
0:00 - Intro & Overview
2:05 - Deep Ensembles
4:15 - The Solution Space of Deep Networks
7:30 - Bayesian Models
9:00 - The Ensemble Effect
10:25 - Experiment Setup
11:30 - Solution Equality While Training
19:40 - Tracking Multiple Trajectories
21:20 - Similarity of Independent Solutions
24:10 - Comparison to Baselines
30:10 - Weight Space Cross-Sections
35:55 - Diversity vs Accuracy
41:00 - Comparing Ensembling Methods
44:55 - Conclusion & Comments

Paper: https://arxiv.org/abs/1912.02757

Abstract:
Deep ensembles have been empirically shown to be a promising approach for improving accuracy, uncertainty and out-of-distribution robustness of deep learning models. While deep ensembles were theoretically motivated by the bootstrap, non-bootstrap ensembles trained with just random initialization also perform well in practice, which suggests that there could be other explanations for why deep ensembles work well. Bayesian neural networks, which learn distributions over the parameters of the network, are theoretically well-motivated by Bayesian principles, but do not perform as well as deep ensembles in practice, particularly under dataset shift. One possible explanation for this gap between theory and practice is that popular scalable variational Bayesian methods tend to focus on a single mode, whereas deep ensembles tend to explore diverse modes in function space. We investigate this hypothesis by building on recent work on understanding the loss landscape of neural networks and adding our own exploration to measure the similarity of functions in the space of predictions. Our results show that random initializations explore entirely different modes, while functions along an optimization trajectory or sampled from the subspace thereof cluster within a single mode predictions-wise, while often deviating significantly in the weight space. Developing the concept of the diversity--accuracy plane, we show that the decorrelation power of random initializations is unmatched by popular subspace sampling methods. Finally, we evaluate the relative effects of ensembling, subspace based methods and ensembles of subspace based methods, and the experimental results validate our hypothesis.
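
The abstract's central comparison, solutions in weight space versus solutions in the space of predictions, can be sketched roughly as follows. This is my own minimal illustration, not the paper's code: cosine similarity of the flattened parameter vectors stands in for weight-space similarity, and the fraction of test inputs on which two models predict the same class stands in for function-space similarity (its complement is a simple diversity score). Function names are mine, not the paper's.

import torch
import torch.nn.functional as F

def flat_params(model):
    # Concatenate all parameters into one vector for weight-space comparison.
    return torch.cat([p.detach().flatten() for p in model.parameters()])

def weight_cosine(model_a, model_b):
    # Cosine similarity between the two networks' flattened weight vectors.
    return F.cosine_similarity(flat_params(model_a), flat_params(model_b), dim=0).item()

def prediction_agreement(model_a, model_b, x):
    # Fraction of inputs where the two models predict the same class;
    # 1 - agreement is a simple diversity measure in prediction space.
    with torch.no_grad():
        pred_a = model_a(x).argmax(dim=-1)
        pred_b = model_b(x).argmax(dim=-1)
    return (pred_a == pred_b).float().mean().item()

# Usage with any two trained models of the same architecture, e.g. two members
# of the ensemble sketched above:
# weight_cosine(ensemble[0], ensemble[1])
# prediction_agreement(ensemble[0], ensemble[1], x_test)

In the paper's framing, independently initialized solutions tend to disagree on many predictions (they land in different modes), while checkpoints from a single optimization trajectory or its subspace agree closely in prediction space even when their weights differ noticeably.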

Authors: Stanislav Fort, Huiyi Hu, Balaji Lakshminarayanan

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher




Other Videos By Yannic Kilcher


2020-08-04 PCGRL: Procedural Content Generation via Reinforcement Learning (Paper Explained)
2020-08-02 Big Bird: Transformers for Longer Sequences (Paper Explained)
2020-07-29 Self-training with Noisy Student improves ImageNet classification (Paper Explained)
2020-07-26 [Classic] Playing Atari with Deep Reinforcement Learning (Paper Explained)
2020-07-23 [Classic] ImageNet Classification with Deep Convolutional Neural Networks (Paper Explained)
2020-07-21 Neural Architecture Search without Training (Paper Explained)
2020-07-19 [Classic] Generative Adversarial Networks (Paper Explained)
2020-07-16 [Classic] Word2Vec: Distributed Representations of Words and Phrases and their Compositionality
2020-07-14 [Classic] Deep Residual Learning for Image Recognition (Paper Explained)
2020-07-12 I'M TAKING A BREAK... (Channel Update July 2020)
2020-07-11 Deep Ensembles: A Loss Landscape Perspective (Paper Explained)
2020-07-10 Gradient Origin Networks (Paper Explained w/ Live Coding)
2020-07-09 NVAE: A Deep Hierarchical Variational Autoencoder (Paper Explained)
2020-07-08 Addendum for Supermasks in Superposition: A Closer Look (Paper Explained)
2020-07-07 SupSup: Supermasks in Superposition (Paper Explained)
2020-07-06 [Live Machine Learning Research] Plain Self-Ensembles (I actually DISCOVER SOMETHING) - Part 1
2020-07-05 SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization (Paper Explained)
2020-07-04 Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (Paper Explained)
2020-07-03 On the Measure of Intelligence by François Chollet - Part 4: The ARC Challenge (Paper Explained)
2020-07-02 BERTology Meets Biology: Interpreting Attention in Protein Language Models (Paper Explained)
2020-07-01 GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (Paper Explained)



Tags:
deep learning
machine learning
arxiv
explained
neural networks
ai
artificial intelligence
paper
ensembles
bayesian
modes
loss function
nonconvex
google
deepmind
stan fort
foundational
weight space
labels
agreement
minima
loss landscape
trajectory
local minima
optimization