VirTex: Learning Visual Representations from Textual Annotations (Paper Explained)

Video Link: https://www.youtube.com/watch?v=ZfDZRX3WiJg
Duration: 29:42


Pre-training a CNN backbone for visual transfer learning has recently seen a big push in the direction of incorporating more data at the cost of less supervision. This paper investigates the opposite: visual transfer learning by pre-training on very few, but very high-quality, samples via an image captioning task.

OUTLINE:
0:00 - Intro & Overview
1:00 - Pre-Training for Visual Tasks
3:40 - Quality-Quantity Tradeoff
5:50 - Image Captioning
8:35 - VirTex Method
14:30 - Linear Classification
20:30 - Ablations
22:05 - Fine-Tuning
25:45 - Attention Visualization
27:30 - Conclusion & Remarks

Paper: https://arxiv.org/abs/2006.06666
Code: https://github.com/kdexd/virtex

Abstract:
The de-facto approach to many vision tasks is to start from pretrained visual representations, typically learned via supervised training on ImageNet. Recent methods have explored unsupervised pretraining to scale to vast quantities of unlabeled images. In contrast, we aim to learn high-quality visual representations from fewer images. To this end, we revisit supervised pretraining, and seek data-efficient alternatives to classification-based pretraining. We propose VirTex -- a pretraining approach using semantically dense captions to learn visual representations. We train convolutional networks from scratch on COCO Captions, and transfer them to downstream recognition tasks including image classification, object detection, and instance segmentation. On all tasks, VirTex yields features that match or exceed those learned on ImageNet -- supervised or unsupervised -- despite using up to ten times fewer images.

Authors: Karan Desai, Justin Johnson
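
For intuition, below is a minimal PyTorch sketch of the pretraining setup the abstract describes: a ResNet-50 trained from scratch feeds its spatial feature map into a transformer decoder that predicts caption tokens, and only the backbone is transferred afterwards. This is an illustrative sketch under assumed names and hyperparameters, not the official implementation (the paper's model additionally decodes each caption backwards; see the code link above for the real thing).

# Illustrative sketch of the VirTex pretraining idea, not the official code.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class VirTexSketch(nn.Module):  # hypothetical name
    def __init__(self, vocab_size=10000, d_model=512):
        super().__init__()
        cnn = resnet50()  # no pretrained weights: trained from scratch, as in the paper
        # Drop avgpool and fc to keep the 7x7 spatial feature map.
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])
        self.project = nn.Conv2d(2048, d_model, kernel_size=1)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        feats = self.project(self.backbone(images))    # (B, d_model, 7, 7)
        memory = feats.flatten(2).transpose(1, 2)      # (B, 49, d_model)
        tgt = self.embed(tokens)                       # (B, T, d_model)
        T = tokens.size(1)
        # Causal mask so each position only attends to earlier caption tokens.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.head(out)                          # next-token logits

model = VirTexSketch()
images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 10000, (2, 12))
logits = model(images, tokens)                         # (2, 12, 10000)
# Teacher forcing; real training would shift targets by one token.
loss = nn.functional.cross_entropy(logits.transpose(1, 2), tokens)

After pretraining, only model.backbone would be kept and transferred to the downstream tasks evaluated in the paper (linear classification, object detection, instance segmentation).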

Links:
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher




Other Videos By Yannic Kilcher


2020-06-22 [Drama] Yann LeCun against Twitter on Dataset Bias
2020-06-21 SIREN: Implicit Neural Representations with Periodic Activation Functions (Paper Explained)
2020-06-20 Big Self-Supervised Models are Strong Semi-Supervised Learners (Paper Explained)
2020-06-19 On the Measure of Intelligence by François Chollet - Part 2: Human Priors (Paper Explained)
2020-06-18 Image GPT: Generative Pretraining from Pixels (Paper Explained)
2020-06-17 BYOL: Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning (Paper Explained)
2020-06-16 TUNIT: Rethinking the Truly Unsupervised Image-to-Image Translation (Paper Explained)
2020-06-15 A bio-inspired bistable recurrent cell allows for long-lasting memory (Paper Explained)
2020-06-14 SynFlow: Pruning neural networks without any data by iteratively conserving synaptic flow
2020-06-13 Deep Differential System Stability - Learning advanced computations from examples (Paper Explained)
2020-06-12 VirTex: Learning Visual Representations from Textual Annotations (Paper Explained)
2020-06-11 Linformer: Self-Attention with Linear Complexity (Paper Explained)
2020-06-10 End-to-End Adversarial Text-to-Speech (Paper Explained)
2020-06-09 TransCoder: Unsupervised Translation of Programming Languages (Paper Explained)
2020-06-08 JOIN ME for the NeurIPS 2020 Flatland Multi-Agent RL Challenge!
2020-06-07 BLEURT: Learning Robust Metrics for Text Generation (Paper Explained)
2020-06-06 Synthetic Petri Dish: A Novel Surrogate Model for Rapid Architecture Search (Paper Explained)
2020-06-05 CornerNet: Detecting Objects as Paired Keypoints (Paper Explained)
2020-06-04 Movement Pruning: Adaptive Sparsity by Fine-Tuning (Paper Explained)
2020-06-03 Learning To Classify Images Without Labels (Paper Explained)
2020-06-02 On the Measure of Intelligence by François Chollet - Part 1: Foundations (Paper Explained)



Tags:
deep learning
machine learning
arxiv
explained
neural networks
ai
artificial intelligence
paper
cnn
visual
resnet
caption
nlp
transformer
vaswani
attention
text
coco
imagenet
convolutional neural network
adaptation
transfer learning
quality
unsupervised
self-supervised