OpenAI CLIP: Connecting Text and Images (Paper Explained)

Published on: 2021-01-12
Video Link: https://www.youtube.com/watch?v=T9XSU0pKX2E
Duration: 48:07


#ai #openai #technology

Paper Title: Learning Transferable Visual Models From Natural Language Supervision
CLIP trains on 400 million image-text pairs scraped from the web to learn a model that connects the two modalities. The core idea is a contrastive objective combined with a large batch size. The resulting model can be turned into arbitrary zero-shot classifiers for new image & text tasks.
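A minimal PyTorch sketch of that contrastive objective (modeled on the numpy-style pseudocode in the paper, not OpenAI's actual training code): encode a batch of N image-text pairs into a shared embedding space, score every image against every text, and apply a symmetric cross-entropy so the N matching pairs on the diagonal win. Note the paper learns the temperature; it is fixed here for brevity.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize embeddings so dot products become cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # N x N similarity matrix: entry (i, j) scores image i against text j
    logits = image_emb @ text_emb.t() / temperature

    # The matching pair for each row/column sits on the diagonal
    targets = torch.arange(len(logits), device=logits.device)

    # Symmetric cross-entropy: pick the right text for each image,
    # and the right image for each text, then average the two losses
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

# Toy batch: 8 pairs of 512-dim embeddings standing in for encoder outputs
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```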

OUTLINE:
0:00 - Introduction
3:15 - Overview
4:40 - Connecting Images & Text
9:00 - Building Zero-Shot Classifiers
14:40 - CLIP Contrastive Training Objective
22:25 - Encoder Choices
25:00 - Zero-Shot CLIP vs Linear ResNet-50
31:50 - Zero-Shot vs Few-Shot
35:35 - Scaling Properties
36:35 - Comparison on Different Tasks
37:40 - Robustness to Data Shift
44:20 - Broader Impact Section
47:00 - Conclusion & Comments

Paper: https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf
Blog: https://openai.com/blog/clip/
Code: https://github.com/openai/CLIP

Abstract:
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on.
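The zero-shot mechanism described in the abstract is easy to try with the openai/CLIP repository linked above. A usage sketch following its documented API (the class names and image path below are placeholders): turn each class label into a caption prompt, encode the prompts, and classify an image by cosine similarity to the prompt embeddings.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Build a "classifier" purely from text: one caption prompt per class label
class_names = ["dog", "cat", "airplane"]  # placeholder labels
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    # Cosine similarity between the image and each class prompt
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```

Prompt wording matters: the paper's prompt-engineering experiments show that a template like "a photo of a {label}" noticeably outperforms feeding in the bare class label.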

Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-kilcher-488534136/
BiliBili: https://space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n




Other Videos By Yannic Kilcher


2021-02-25 DeBERTa: Decoding-enhanced BERT with Disentangled Attention (Machine Learning Paper Explained)
2021-02-19 Dreamer v2: Mastering Atari with Discrete World Models (Machine Learning Research Paper Explained)
2021-02-17 TransGAN: Two Transformers Can Make One Strong GAN (Machine Learning Research Paper Explained)
2021-02-14 NFNets: High-Performance Large-Scale Image Recognition Without Normalization (ML Paper Explained)
2021-02-11 Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention (AI Paper Explained)
2021-02-04 Deep Networks Are Kernel Machines (Paper Explained)
2021-02-02 Feedback Transformers: Addressing Some Limitations of Transformers with Feedback Memory (Explained)
2021-01-29 SingularityNET - A Decentralized, Open Market and Network for AIs (Whitepaper Explained)
2021-01-22 Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
2021-01-17 STOCHASTIC MEME DESCENT - Deep Learning Meme Review - Episode 2 (Part 2 of 2)
2021-01-12 OpenAI CLIP: Connecting Text and Images (Paper Explained)
2021-01-06 OpenAI DALL·E: Creating Images from Text (Blog Post Explained)
2020-12-26 Extracting Training Data from Large Language Models (Paper Explained)
2020-12-24 MEMES IS ALL YOU NEED - Deep Learning Meme Review - Episode 2 (Part 1 of 2)
2020-12-16 ReBeL - Combining Deep Reinforcement Learning and Search for Imperfect-Information Games (Explained)
2020-12-13 2M All-In into $5 Pot! WWYD? Daniel Negreanu's No-Limit Hold'em Challenge! (Poker Hand Analysis)
2020-12-01 DeepMind's AlphaFold 2 Explained! AI Breakthrough in Protein Folding! What we know (& what we don't)
2020-11-29 Predictive Coding Approximates Backprop along Arbitrary Computation Graphs (Paper Explained)
2020-11-22 Fourier Neural Operator for Parametric Partial Differential Equations (Paper Explained)
2020-11-15 [News] Soccer AI FAILS and mixes up ball and referee's bald head.
2020-11-10 Underspecification Presents Challenges for Credibility in Modern Machine Learning (Paper Explained)



Tags:
deep learning
machine learning
arxiv
explained
neural networks
ai
artificial intelligence
paper
openai
sutskever
radford
meme
dalle
dall-e
images
vision
text
nlp
natural language processing
resnet
vision transformer
transformer
visual transformer
sota
state of the art
zero shot
zero-shot
few shot
few-shot
unsupervised
contrastive
simclr
efficientnet
noisy student
representation
embedding
latent
natural language
prompt engineering
bias
scale
distribution shift