CM3: A Causal Masked Multimodal Model of the Internet (Paper Explained w/ Author Interview)

Subscribers: 284,000
Published on: 2022-02-17
Video Link: https://www.youtube.com/watch?v=qNfCVGbvnJc
Duration: 1:24:20
Views: 12,857
Likes: 378


#cm3 #languagemodel #transformer

This video contains a paper explanation and an incredibly informative interview with first author Armen Aghajanyan.
Autoregressive Transformers have come to dominate many fields in Machine Learning, from text generation to image creation and beyond. However, there are two problems: first, the training data is usually scraped from the web in a uni- or bi-modal form that throws away much of the structure of the original websites, and second, language modelling losses are uni-directional. CM3 addresses both problems: it operates directly on HTML and includes text, hyperlinks, and even images (via VQGAN tokenization), and can therefore be used for text generation, captioning, image creation, entity linking, and much more. It also introduces a new training strategy called Causally Masked Language Modelling, which brings a level of bi-directionality into autoregressive language modelling. In the interview after the paper explanation, Armen and I go deep into the how and why of these giant models, go over the stunning results, and make sense of what they mean for the future of universal models.
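
For the curious, here is a minimal sketch in Python of what the causally masked training transformation does to a document. It is not the paper's implementation; the sentinel string and the span sampling are illustrative assumptions:

```python
import random

def causally_mask(tokens, rng=random):
    """Rearrange one document for causally masked LM training.

    A single span is cut out, replaced by a sentinel at its original
    position, and re-appended at the end of the sequence. The model still
    generates strictly left to right, but by the time it produces the
    masked span it has already seen the context on BOTH sides of the mask.
    The sentinel string "<mask:0>" is illustrative, not the paper's
    exact vocabulary.
    """
    start = rng.randrange(len(tokens))               # span start
    end = rng.randrange(start + 1, len(tokens) + 1)  # span end (exclusive)
    masked = tokens[:start] + ["<mask:0>"] + tokens[end:]
    # the cut-out span is generated last, prefixed by its sentinel
    return masked + ["<mask:0>"] + tokens[start:end]

# One possible rearrangement:
# ['the', '<mask:0>', 'jumps', '<mask:0>', 'quick', 'brown', 'fox']
print(causally_mask("the quick brown fox jumps".split()))
```

The key design choice: the decoder stays purely autoregressive, but the rearrangement lets the masked span condition on tokens from both sides of its original position.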

OUTLINE:
0:00 - Intro & Overview
6:30 - Directly learning the structure of HTML
12:30 - Causally Masked Language Modelling
18:50 - A short look at how to use this model
23:20 - Start of interview
25:30 - Feeding language models with HTML
29:45 - How to get bi-directionality into decoder-only Transformers?
37:00 - Images are just tokens
41:15 - How does one train such giant models?
45:40 - CM3 results are amazing
58:20 - Large-scale dataset collection and content filtering
1:04:40 - More experimental results
1:12:15 - Why don't we use raw HTML?
1:18:20 - Does this paper contain too many things?

Paper: https://arxiv.org/abs/2201.07520

Abstract:
We introduce CM3, a family of causally masked generative models trained over a large corpus of structured multi-modal documents that can contain both text and image tokens. Our new causally masked approach generates tokens left to right while also masking out a small number of long token spans that are generated at the end of the string, instead of at their original positions. The causal masking objective provides a hybrid of the more common causal and masked language models, enabling full generative modeling while also providing bidirectional context when generating the masked spans. We train causally masked language-image models on large-scale web and Wikipedia articles, where each document contains all of the text, hypertext markup, hyperlinks, and image tokens (from a VQVAE-GAN), provided in the order they appear in the original HTML source (before masking). The resulting CM3 models can generate rich structured multi-modal outputs while conditioning on arbitrary masked document contexts, and thereby implicitly learn a wide range of text, image, and cross-modal tasks. They can be prompted to recover, in a zero-shot fashion, the functionality of models such as DALL-E, GENRE, and HTLM. We set a new state of the art in zero-shot summarization, entity linking, and entity disambiguation while maintaining competitive performance in the fine-tuning setting. We can generate images unconditionally, conditioned on text (like DALL-E), and do captioning, all in a zero-shot setting with a single model.
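
As a sketch of how one HTML-native model can recover DALL-E-, GENRE-, and HTLM-style behaviour zero-shot, the snippets below construct task prompts by placing the mask sentinel at different positions in an HTML template. The templates are illustrative assumptions, not the paper's exact prompt formats:

```python
# Task selection happens purely through where the mask sentinel sits
# inside an HTML template. These templates are illustrative assumptions,
# not the paper's exact prompt formats.

def text_to_image_prompt(caption: str) -> str:
    # Condition on the alt text and let the model continue with image
    # tokens inside src (DALL-E-style text-to-image generation).
    return f'<img alt="{caption}" src="'

def captioning_prompt(image_tokens: str) -> str:
    # Mask out the alt text; the caption is then generated at the end,
    # after the model has already seen the image tokens.
    return f'<img alt="<mask:0>" src="{image_tokens}"> <mask:0>'

def entity_link_prompt(sentence: str, mention: str) -> str:
    # GENRE-style entity disambiguation: wrap the mention in a link
    # whose target is masked, so the model infills the entity URL.
    linked = f'<a href="<mask:0>">{mention}</a>'
    return sentence.replace(mention, linked, 1) + " <mask:0>"

print(text_to_image_prompt("A photo of a sea turtle"))
print(entity_link_prompt("Armen works at Meta.", "Meta"))
```

Because the model was trained on raw HTML, no task-specific head or fine-tuning is needed; the document structure itself carries the task specification.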

Authors: Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer


Links:
Merch: http://store.ykilcher.com
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yannic-kilcher
LinkedIn: https://www.linkedin.com/in/ykilcher
BiliBili: https://space.bilibili.com/2017636191

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannickilcher
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n




Tags:
deep learning
machine learning
arxiv
explained
neural networks
ai
artificial intelligence
paper
cm3
facebook ai
fair
meta ai
language model
language modelling
gpt-3
gpt 3
gpt3
dall-e
ru-dalle
text to image
ai image generation
ai internet
language model html
transformer html
large language models
transformer
autoregressive
causal masking
causally masked language model
bidirectional
bert
masked language modelling