SORA Deep Dive: Predict patches from text, images or video

Subscribers: 5,330
Video Link: https://www.youtube.com/watch?v=0UbARoIediY
Duration: 2:10:13
311 views

SORA by OpenAI is one of the best models I've seen for generating videos from text descriptions. Not only that, it can take a single image frame and turn it into a video, edit a video according to a text instruction, or even interpolate between two videos.

It is simply magical to see what large-scale, web-based video-text and image-text training can do to create such a model.

One of the key reasons SORA works so well is that it breaks a video down into spacetime patches and runs a diffusion process to predict each patch. As such, we can condition the diffusion process on priors such as reference images, reference videos, and instruction text.
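To make the spacetime-patch idea concrete, here is a minimal sketch of carving a video tensor into non-overlapping 3D patches and flattening each into a token, the way a Vision Transformer does in 2D but extended along the time axis. The patch sizes here are made up for illustration; SORA's actual patch dimensions are not published.

```python
import numpy as np

# Toy video: 16 frames of 32x32 RGB (frames, height, width, channels).
video = np.random.rand(16, 32, 32, 3)

# Hypothetical patch size: 4 frames x 8 x 8 pixels per spacetime patch.
pt, ph, pw = 4, 8, 8
T, H, W, C = video.shape

# Reshape into non-overlapping 3D spacetime patches, then flatten each
# patch into a single token vector.
patches = (video
           .reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
           .transpose(0, 2, 4, 1, 3, 5, 6)
           .reshape(-1, pt * ph * pw * C))

print(patches.shape)  # (64, 768): 4*4*4 = 64 tokens, each 4*8*8*3 = 768 dims
```

Each of these 768-dimensional tokens is then what the diffusion transformer denoises; in the real model the patchification happens in a compressed latent space rather than raw pixel space.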

Diffusion for image / video generation is indeed powerful, and I am keen to discuss and understand more of this.
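As a quick reminder of what that diffusion process looks like, below is a sketch of the DDPM forward (noising) process from Ho et al., 2020 (linked in the references): a closed-form sample of x_t given x_0, using a linear beta schedule. The array sizes are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule as in the DDPM paper.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_t)

def noisy_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps, eps

x0 = rng.standard_normal((8, 8))     # toy "image" (or one spacetime patch)
x_mid, _ = noisy_sample(x0, 500)     # partially noised
x_end, _ = noisy_sample(x0, T - 1)   # nearly pure Gaussian noise
```

Training then amounts to teaching a network to predict the added noise eps from x_t; generation runs the process in reverse from pure noise, optionally conditioned on text or image embeddings.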

If this video generation can be done well from just text or example images/videos, it could very well be used to create a simulator of the world, and even be used for decision making in robots.
Join me to discuss this new technology.

~~~
References:
SORA main page: https://openai.com/sora
SORA technical report: https://openai.com/research/video-generation-models-as-world-simulators
OpenAI CLIP Image and Text Embeddings: https://arxiv.org/abs/2103.00020
DALL-E: https://arxiv.org/abs/2102.12092
DALL-E 2: https://arxiv.org/abs/2204.06125
DALL-E 3: https://cdn.openai.com/papers/dall-e-3.pdf
Stable Diffusion: https://arxiv.org/abs/2112.10752
Stable Diffusion XL - Making Stable Diffusion more high res: https://arxiv.org/abs/2307.01952
Stable Diffusion 3: https://arxiv.org/pdf/2403.03206.pdf
ControlNet - adding more conditions to Stable Diffusion: https://arxiv.org/abs/2302.05543
I-JEPA (META): https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/
V-JEPA (META): https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/
Make-a-video (META): https://ai.meta.com/blog/generative-ai-text-to-video/
Imagen (Google): https://arxiv.org/abs/2205.11487
Blog comparing between DALL-E, Stable Diffusion, Imagen: https://tryolabs.com/blog/2022/08/31/from-dalle-to-stable-diffusion
Denoising Diffusion Probabilistic Models (DDPM) - Diffusion in Pixel Space: https://arxiv.org/abs/2006.11239

Paper attempting to reverse-engineer SORA (I only agree with about 20% of it): https://arxiv.org/abs/2402.17177
Vision Transformer: https://arxiv.org/abs/2010.11929
Good blog post about Vision Transformer: https://towardsdatascience.com/vision-transformers-explained-a9d07147e4c8
Diffusion Transformer: https://arxiv.org/abs/2212.09748

Recaptioning images with position-based information can already make a text-image encoder learn spatial relations: https://spright-t2i.github.io/

~~~

0:00 Introduction
1:40 Example Video
3:44 Limitations of SORA
8:02 SORA Overview
20:18 Transformer next-token prediction vs SORA patch prediction
29:20 Vision Transformer (ViT)
42:12 SORA splits multiple images into 3D patches
50:05 Step 1: Video Compression
55:50 Step 2: Compressed Video to Spacetime Patches
1:00:11 Spacetime Patches
1:04:22 Diffusion Transformer
1:17:05 CLIP Embeddings for Image and Text
1:18:05 Stable Diffusion (Image)
1:22:46 Stable Diffusion (Video)
1:28:18 Putting it all together
1:34:57 Diffusion Generates Better Videos
1:36:36 ControlNet
1:14:13 Other work on video latent spaces
1:43:57 Discussion
2:09:31 Conclusion

~~~

AI and ML enthusiast. I like to think about the essence behind breakthroughs in AI and explain them in a simple and relatable way. I am also an avid game creator.

Discord: https://discord.gg/bzp87AHJy5
LinkedIn: https://www.linkedin.com/in/chong-min-tan-94652288/
Online AI blog: https://delvingintotech.wordpress.com/
Twitter: https://twitter.com/johntanchongmin
Try out my games here: https://simmer.io/@chongmin




Other Videos By John Tan Chong Min


2024-07-17 Intelligence = Sampling + Filtering
2024-07-12 Michael Hodel: Reverse Engineering the Abstraction and Reasoning Corpus
2024-07-02 TaskGen Conversational Class v2: JARVIS, Psychology Counsellor, Sherlock Holmes Shop Assistant
2024-06-04 CodeAct: Code As Action Space of LLM Agents - Pros and Cons
2024-05-28 TaskGen Conversation with Dynamic Memory - Math Quizbot, Escape Room Solver, Psychology Counsellor
2024-05-21 Integrate ANY Python Function, CodeGen, CrewAI tool, LangChain tool with TaskGen! - v2.3.0
2024-05-11 Empirical - Open Source LLM Evaluation UI
2024-05-07 TaskGen Ask Me Anything #1
2024-04-29 StrictJSON (LLM Output Parser) Ask Me Anything #1
2024-04-22 Tutorial #14: Write latex papers with LLMs such as Llama 3!
2024-04-16 SORA Deep Dive: Predict patches from text, images or video
2024-04-09 OpenAI CLIP Embeddings: Walkthrough + Insights
2024-03-26 TaskGen - LLM Agentic Framework that Does More, Talks Less: Shared Variables, Memory, Global Context
2024-03-18 CRADLE (Part 2): An AI that can play Red Dead Redemption 2. Reflection, Memory, Task-based Planning
2024-03-11 CRADLE (Part 1) - AI that plays Red Dead Redemption 2. Towards General Computer Control and AGI
2024-03-05 TaskGen - A Task-based Agentic Framework using StrictJSON at the core
2024-02-27 SymbolicAI / ExtensityAI Paper Overview (Part 2) - Evaluation Benchmark Discussion!
2024-02-20 SymbolicAI / ExtensityAI Paper Overview (Part 1) - Key Philosophy Behind the Design - Symbols
2024-02-13 Embeddings Walkthrough (Part 2): Context-Dependent Embeddings, Shifting Embedding Space
2024-02-06 Embeddings Walkthrough (Part 1) - Bag of Words to word2vec to Transformer contextual embeddings
2024-01-29 V* - Better than GPT-4V? Iterative Context Refining for Visual Question Answer!