SORA Deep Dive: Predict patches from text, images or video

Subscribers: 5,330
Video Link: https://www.youtube.com/watch?v=0UbARoIediY
Duration: 2:10:13
311 views

SORA by OpenAI is one of the best models I've seen for generating videos from text descriptions. Not only that, it can take a single image frame and turn it into a video, edit a video according to a text instruction, or even interpolate between two videos.

It is simply magical to see what large-scale, web-based video-text and image-text training can do to create such a model.

One of the key reasons SORA works so well is that it breaks a video down into spacetime patches and runs a diffusion process to predict each patch. As such, we can condition the diffusion process on priors such as reference images, reference videos, and instruction text.
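To make the spacetime-patch idea concrete, here is a minimal sketch of carving a video tensor into non-overlapping 3D patches and flattening each into a token, the way a Vision Transformer does in 2D but extended along the time axis. The patch sizes here are made up for illustration; SORA's actual patch dimensions are not published.

```python
import numpy as np

# Toy video: 16 frames of 32x32 RGB (frames, height, width, channels).
video = np.random.rand(16, 32, 32, 3)

# Hypothetical patch size: 4 frames x 8 x 8 pixels per spacetime patch.
pt, ph, pw = 4, 8, 8
T, H, W, C = video.shape

# Reshape into non-overlapping 3D spacetime patches, then flatten each
# patch into a single token vector.
patches = (video
           .reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
           .transpose(0, 2, 4, 1, 3, 5, 6)
           .reshape(-1, pt * ph * pw * C))

print(patches.shape)  # (64, 768): 4*4*4 = 64 tokens, each 4*8*8*3 = 768 dims
```

Each of these 768-dimensional tokens is then what the diffusion transformer denoises; in the real model the patchification happens in a compressed latent space rather than raw pixel space.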

Diffusion for image / video generation is indeed powerful, and I am keen to discuss and understand more of this.
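As a quick reminder of what that diffusion process looks like, below is a sketch of the DDPM forward (noising) process from Ho et al., 2020 (linked in the references): a closed-form sample of x_t given x_0, using a linear beta schedule. The array sizes are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule as in the DDPM paper.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_t)

def noisy_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps, eps

x0 = rng.standard_normal((8, 8))     # toy "image" (or one spacetime patch)
x_mid, _ = noisy_sample(x0, 500)     # partially noised
x_end, _ = noisy_sample(x0, T - 1)   # nearly pure Gaussian noise
```

Training then amounts to teaching a network to predict the added noise eps from x_t; generation runs the process in reverse from pure noise, optionally conditioned on text or image embeddings.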

If this video generation can be done well from just text or example images/videos, it could very well be used to create a simulator of the world, and even be used for decision making in robots.
Join me to discuss this new technology.

~~~
References:
SORA main page: https://openai.com/sora
SORA technical report: https://openai.com/research/video-generation-models-as-world-simulators
OpenAI CLIP Image and Text Embeddings: https://arxiv.org/abs/2103.00020
DALL-E: https://arxiv.org/abs/2102.12092
DALL-E 2: https://arxiv.org/abs/2204.06125
DALL-E 3: https://cdn.openai.com/papers/dall-e-3.pdf
Stable Diffusion: https://arxiv.org/abs/2112.10752
Stable Diffusion XL - Making Stable Diffusion more high res: https://arxiv.org/abs/2307.01952
Stable Diffusion 3: https://arxiv.org/pdf/2403.03206.pdf
ControlNet - adding more conditions to Stable Diffusion: https://arxiv.org/abs/2302.05543
I-JEPA (META): https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/
V-JEPA (META): https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/
Make-a-video (META): https://ai.meta.com/blog/generative-ai-text-to-video/
Imagen (Google): https://arxiv.org/abs/2205.11487
Blog comparing between DALL-E, Stable Diffusion, Imagen: https://tryolabs.com/blog/2022/08/31/from-dalle-to-stable-diffusion
Denoising Diffusion Probabilistic Models (DDPM) - Diffusion in Pixel Space: https://arxiv.org/abs/2006.11239

Paper attempting to reverse-engineer SORA (I only agree with about 20% of it): https://arxiv.org/abs/2402.17177
Vision Transformer: https://arxiv.org/abs/2010.11929
Good blog post about Vision Transformer: https://towardsdatascience.com/vision-transformers-explained-a9d07147e4c8
Diffusion Transformer: https://arxiv.org/abs/2212.09748

Recaptioning images with position-based information can already make a text-image encoder learn spatial relations: https://spright-t2i.github.io/

~~~

0:00 Introduction
1:40 Example Video
3:44 Limitations of SORA
8:02 SORA Overview
20:18 Transformer next-token prediction vs SORA patch prediction
29:20 Vision Transformer (ViT)
42:12 SORA splits multiple images into 3D patches
50:05 Step 1: Video Compression
55:50 Step 2: Compressed Video to Spacetime Patches
1:00:11 Spacetime Patches
1:04:22 Diffusion Transformer
1:17:05 CLIP Embeddings for Image and Text
1:18:05 Stable Diffusion (Image)
1:22:46 Stable Diffusion (Video)
1:28:18 Putting it all together
1:34:57 Diffusion Generates Better Videos
1:36:36 ControlNet
1:14:13 Other work on video latent spaces
1:43:57 Discussion
2:09:31 Conclusion

~~~

AI and ML enthusiast. I like to think about the essence behind breakthroughs in AI and explain them in a simple and relatable way. I am also an avid game creator.

Discord: https://discord.gg/bzp87AHJy5
LinkedIn: https://www.linkedin.com/in/chong-min-tan-94652288/
Online AI blog: https://delvingintotech.wordpress.com/
Twitter: https://twitter.com/johntanchongmin
Try out my games here: https://simmer.io/@chongmin




Other Videos By John Tan Chong Min


2024-07-17 Intelligence = Sampling + Filtering
2024-07-12 Michael Hodel: Reverse Engineering the Abstraction and Reasoning Corpus
2024-07-02 TaskGen Conversational Class v2: JARVIS, Psychology Counsellor, Sherlock Holmes Shop Assistant
2024-06-04 CodeAct: Code As Action Space of LLM Agents - Pros and Cons
2024-05-28 TaskGen Conversation with Dynamic Memory - Math Quizbot, Escape Room Solver, Psychology Counsellor
2024-05-21 Integrate ANY Python Function, CodeGen, CrewAI tool, LangChain tool with TaskGen! - v2.3.0
2024-05-11 Empirical - Open Source LLM Evaluation UI
2024-05-07 TaskGen Ask Me Anything #1
2024-04-29 StrictJSON (LLM Output Parser) Ask Me Anything #1
2024-04-22 Tutorial #14: Write latex papers with LLMs such as Llama 3!
2024-04-16 SORA Deep Dive: Predict patches from text, images or video
2024-04-09 OpenAI CLIP Embeddings: Walkthrough + Insights
2024-03-26 TaskGen - LLM Agentic Framework that Does More, Talks Less: Shared Variables, Memory, Global Context
2024-03-18 CRADLE (Part 2): An AI that can play Red Dead Redemption 2. Reflection, Memory, Task-based Planning
2024-03-11 CRADLE (Part 1) - AI that plays Red Dead Redemption 2. Towards General Computer Control and AGI
2024-03-05 TaskGen - A Task-based Agentic Framework using StrictJSON at the core
2024-02-27 SymbolicAI / ExtensityAI Paper Overview (Part 2) - Evaluation Benchmark Discussion!
2024-02-20 SymbolicAI / ExtensityAI Paper Overview (Part 1) - Key Philosophy Behind the Design - Symbols
2024-02-13 Embeddings Walkthrough (Part 2): Context-Dependent Embeddings, Shifting Embedding Space
2024-02-06 Embeddings Walkthrough (Part 1) - Bag of Words to word2vec to Transformer contextual embeddings
2024-01-29 V* - Better than GPT-4V? Iterative Context Refining for Visual Question Answer!