SORA Deep Dive: Predicting patches from text, images, or video
SORA by OpenAI is one of the best models I've seen for generating videos from text descriptions. Not only that, it can animate a single image frame into a video, edit a video according to a text instruction, or even interpolate between two videos.
It is simply magical to see what large-scale training on web-based video-text and image-text data can do to create a model like this.
One of the key reasons SORA works so well is that it breaks a video down into spacetime patches and runs a diffusion process to predict each patch. As such, the diffusion process can be conditioned on priors such as reference images, reference videos, or text instructions.
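To make the patch idea concrete, here is a minimal sketch (my own illustration, not OpenAI's code) of chopping a video tensor into non-overlapping 3D spacetime patches, each flattened into one transformer token. All shapes and patch sizes below are assumptions; per the technical report, SORA patchifies a compressed latent video rather than raw pixels.

import numpy as np

# Assumed toy dimensions: 16 frames of 256x256 RGB video.
T, H, W, C = 16, 256, 256, 3
pt, ph, pw = 4, 32, 32  # assumed patch size along time, height, width

video = np.random.rand(T, H, W, C)

# Carve the video into non-overlapping 3D blocks, then flatten each block.
patches = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)  # (nt, nh, nw, pt, ph, pw, C)
tokens = patches.reshape(-1, pt * ph * pw * C)    # one flat token per spacetime patch

print(tokens.shape)  # (256, 12288): 4*8*8 tokens, each of length 4*32*32*3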
Diffusion for image/video generation is indeed powerful, and I am keen to discuss and understand it further.
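As a taste of the underlying machinery, here is a toy DDPM-style sampling loop (see the DDPM paper in the references) where the noise predictor is conditioned on some embedding, e.g. from text. The model, noise schedule, and shapes are all placeholder assumptions, not SORA's actual implementation.

import numpy as np

def sample(model, shape, cond, betas, rng):
    # DDPM ancestral sampling, heavily simplified.
    alphas = 1.0 - betas
    a_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps_hat = model(x, t, cond)  # predicted noise, conditioned on cond
        # Posterior mean: remove a scaled estimate of the predicted noise.
        x = (x - betas[t] / np.sqrt(1.0 - a_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)  # sampling noise
    return x

# Toy usage with a dummy "model" that predicts zero noise everywhere.
rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)
out = sample(lambda x, t, cond: np.zeros_like(x), (256, 128), None, betas, rng)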
If this video generation can be done well from just text or example images/videos, it could very well be used to create a simulator of the world, and even for decision making in robots.
Join me to discuss more about this new technology.
~~~
References:
SORA main page: https://openai.com/sora
SORA technical report: https://openai.com/research/video-generation-models-as-world-simulators
OpenAI CLIP Image and Text Embeddings: https://arxiv.org/abs/2103.00020
DALL-E: https://arxiv.org/abs/2102.12092
DALL-E 2: https://arxiv.org/abs/2204.06125
DALL-E 3: https://cdn.openai.com/papers/dall-e-3.pdf
Stable Diffusion: https://arxiv.org/abs/2112.10752
Stable Diffusion XL - making Stable Diffusion higher resolution: https://arxiv.org/abs/2307.01952
Stable Diffusion 3: https://arxiv.org/abs/2403.03206
ControlNet - adding more conditions to Stable Diffusion: https://arxiv.org/abs/2302.05543
I-JEPA (META): https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/
V-JEPA (META): https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/
Make-a-video (META): https://ai.meta.com/blog/generative-ai-text-to-video/
Imagen (Google): https://arxiv.org/abs/2205.11487
Blog comparing between DALL-E, Stable Diffusion, Imagen: https://tryolabs.com/blog/2022/08/31/from-dalle-to-stable-diffusion
Denoising Diffusion Probabilistic Models (DDPM) - Diffusion in Pixel Space: https://arxiv.org/abs/2006.11239
Paper attempting to reverse-engineer SORA (I only agree with 20% of the paper): https://arxiv.org/abs/2402.17177
Vision Transformer: https://arxiv.org/abs/2010.11929
Good blog post about Vision Transformer: https://towardsdatascience.com/vision-transformers-explained-a9d07147e4c8
Diffusion Transformer: https://arxiv.org/abs/2212.09748
Recaptioning images with position-based information can already make a text-image encoder learn spatial relations: https://spright-t2i.github.io/
~~~
0:00 Introduction
1:40 Example Video
3:44 Limitations of SORA
8:02 SORA Overview
20:18 Transformer next-token prediction vs SORA patch prediction
29:20 Vision Transformer (ViT)
42:12 SORA splits multiple images into 3D patches
50:05 Step 1: Video Compression
55:50 Step 2: Compressed Video to Spacetime Patches
1:00:11 Spacetime Patches
1:04:22 Diffusion Transformer
1:14:13 Other work on video latent spaces
1:17:05 CLIP Embeddings for Image and Text
1:18:05 Stable Diffusion (Image)
1:22:46 Stable Diffusion (Video)
1:28:18 Putting it all together
1:34:57 Diffusion Generates Better Videos
1:36:36 ControlNet
1:43:57 Discussion
2:09:31 Conclusion
~~~
AI and ML enthusiast. I like to think about the essence behind AI breakthroughs and explain them in a simple and relatable way. I am also an avid game creator.
Discord: https://discord.gg/bzp87AHJy5
LinkedIn: https://www.linkedin.com/in/chong-min-tan-94652288/
Online AI blog: https://delvingintotech.wordpress.com/
Twitter: https://twitter.com/johntanchongmin
Try out my games here: https://simmer.io/@chongmin