I-JEPA: Importance of Predicting in Latent Space
I-JEPA is the first implementation of Yann LeCun's Joint Embedding Predictive Architecture (JEPA). I am a huge fan of LeCun, and many of my own thoughts on AI have been shaped by his views. However, I do not agree with using a Vision Transformer (ViT) as the encoder, as it discards much of the semantic information in the spatial structure of images. Furthermore, it takes a long time to train, as it lacks the inductive biases relevant for images (such as translation invariance).
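To make the core idea of predicting in latent space concrete, here is a minimal sketch of I-JEPA's objective in PyTorch. The plain transformer blocks, dimensions, and masking pattern are stand-ins for the paper's ViT encoders, EMA target encoder, and block-wise masking; only the overall shape of the objective (predicting target representations rather than pixels) is meant to be faithful.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim, num_patches = 128, 64  # illustrative sizes only

def make_encoder(layers):
    return nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
        num_layers=layers,
    )

context_encoder = make_encoder(2)
target_encoder = copy.deepcopy(context_encoder)  # in I-JEPA this is an EMA copy, not trained by gradients
predictor = make_encoder(1)

patches = torch.randn(1, num_patches, embed_dim)  # stand-in for one image's patch embeddings
context_idx = torch.arange(0, 48)                 # visible (context) patches
target_idx = torch.arange(48, 64)                 # masked (target) patches

with torch.no_grad():
    targets = target_encoder(patches)[:, target_idx]  # targets live in latent space, not pixel space

context = context_encoder(patches[:, context_idx])
mask_tokens = torch.zeros(1, len(target_idx), embed_dim)  # positional information omitted for brevity
preds = predictor(torch.cat([context, mask_tokens], dim=1))[:, -len(target_idx):]

loss = F.mse_loss(preds, targets)  # loss on representations; no pixel reconstruction
```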
While I-JEPA achieves quite amazing downstream task performance, such as on ImageNet Top-1 classification, it could perhaps do even better if the masked objective were applied to a CNN-like architecture instead, perhaps with self-attention layers over the post-filter (convolutional feature map) outputs.
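As a rough illustration of that alternative, here is a hypothetical hybrid encoder: a convolutional stem supplies the translational inductive bias, and self-attention then operates over the resulting feature map (the "post-filter" outputs). The class name, layer sizes, and shapes are all my own illustrative choices, not from the paper.

```python
import torch
import torch.nn as nn

class ConvAttnEncoder(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.stem = nn.Sequential(                    # CNN front end: local filters, weight sharing
            nn.Conv2d(3, dim // 2, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim // 2, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )
        self.attn = nn.TransformerEncoder(            # self-attention over the conv feature map
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=2,
        )

    def forward(self, x):
        f = self.stem(x)                              # (B, dim, H/4, W/4)
        tokens = f.flatten(2).transpose(1, 2)         # (B, H*W/16, dim) token sequence
        return self.attn(tokens)                      # latent representation per spatial location

encoder = ConvAttnEncoder()
z = encoder(torch.randn(2, 3, 64, 64))                # -> (2, 256, 128)
```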
We could also explore Stable-Diffusion-like conditioning, whereby the predictor module is conditioned on some text input when predicting the latent space. Broad-to-specific conditioning, and using a memory of similar latent spaces, are also worth exploring. Ultimately, I believe a better bet could be a hierarchical architecture that goes from broad to specific, with each layer of abstraction conditioned on the broader layer above it, and finally attention across all the generated layers of abstraction (or latent spaces) used for prediction.
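A speculative sketch of the conditioning idea: a cross-attention predictor (here PyTorch's TransformerDecoder) attends from the context latents to a set of conditioning tokens, loosely in the spirit of Stable Diffusion's text conditioning. All names, shapes, and the source of the conditioning tokens are hypothetical and not part of I-JEPA.

```python
import torch
import torch.nn as nn

dim = 128
conditioned_predictor = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)

context_latents = torch.randn(1, 48, dim)   # latents from the context encoder
conditioning_tokens = torch.randn(1, 8, dim)  # hypothetical text (or broader-level) embeddings

# Self-attention over the context latents, cross-attention to the conditioning tokens
predicted_latents = conditioned_predictor(tgt=context_latents, memory=conditioning_tokens)
```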
That said, I-JEPA is a promising first step, and I am excited to see what comes next.
~~~~~~~~~~~~~~~~~~
Slides: https://github.com/tanchongmin/TensorFlow-Implementations/blob/main/Paper_Reviews/I-JEPA.pdf
Reference Materials:
I-JEPA: https://arxiv.org/abs/2301.08243
Vision Transformers: https://arxiv.org/abs/2010.11929
Swin Transformers (Transformers with hierarchy and shifting attention windows): https://arxiv.org/abs/2103.14030
MLP-Mixer (all-MLP architecture for image processing): https://arxiv.org/abs/2105.01601
Conv-Mixer (Patches with Conv layers): https://arxiv.org/abs/2201.09792
Stable Diffusion: https://arxiv.org/abs/2112.10752
~~~~~~~~~~~~~~~~~~
0:00 Introduction
5:54 Transformers: Prediction back in input space
11:12 Prediction in Latent Space
22:25 Stable Diffusion and Latent Space
29:17 Vision Transformer (ViT)
44:57 Swin Transformer
50:12 ViT’s positional encoding may not be good!
51:38 I-JEPA
1:09:26 Discussion on how to improve I-JEPA
~~~~~~~~~~~~~~~~~~
AI and ML enthusiast. I like to think about the essence behind AI breakthroughs and explain them in a simple, relatable way. I am also an avid game creator.
Discord: https://discord.gg/bzp87AHJy5
LinkedIn: https://www.linkedin.com/in/chong-min-tan-94652288/
Online AI blog: https://delvingintotech.wordpress.com/
Twitter: https://twitter.com/johntanchongmin
Try out my games here: https://simmer.io/@chongmin