DINOv3: One backbone, multiple image/video tasks
Can we learn visual representations with self-supervised learning and then transfer them to a suite of vision tasks such as object detection, semantic segmentation, video segmentation, 3D understanding, object classification and many more?
Turns out we can, as DINO has shown: a teacher-student method lets the student learn to match the teacher's projected output vector, while the teacher is updated as an exponential moving average of the student's weights.
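For readers who want to see the teacher-student idea in code, here is a minimal sketch in plain Python. It is not the DINO implementation: the function names, the temperature values and the momentum value are illustrative assumptions, the teacher centering step used in DINO is left out, and real training would use tensors and a neural network rather than lists.

```python
import math

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_total for x in scaled]

def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student output distributions.

    The temperatures are assumed, illustrative values. The lower teacher
    temperature sharpens the target distribution.
    """
    p_teacher = [math.exp(v) for v in log_softmax(teacher_logits, teacher_temp)]
    log_p_student = log_softmax(student_logits, student_temp)
    return -sum(pt * ls for pt, ls in zip(p_teacher, log_p_student))

def ema_update(teacher_weights, student_weights, momentum=0.996):
    """Move each teacher weight a small step towards the student weight."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]

# The loss is small when the student agrees with the teacher, large otherwise.
print(distillation_loss([2.0, 0.0], [2.0, 0.0]))
print(distillation_loss([0.0, 2.0], [2.0, 0.0]))

# The teacher drifts slowly towards the student (momentum lowered for the demo).
teacher = [0.0, 0.0]
student = [1.0, -1.0]
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)
print(teacher)
```

Only the student receives gradients from the loss. The teacher changes only through the moving average, which keeps its targets stable from step to step.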
To further improve performance, DINOv3 adds a Gram anchoring loss, which aligns the pairwise patch similarities (the Gram matrix) of the student with those of an earlier teacher checkpoint, so that the patch-level semantic features learned in the early part of the training run are better preserved.
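The Gram anchoring idea can be sketched in a few lines of plain Python. This is a simplified illustration and not the paper's code: it assumes each image is a list of non-zero patch feature vectors, uses L2-normalised features, and measures the squared difference between the student's Gram matrix and the Gram matrix of a reference network (called the Gram teacher here). All function and variable names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def gram_matrix(patch_features):
    """Pairwise cosine similarities between all patch feature vectors."""
    normed = [l2_normalize(p) for p in patch_features]
    return [[sum(a * b for a, b in zip(p, q)) for q in normed] for p in normed]

def gram_anchoring_loss(student_patches, gram_teacher_patches):
    """Squared Frobenius distance between the two Gram matrices."""
    g_student = gram_matrix(student_patches)
    g_teacher = gram_matrix(gram_teacher_patches)
    return sum((a - b) ** 2
               for row_s, row_t in zip(g_student, g_teacher)
               for a, b in zip(row_s, row_t))

# Two patches with two-dimensional features (toy values).
gram_teacher = [[1.0, 0.0], [0.0, 1.0]]
rotated = [[0.0, 1.0], [-1.0, 0.0]]    # different features, same similarities
collapsed = [[1.0, 0.0], [1.0, 0.0]]   # both patches became identical

print(gram_anchoring_loss(rotated, gram_teacher))    # 0.0
print(gram_anchoring_loss(collapsed, gram_teacher))  # 2.0
```

The rotated student is not penalised, because the loss constrains only how patches relate to each other and not the feature values themselves. The collapsed student is penalised, because its patches have lost the similarity structure held by the reference.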
~~~
Slides: https://github.com/tanchongmin/john-youtube/blob/main/Discussion_Sessions/DINOv3.pdf
DINOv3 Paper: https://arxiv.org/pdf/2508.10104
Related Papers:
DINOv2 (with patch occlusion iBOT loss): https://arxiv.org/pdf/2304.07193
DINO (with global-local teacher-student loss): https://arxiv.org/pdf/2104.14294
Vision Transformers (backbone of DINO): https://arxiv.org/pdf/2010.11929
Neural Style Transfer (using Gram Matrix to capture the style of an image): https://arxiv.org/pdf/1705.04058
AppAgent (using Vision Model and XML for positional information on smartphones): https://arxiv.org/pdf/2312.13771
BERT (single backbone, multiple downstream tasks for text-based modality; similar to DINO for image-based modality): https://arxiv.org/pdf/1810.04805
~~~
0:00 Introduction
3:04 Rise of Self-Supervised Learning
25:33 Exponential Moving Average
30:11 Vision Transformer (ViT)
41:47 Key Takeaway: Single backbone, multiple downstream tasks
42:11 DINO: Teacher-student learning
46:11 Gram Anchoring in DINOv3
47:57 Data Curation is very important
51:33 DINOv3 has high-res feature map representation
53:14 Gram Anchoring Loss
1:12:12 Gram loss repairs semantic feature drift in patches
1:14:10 Higher Input Resolution is Better
1:19:38 DINOv3 is SOTA in many downstream tasks
1:22:31 Loss Functions in DINOv3
1:29:32 Discussion
1:46:55 Conclusion
~~~
AI and ML enthusiast. I like to think about the essence behind AI breakthroughs and explain it in a simple and relatable way. I am also an avid game creator.
Discord: https://discord.gg/bzp87AHJy5
LinkedIn: https://www.linkedin.com/in/chong-min-tan-94652288/
Online AI blog: https://delvingintotech.wordpress.com/
Twitter: https://twitter.com/johntanchongmin
Try out my games here: https://simmer.io/@chongmin