DINOv3: One backbone, multiple image/video tasks

Published on 2025-09-08
Video Link: https://www.youtube.com/watch?v=Ou0pfOZUaJU





Can we learn visual representations with self-supervised learning and then transfer them to a suite of vision tasks such as object detection, semantic segmentation, video segmentation, 3D understanding, object classification, and more?

It turns out we can, as DINO has shown: a student network learns to match the teacher's projected outputs, while the teacher's weights are updated as an exponential moving average (EMA) of the student's weights.
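The teacher update above can be sketched as a simple exponential moving average over the weights. A minimal sketch (function names are mine, and the fixed momentum of 0.996 is illustrative; DINO actually follows a momentum schedule):

```python
import numpy as np

def ema_update(teacher, student, momentum=0.996):
    """Teacher <- m * teacher + (1 - m) * student, applied per parameter."""
    return momentum * teacher + (1.0 - momentum) * student

# Toy "weights": the teacher slowly tracks the student.
teacher = np.zeros(4)
student = np.ones(4)
for step in range(1000):
    # In real training the student is updated by gradient descent here;
    # only the teacher uses the EMA update, and no gradients flow into it.
    teacher = ema_update(teacher, student)

print(teacher)  # close to the student's weights after many steps
```

Because the teacher averages over many past versions of the student, its targets change slowly, which stabilises the self-distillation and avoids collapse.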

To further improve dense features, DINOv3 adds a Gram anchoring loss that aligns the Gram matrix of the student's patch features with that of an earlier checkpoint of the model, preserving the patch-level semantic consistency that would otherwise drift away over a long training run.
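A minimal sketch of a Gram-style anchoring loss (the function names and the exact normalisation are my assumptions; in the paper the anchor features come from an earlier "Gram teacher" checkpoint):

```python
import numpy as np

def gram(patch_feats):
    """Pairwise cosine similarities between patch features (N x D -> N x N)."""
    f = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    return f @ f.T

def gram_anchoring_loss(student_feats, anchor_feats):
    """Mean squared difference between the two Gram matrices."""
    return np.mean((gram(student_feats) - gram(anchor_feats)) ** 2)

rng = np.random.default_rng(0)
anchor = rng.normal(size=(16, 8))                   # patch features from an earlier checkpoint
student = anchor + 0.1 * rng.normal(size=(16, 8))   # slightly drifted student features

print(gram_anchoring_loss(student, anchor))  # small but non-zero
print(gram_anchoring_loss(anchor, anchor))   # exactly 0.0
```

Matching Gram matrices rather than raw features constrains only the pairwise similarity structure between patches, so the student keeps its patch-level consistency while its individual features remain free to keep improving.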

~~~

Slides: https://github.com/tanchongmin/john-youtube/blob/main/Discussion_Sessions/DINOv3.pdf
DINOv3 Paper: https://arxiv.org/pdf/2508.10104

Related Papers:
DINOv2 (with patch occlusion iBOT loss): https://arxiv.org/pdf/2304.07193
DINO (with global-local teacher-student loss): https://arxiv.org/pdf/2104.14294
Vision Transformers (backbone of DINO): https://arxiv.org/pdf/2010.11929
Neural Style Transfer (using Gram Matrix to capture the style of an image): https://arxiv.org/pdf/1705.04058
AppAgent (using Vision Model and XML for positional information on smartphones): https://arxiv.org/pdf/2312.13771
BERT (single backbone, multiple downstream tasks for text-based modality; similar to DINO for image-based modality): https://arxiv.org/pdf/1810.04805

~~~

0:00 Introduction
3:04 Rise of Self-Supervised Learning
25:33 Exponential Moving Average
30:11 Vision Transformer (ViT)
41:47 Key Takeaway: Single backbone, multiple downstream tasks
42:11 DINO: Teacher-student learning
46:11 Gram Anchoring in DINOv3
47:57 Data Curation is very important
51:33 DINOv3 has high-res feature map representation
53:14 Gram Anchoring Loss
1:12:12 Gram loss repairs damage of semantic features drift for patches
1:14:10 Higher Input Resolution is Better
1:19:38 DINOv3 is SOTA in many downstream tasks
1:22:31 Loss Functions in DINOv3
1:29:32 Discussion
1:46:55 Conclusion

~~~

AI and ML enthusiast. Likes to think about the essence behind AI breakthroughs and explain them in a simple and relatable way. Also an avid game creator.

Discord: https://discord.gg/bzp87AHJy5
LinkedIn: https://www.linkedin.com/in/chong-min-tan-94652288/
Online AI blog: https://delvingintotech.wordpress.com/
Twitter: https://twitter.com/johntanchongmin
Try out my games here: https://simmer.io/@chongmin




Other Videos By John Tan Chong Min


2025-09-08 DINOv3: One backbone, multiple image/video tasks
2025-08-18 R-Zero: Self-Evolving Reasoning LLM from Zero Data
2025-08-11 Reasoning without Language (Part 2) - Deep Dive into 27 mil parameter Hierarchical Reasoning Model
2025-08-04 Reasoning without Language - Deep Dive into 27 mil parameter Hierarchical Reasoning Model
2025-07-28 No need for symbolic programs for Math? Natural language approach to IMO
2025-07-21 How many instructions can LLMs follow at once?
2025-07-15 Arjo Chakravarty: Indoor Localisation with Visual Language Models (VLMs)
2025-07-14 MemOS: A Paradigm Shift to Memory as a First Class Citizen for LLMs
2025-07-07 Multimodal Query for Images: Text/Image Multimodal Query with Negative Filter and Folder Selection
2025-06-30 Universal Filter (Part 4 - Finale): Knowledge/Memory, Reflection, Communication between Individuals
2025-06-23 Universal Filter (Part 3): Learning the Filters, Universal Database, Individual Knowledge Base
2025-06-16 Universal Filter (Part 2): Time, Akashic Records, Individual Mind-based, Body-based memory
2025-06-04 Good Vibes Only with Dylan Chia: Lyria (Music), Veo3 (Video), Gamma (Slides), GitHub Copilot (Code)
2025-03-10 Memory Meets Psychology - Claude Plays Pokemon: How It works, How to improve it
2025-02-24 Vibe Coding: How to use LLM prompts to code effectively!
2025-01-26 PhD Thesis Overview (Part 2): LLMs for ARC-AGI, Task-Based Memory-Infused Learning, Plan for AgentJo
2025-01-20 PhD Thesis Overview (Part 1): Reward is not enough; Towards Goal-Directed, Memory-based Learning
2024-12-04 AgentJo CV Generator: Generate your CV by searching for your profile on the web!
2024-11-11 Can LLMs be used in self-driving? CoMAL: Collaborative Multi-Agent LLM for Mixed Autonomy Traffic
2024-10-28 From TaskGen to AgentJo: Creating My Life Dream of Fast Learning and Adaptable Agents
2024-10-21 Tian Yu X John: Discussing Practical Gen AI Tips for Image Prompting