DINOv3: One backbone, multiple image/video tasks

Channel:

John Tan Chong Min

Subscribers:

6,300

Published on September 8, 2025 7:38:09 AM ● Video Link: https://www.youtube.com/watch?v=Ou0pfOZUaJU

364 views

Can we learn visual features using self-supervised learning, and then transfer them to a suite of vision tasks like object detection, semantic segmentation, video segmentation, 3D understanding, object classification and many more?

Turns out we can, as DINO has shown, by using a teacher-student method: the student learns to match the teacher's projected output vector, and the teacher is updated slowly as an exponential moving average (EMA) of the student's weights.
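
A minimal NumPy sketch of this teacher-student idea, for intuition only. The function names (`log_softmax`, `dino_loss`, `ema_update`), the toy linear "networks" and the default temperatures and momentum are illustrative assumptions, not code from the DINO release. In DINO the two networks see different augmented crops of the same image; here both see the same toy input for brevity.

```python
import numpy as np

def log_softmax(logits, temperature):
    # Numerically stable log-softmax over the last axis.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    # Teacher targets are centred and sharpened (lower temperature).
    # In a real training loop no gradient flows through them.
    targets = np.exp(log_softmax(teacher_logits - center, teacher_temp))
    log_probs = log_softmax(student_logits, student_temp)
    # Cross-entropy between teacher targets and student predictions.
    return float(-(targets * log_probs).sum(axis=-1).mean())

def ema_update(teacher_params, student_params, momentum=0.996):
    # The teacher is not trained by gradient descent: it slowly tracks the student.
    return {name: momentum * teacher_params[name]
                  + (1.0 - momentum) * student_params[name]
            for name in teacher_params}

# Toy usage (hypothetical shapes): one linear layer stands in for each network.
rng = np.random.default_rng(0)
student_params = {"w": rng.normal(size=(4, 3))}
teacher_params = {"w": student_params["w"].copy()}  # teacher starts as a copy
x = rng.normal(size=(2, 4))                         # batch of 2 toy inputs
loss = dino_loss(x @ student_params["w"], x @ teacher_params["w"],
                 center=np.zeros(3))
teacher_params = ema_update(teacher_params, student_params)
```

The momentum close to 1 is what makes the teacher a smoothed, more stable version of the student, which is why it can serve as the target.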

To further improve performance, DINOv3 uses a Gram anchoring loss, which aligns the patch-to-patch similarity structure (Gram matrix) of the student's features with that of a teacher from earlier in training, to better preserve the patch semantic features learned in the early part of the training run.
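
A rough NumPy sketch of a Gram-style anchoring loss, assuming the form "squared Frobenius distance between the Gram matrices of L2-normalised patch features". The names `gram_matrix` and `gram_anchoring_loss` and the toy shapes are hypothetical, and the exact normalisation and weighting used in DINOv3 may differ.

```python
import numpy as np

def gram_matrix(patch_features):
    # patch_features: (num_patches, dim). L2-normalise each patch vector so
    # the Gram matrix holds pairwise cosine similarities between patches.
    norms = np.linalg.norm(patch_features, axis=-1, keepdims=True)
    unit = patch_features / np.maximum(norms, 1e-12)
    return unit @ unit.T

def gram_anchoring_loss(student_patches, anchor_patches):
    # Squared Frobenius norm of the difference between the two Gram matrices.
    # anchor_patches would come from an earlier teacher checkpoint and is
    # treated as a constant target.
    diff = gram_matrix(student_patches) - gram_matrix(anchor_patches)
    return float((diff ** 2).sum())

# Toy usage: 16 patches with 8-dimensional features (made-up sizes).
rng = np.random.default_rng(0)
anchor = rng.normal(size=(16, 8))
drifted = anchor + 0.5 * rng.normal(size=(16, 8))
loss_same = gram_anchoring_loss(anchor, anchor)
loss_drifted = gram_anchoring_loss(drifted, anchor)
```

Because the loss only compares patch-to-patch similarities, the student's features are free to rotate or otherwise change, as long as the similarity structure between patches stays close to the anchor.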

~~~

Slides: https://github.com/tanchongmin/john-youtube/blob/main/Discussion_Sessions/DINOv3.pdf
DINOv3 Paper: https://arxiv.org/pdf/2508.10104

Related Papers:
DINOv2 (with patch occlusion iBOT loss): https://arxiv.org/pdf/2304.07193
DINO (with global-local teacher-student loss): https://arxiv.org/pdf/2104.14294
Vision Transformers (backbone of DINO): https://arxiv.org/pdf/2010.11929
Neural Style Transfer (using Gram Matrix to capture the style of an image): https://arxiv.org/pdf/1705.04058
AppAgent (using Vision Model and XML for positional information on smartphones): https://arxiv.org/pdf/2312.13771
BERT (single backbone, multiple downstream tasks for text-based modality; similar to DINO for image-based modality): https://arxiv.org/pdf/1810.04805

~~~

0:00 Introduction
3:04 Rise of Self-Supervised Learning
25:33 Exponential Moving Average
30:11 Vision Transformer (ViT)
41:47 Key Takeaway: Single backbone, multiple downstream tasks
42:11 DINO: Teacher-student learning
46:11 Gram Anchoring in DINOv3
47:57 Data Curation is very important
51:33 DINOv3 has high-res feature map representation
53:14 Gram Anchoring Loss
1:12:12 Gram loss repairs damage of semantic features drift for patches
1:14:10 Higher Input Resolution is Better
1:19:38 DINOv3 is SOTA in many downstream tasks
1:22:31 Loss Functions in DINOv3
1:29:32 Discussion
1:46:55 Conclusion

~~~

AI and ML enthusiast. Likes to think about the essence behind AI breakthroughs and explain it in a simple and relatable way. Also an avid game creator.

Discord: https://discord.gg/bzp87AHJy5
LinkedIn: https://www.linkedin.com/in/chong-min-tan-94652288/
Online AI blog: https://delvingintotech.wordpress.com/
Twitter: https://twitter.com/johntanchongmin
Try out my games here: https://simmer.io/@chongmin
