Jiafei Duan: Uncovering the 'Right' Representations for Multimodal LLMs for Robotics

Video Link: https://www.youtube.com/watch?v=koxyRNNDpzk





Speaker Profile:
Jiafei Duan is a third-year PhD student in robotics at the University of Washington’s Paul G. Allen School of Computer Science & Engineering, where he is part of the Robotics and State Estimation Lab, co-advised by Professors Dieter Fox and Ranjay Krishna. His research focuses on robot learning, embodied AI, foundation models, and computer vision. He is currently funded by the National Science Foundation (NSF) Graduate Research Fellowship. Previously, he was with NVIDIA Research and A*STAR Research.
http://www.duanjiafei.com/

Featured Papers:
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation: https://arxiv.org/abs/2410.00371
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics: https://arxiv.org/abs/2406.10721
Octopi: Object Property Reasoning with Large Tactile-Language Models: https://arxiv.org/abs/2405.02794

Abstract:
Recent advancements have shown the potential of multi-modal large language models (MLLMs) and large language models (LLMs) to automate several high-level tasks in robotics, such as task planning, reward function generation, action primitive code generation, and success verification. However, key questions remain: Are existing open-source and proprietary MLLMs/LLMs adequate for robotics, or is there a need for domain-specific models? What constitutes an optimal representation for robotics-focused MLLMs? Moreover, can we develop a unified MLLM fine-tuned specifically for robotics applications? In this talk, I aim to explore and address some of these questions through our recent efforts in instruction-tuning MLLMs for robotics.

~~~

0:00 Introduction
1:11 Background of Foundation Models
8:42 AHA: VLM for Reasoning over Failures
24:17 RoboPoint: VLM for Spatial Affordance Prediction (“Pointing”)
32:44 Octopi: Object Property Reasoning with Tactile-Language Models
40:18 Discussion
1:28:15 Conclusion

~~~

AI and ML enthusiast. I like to think about the essence behind AI breakthroughs and explain it in a simple and relatable way. I am also an avid game creator.

Discord: https://discord.gg/bzp87AHJy5
LinkedIn: https://www.linkedin.com/in/chong-min-tan-94652288/
Online AI blog: https://delvingintotech.wordpress.com/
Twitter: https://twitter.com/johntanchongmin
Try out my games here: https://simmer.io/@chongmin



