Jiafei Duan: Uncovering the 'Right' Representations for Multimodal LLMs for Robotics
Speaker Profile:
Jiafei Duan is a third-year PhD student in robotics at the University of Washington’s Paul G. Allen School of Computer Science & Engineering, where he is part of the Robotics and State Estimation Lab, co-advised by Professors Dieter Fox and Ranjay Krishna. His research focuses on robot learning, embodied AI, foundation models, and computer vision. He is currently funded by the National Science Foundation (NSF) Graduate Research Fellowship. Previously, he was with NVIDIA Research and A*STAR Research.
http://www.duanjiafei.com/
Featured Papers:
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation: https://arxiv.org/abs/2410.00371
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics: https://arxiv.org/abs/2406.10721
Octopi: Object Property Reasoning with Large Tactile-Language Models: https://arxiv.org/abs/2405.02794
Abstract:
Recent advancements have shown the potential of multimodal large language models (MLLMs) and large language models (LLMs) to automate several high-level tasks in robotics, such as task planning, reward function generation, action primitive code generation, and success verification. However, key questions remain: Are existing open-source and proprietary MLLMs/LLMs adequate for robotics, or is there a need for domain-specific models? What constitutes an optimal representation for robotics-focused MLLMs? Moreover, can we develop a unified MLLM fine-tuned specifically for robotics applications? In this talk, I aim to explore and address some of these questions through our recent efforts in instruction-tuning MLLMs for robotics.
~~~
0:00 Introduction
1:11 Background of Foundation Models
8:42 AHA: VLM for Reasoning over Failures
24:17 RoboPoint: VLM for Spatial Affordance Prediction (“Pointing”)
32:44 Octopi: Object Property Reasoning with Tactile-Language Models
40:18 Discussion
1:28:15 Conclusion
~~~
AI and ML enthusiast. I like to think about the essence behind AI breakthroughs and explain it in a simple and relatable way. I am also an avid game creator.
Discord: https://discord.gg/bzp87AHJy5
LinkedIn: https://www.linkedin.com/in/chong-min-tan-94652288/
Online AI blog: https://delvingintotech.wordpress.com/
Twitter: https://twitter.com/johntanchongmin
Try out my games here: https://simmer.io/@chongmin