V* - Better than GPT-4V? Iterative Context Refining for Visual Question Answering!
Is V* really better than GPT-4V at Visual Question Answering (VQA)?
V* is a way to augment the VQA prompt so that it contains more than just the image and the question: it also includes a list of target objects that can help answer the question, together with their positions.
This list of target objects is found via a Visual Search Model. The Visual Search Model starts with the full image and tries to find the target object's bounding box. If it cannot, it uses a heatmap that links a contextual cue to the target object and identifies the quadrant of the current image in which the target object is most likely to be found. The search then continues on that quadrant, repeating until the target is found or the minimum image size is reached, as sketched below.
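Here is a minimal Python sketch of that iterative search (my own illustration, not the actual V* code): detect and cue_heatmap stand in for the object detector and the contextual-cue heatmap model, the image is assumed to be a PIL-style image, and the heatmap a 2D array the same size as the image.

def pick_hottest_quadrant(image, heatmap):
    # Split the image into 4 quadrants and return the crop whose
    # heatmap region has the highest total activation.
    h, w = heatmap.shape
    quads = [(0, 0, w // 2, h // 2), (w // 2, 0, w, h // 2),
             (0, h // 2, w // 2, h), (w // 2, h // 2, w, h)]
    best = max(quads, key=lambda q: heatmap[q[1]:q[3], q[0]:q[2]].sum())
    return image.crop(best)  # (left, top, right, bottom) crop

def visual_search(image, target, detect, cue_heatmap, min_size=224):
    # Iteratively zoom into the most promising quadrant until the target
    # is detected or the crop reaches the minimum image size.
    while True:
        bbox = detect(image, target)          # try to localise the target directly
        if bbox is not None:
            return bbox, image                # found: return box + the crop it was found in
        if min(image.size) <= min_size:
            return None, image                # give up at the minimum image size
        heatmap = cue_heatmap(image, target)  # contextual cue: where might the target be?
        image = pick_hottest_quadrant(image, heatmap)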
This iterative focusing of the image helps to mitigate the lack of positional sensitivity in the Vision Transformer (ViT) embeddings used to encode the image for the multimodal Large Language Model (LLM).
In general, this approach of adding relevant context and searching the image by focusing on the right sub-sections is a very powerful one. I also show that incorporating some aspects of the V* method into GPT-4V can help improve GPT-4V's performance!
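To give a flavour of that experiment, here is a hypothetical sketch of how V*-style context could be added to a GPT-4V query (ask_gpt4v is a made-up helper that sends a list of images plus a text prompt to the model; the target list would come from the question itself):

def vqa_with_visual_search(image, question, targets, search, ask_gpt4v):
    # search: the visual_search above, with the detection/heatmap models bound in
    # ask_gpt4v: hypothetical helper that queries GPT-4V with images + text
    context, crops = [], []
    for target in targets:                    # e.g. ["cup", "child in red shirt"]
        bbox, crop = search(image, target)
        if bbox is not None:
            context.append(f"- {target}: bounding box {bbox}")
            crops.append(crop)
    prompt = (f"Question: {question}\n"
              "Relevant objects and their locations:\n" + "\n".join(context) +
              "\nZoomed-in crops of these objects are attached after the full image.")
    return ask_gpt4v([image, *crops], prompt)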
~~~
References:
V* Project Page: https://vstar-seal.github.io/
My Slides: https://github.com/tanchongmin/TensorFlow-Implementations/blob/main/Paper_Reviews/Vstar.pdf
Vision Transformer: https://arxiv.org/abs/2010.11929
CLIP embeddings: https://arxiv.org/abs/2103.00020
LLaVA: Large Language and Vision Assistant: https://arxiv.org/abs/2304.08485
GPT-4 Technical Report: https://arxiv.org/abs/2303.08774
Chain-of-thought: https://arxiv.org/abs/2201.11903
ReAct framework: https://arxiv.org/abs/2210.03629
~~~
0:00 Introduction
2:13 Key issue with CLIP embeddings based on ViT
13:31 Background: LLaVA
18:05 Overall Walkthrough of V*
31:12 Visual Search Model
39:52 Visual QA Model
43:33 Iterative Visual Search
45:05 V* Example 1
52:00 V* Example 2
59:26 A form of Best First Search
1:02:42 How to improve V* (Great discussion with Richard)
1:10:46 Putting Everything Together
1:13:38 Comparison: Chain of Thought
1:15:17 Comparison: ReAct Framework
1:16:53 Results
1:22:11 My experiments: Incorporating V* into GPT-4V
1:23:32 V* is actually less generic than GPT-4V
1:24:33 V*'s heuristic heat-map-guided search is similar to human fixation!
1:26:10 My takeaways
1:26:47 Discussion and Conclusion
~~~
AI and ML enthusiast. Likes to think about the essences behind breakthroughs in AI and explain them in a simple and relatable way. Also an avid game creator.
Discord: https://discord.gg/bzp87AHJy5
LinkedIn: https://www.linkedin.com/in/chong-min-tan-94652288/
Online AI blog: https://delvingintotech.wordpress.com/
Twitter: https://twitter.com/johntanchongmin
Try out my games here: https://simmer.io/@chongmin