AppAgent: Using GPT-4V to Navigate a Smartphone!

Channel: John Tan Chong Min

Subscribers: 5,450

Published on January 9, 2024 7:54:32 AM ● Video Link: https://www.youtube.com/watch?v=U07rxKenYc4

Duration: 1:41:04

448 views

I am super intrigued by this paper using a simplified Domain Specific Language to navigate UI on smartphones.

Andrej Karpathy tried to do something similar years ago with Reinforcement Learning and failed.

A new era is here for AI automation over User Interfaces. Having an LLM (GPT-4V) generate its own documentation of the UI elements, and then running the ReAct framework (Observation, Thought, Action) over a simplified action space (Click on element, Back, etc.), yields success rates of over 70% on the tested environments.

This increases to a success rate of over 90% when human-crafted documentation is used instead.

The flexible pattern-matching abilities of LLMs over a simplified Domain Specific Language (DSL) are indeed powerful.
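To make the idea concrete, here is a minimal sketch of how a ReAct-style reply (Observation, Thought, Action) over a small action DSL could be parsed. The action names, argument counts, and reply layout below are assumptions chosen for illustration; they are not taken from the AppAgent code or paper.

```python
import ast
import re
from dataclasses import dataclass

# Hypothetical action names and argument counts, for illustration only.
ACTIONS = {"tap": 1, "long_press": 1, "swipe": 3, "text": 1, "back": 0}


@dataclass
class Step:
    observation: str
    thought: str
    action: str
    args: tuple


def parse_step(reply: str) -> Step:
    """Parse one 'Observation / Thought / Action' reply into a Step."""
    fields = {}
    for key in ("Observation", "Thought", "Action"):
        match = re.search(rf"^{key}:[ \t]*(.+)$", reply, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"missing field: {key}")
        fields[key] = match.group(1).strip()

    # The action must look like name(arg, ...), e.g. tap(5) or back().
    call = re.fullmatch(r"(\w+)\((.*)\)", fields["Action"])
    if call is None:
        raise ValueError(f"malformed action: {fields['Action']}")
    name, raw_args = call.group(1), call.group(2).strip()
    if name not in ACTIONS:
        raise ValueError(f"unknown action: {name}")

    # literal_eval only accepts literals, so arbitrary code is never executed.
    try:
        args = ast.literal_eval(f"({raw_args},)") if raw_args else ()
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"bad arguments: {raw_args}") from exc
    if len(args) != ACTIONS[name]:
        raise ValueError(f"{name} expects {ACTIONS[name]} argument(s)")
    return Step(fields["Observation"], fields["Thought"], name, args)


reply = (
    "Observation: The screen shows a search bar labelled 5.\n"
    "Thought: I should tap the search bar first.\n"
    "Action: tap(5)"
)
step = parse_step(reply)
print(step.action, step.args)
```

Because the action space is this small, a strict parser can reject anything outside the DSL, which keeps the agent's output easy to validate before it touches the device.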

I look forward to more papers/products using this approach to solve real world problems!

Side Note: GPT-4V cannot really do precise UI navigation yet, so the input view is a combination of the app screenshot and an XML file that details where the UI elements are. It will be interesting to see how far vision models can be improved so that precise, vision-only UI navigation becomes possible.
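As a rough illustration of what the XML side of that input could provide, the sketch below extracts clickable elements and their centre coordinates from a UI hierarchy. It assumes an Android UI Automator-style dump in which each `node` carries `clickable` and `bounds="[x1,y1][x2,y2]"` attributes; the sample XML, the `list_clickable` helper, and the numbering scheme are made up for this example and are not AppAgent's actual code.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample in the assumed UI Automator-style layout.
SAMPLE_XML = """<hierarchy>
  <node class="android.widget.FrameLayout" clickable="false" bounds="[0,0][1080,1920]">
    <node class="android.widget.EditText" resource-id="search_box" clickable="true" bounds="[40,100][1040,220]" />
    <node class="android.widget.Button" resource-id="send" clickable="true" bounds="[800,1700][1040,1840]" />
  </node>
</hierarchy>"""

BOUNDS = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def list_clickable(xml_text: str) -> list:
    """Return clickable elements with a numeric label and a centre point."""
    root = ET.fromstring(xml_text)
    elements = []
    for node in root.iter("node"):
        if node.get("clickable") != "true":
            continue
        match = BOUNDS.fullmatch(node.get("bounds", ""))
        if match is None:
            continue  # skip nodes without usable bounds
        x1, y1, x2, y2 = map(int, match.groups())
        elements.append({
            "label": len(elements) + 1,  # number the model can refer to
            "id": node.get("resource-id", ""),
            "center": ((x1 + x2) // 2, (y1 + y2) // 2),
        })
    return elements


for element in list_clickable(SAMPLE_XML):
    print(element)
```

With a table like this, the model only has to name a label (for example, element 2), and the exact tap coordinates come from the XML rather than from the vision model.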

Paper: https://arxiv.org/abs/2312.13771

Github: https://github.com/mnotgod96/AppAgent

~~~

Slides: https://github.com/tanchongmin/TensorFlow-Implementations/blob/main/Paper_Reviews/AppAgent.pdf

StrictJSON framework: https://www.youtube.com/watch?v=1N-znDTlhNc

~~~

0:00 Introduction
3:21 Motivation
5:14 ReAct framework
11:25 Overview of AppAgent
22:32 GPT-4V for Image/Text Processing
35:55 Ablation Studies on Image/Text Processing (My Own)
59:38 Domain Specific Language
1:01:03 Exploration Phase
1:19:53 Deployment Phase
1:27:11 Results and Discussion

~~~

AI and ML enthusiast. I like to think about the essence behind AI breakthroughs and explain it in a simple and relatable way. I am also an avid game creator.

Discord: https://discord.gg/bzp87AHJy5
LinkedIn: https://www.linkedin.com/in/chong-min-tan-94652288/
Online AI blog: https://delvingintotech.wordpress.com/
Twitter: https://twitter.com/johntanchongmin
Try out my games here: https://simmer.io/@chongmin

Other Videos By John Tan Chong Min

2024-03-26	TaskGen - LLM Agentic Framework that Does More, Talks Less: Shared Variables, Memory, Global Context
2024-03-18	CRADLE (Part 2): An AI that can play Red Dead Redemption 2. Reflection, Memory, Task-based Planning
2024-03-11	CRADLE (Part 1) - AI that plays Red Dead Redemption 2. Towards General Computer Control and AGI
2024-03-05	TaskGen - A Task-based Agentic Framework using StrictJSON at the core
2024-02-27	SymbolicAI / ExtensityAI Paper Overview (Part 2) - Evaluation Benchmark Discussion!
2024-02-20	SymbolicAI / ExtensityAI Paper Overview (Part 1) - Key Philosophy Behind the Design - Symbols
2024-02-13	Embeddings Walkthrough (Part 2): Context-Dependent Embeddings, Shifting Embedding Space
2024-02-06	Embeddings Walkthrough (Part 1) - Bag of Words to word2vec to Transformer contextual embeddings
2024-01-29	V* - Better than GPT-4V? Iterative Context Refining for Visual Question Answer!
2024-01-23	AutoGen: A Multi-Agent Framework - Overview and Improvements
2024-01-09	AppAgent: Using GPT-4V to Navigate a Smartphone!
2024-01-08	Tutorial #13: StrictJSON, my first Python Package! - Get LLMs to output into a working JSON!
2023-12-20	"Are you smarter than an LLM?" game speedrun
2023-12-08	Is Gemini better than GPT4? Self-created benchmark - Fact Retrieval/Checking, Coding, Tool Use
2023-12-04	Learning, Fast and Slow: 10 Years Plan - Memory Soup, Hier. Planning, Emotions, Knowledge Sharing
2023-12-01	Tutorial #12: Use ChatGPT and off-the-shelf RAG on Terminal/Command Prompt/Shell - SymbolicAI
2023-11-20	JARVIS-1: Multi-modal (Text + Image) Memory + Decision Making with LLMs in MineCraft!
2023-11-20	Tutorial #11: Virtual Persona from Documents, Multi-Agent Chat, Text-to-Speech to hear your Personas
2023-11-14	A Roadmap for AI: Past, Present and Future (Part 3) - Multi-Agent, Multiple Sampling and Filtering
2023-11-07	Learning, Fast and Slow: My Landmark Idea for fast, adaptable agents (ICDL 2023 Best Paper Finalist)
2023-11-06	A roadmap for AI: Past, Present and Future (Part 2): Fixed vs Flexible, Memory Soup vs Hierarchy
