AppAgent: Using GPT-4V to Navigate a Smartphone!
I am super intrigued by this paper using a simplified Domain Specific Language to navigate UI on smartphones.
Andrej Karpathy tried to do something similar years ago with Reinforcement Learning and failed.
The new era is here for AI automation over user interfaces. Using an LLM (GPT-4V) to generate its own documentation of the UI elements, and using the ReAct framework (Observation, Thought, Action) over a simplified action space (click on an element, go back, etc.), yields success rates of over 70% on the tested environments.
This increases to a success rate of over 90% with human-crafted documentation.
The flexible pattern-matching abilities of LLMs over a simplified Domain Specific Language (DSL) are indeed powerful.
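As a rough illustration of why a small DSL is convenient, here is a minimal Python sketch that pulls the Action line out of a ReAct-style reply and checks it against a fixed list of allowed actions. The action names (tap, text, back), the reply layout, and the parse_action helper are my own assumptions for illustration, not the exact interface used in the paper or its repository.

```python
import re
from dataclasses import dataclass

# Hypothetical action space for illustration: action name -> number of arguments.
ACTIONS = {"tap": 1, "text": 1, "back": 0}

# Matches a line such as: Action: tap(3)
ACTION_RE = re.compile(r"^Action:\s*(\w+)\((.*)\)\s*$", re.MULTILINE)


@dataclass
class Action:
    name: str
    args: tuple


def parse_action(reply: str) -> Action:
    """Extract and validate the 'Action:' line of a ReAct-style reply."""
    match = ACTION_RE.search(reply)
    if match is None:
        raise ValueError("no 'Action:' line found in reply")
    name, raw = match.group(1), match.group(2).strip()
    if name not in ACTIONS:
        raise ValueError(f"unknown action: {name}")
    if ACTIONS[name] == 0:
        if raw:
            raise ValueError(f"{name} takes no arguments")
        return Action(name, ())
    if not raw:
        raise ValueError(f"{name} needs one argument")
    # Keep the argument whole (it may contain commas) and drop surrounding quotes.
    return Action(name, (raw.strip('"'),))


reply = (
    "Observation: The screen shows a chat list.\n"
    "Thought: I should open the first chat.\n"
    "Action: tap(3)"
)
print(parse_action(reply))
```

Because the action space is this small, anything the model outputs outside it can be rejected early instead of being executed on the phone.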
I look forward to more papers/products using this approach to solve real world problems!
Side Note: GPT-4V cannot really do precise UI navigation yet, so the input is a combination of the app screenshot and an XML file that details where the UI elements are. It will be interesting to see how far vision models can be improved for more precise UI navigation, so that we can do vision-only UI navigation.
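To make the XML part concrete, here is a minimal Python sketch of how such a file could be turned into a numbered list of tappable elements. It assumes the XML looks like an Android uiautomator dump (node tags with clickable, resource-id, and bounds="[x1,y1][x2,y2]" attributes). The sample XML and the clickable_elements helper are made up for illustration and are not taken from the AppAgent code.

```python
import re
import xml.etree.ElementTree as ET

# Bounds are assumed to look like "[x1,y1][x2,y2]" (uiautomator dump style).
BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")

# Made-up sample for illustration only.
SAMPLE_XML = """
<hierarchy>
  <node clickable="false" resource-id="" bounds="[0,0][1080,1920]">
    <node clickable="true" resource-id="com.example:id/send" bounds="[900,1700][1080,1900]"/>
    <node clickable="true" resource-id="com.example:id/menu" bounds="[0,0][100,100]"/>
  </node>
</hierarchy>
"""


def clickable_elements(xml_text):
    """Return numbered clickable elements with the centre point of their bounds."""
    root = ET.fromstring(xml_text)
    elements = []
    for node in root.iter("node"):
        if node.get("clickable") != "true":
            continue
        match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if match is None:
            continue  # skip nodes with missing or malformed bounds
        x1, y1, x2, y2 = map(int, match.groups())
        elements.append({
            "label": len(elements) + 1,  # number that could be overlaid on the screenshot
            "resource_id": node.get("resource-id", ""),
            "center": ((x1 + x2) // 2, (y1 + y2) // 2),
        })
    return elements


for element in clickable_elements(SAMPLE_XML):
    print(element)
```

With a list like this, the model only needs to pick an element number, and the tap coordinates come from the XML rather than from the model's reading of the pixels.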
Paper: https://arxiv.org/abs/2312.13771
GitHub: https://github.com/mnotgod96/AppAgent
~~~
Slides: https://github.com/tanchongmin/TensorFlow-Implementations/blob/main/Paper_Reviews/AppAgent.pdf
StrictJSON framework: https://www.youtube.com/watch?v=1N-znDTlhNc
~~~
0:00 Introduction
3:21 Motivation
5:14 ReAct framework
11:25 Overview of AppAgent
22:32 GPT-4V for Image/Text Processing
35:55 Ablation Studies on Image/Text Processing (My Own)
59:38 Domain Specific Language
1:01:03 Exploration Phase
1:19:53 Deployment Phase
1:27:11 Results and Discussion
~~~
AI and ML enthusiast. Likes to think about the essence behind AI breakthroughs and explain it in a simple and relatable way. Also, I am an avid game creator.
Discord: https://discord.gg/bzp87AHJy5
LinkedIn: https://www.linkedin.com/in/chong-min-tan-94652288/
Online AI blog: https://delvingintotech.wordpress.com/
Twitter: https://twitter.com/johntanchongmin
Try out my games here: https://simmer.io/@chongmin