Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty
A Google TechTalk, presented by Meera Hahn, 2024-12-05
ABSTRACT: User prompts for generative AI models are often underspecified or open-ended, which may lead to suboptimal responses. This prompt underspecification problem is particularly evident in text-to-image (T2I) generation, where users commonly struggle to articulate their precise intent. This disconnect between the user's vision and the model's interpretation often forces users to painstakingly and repeatedly refine their prompts. To address this, we propose a design for proactive T2I agents equipped with an interface to actively ask clarification questions when uncertain, and to present their understanding of user intent as an interpretable belief graph that the user can edit. We build simple prototypes of such agents and verify their effectiveness through both human studies and automated evaluation. We observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow. Moreover, we introduce a scalable automated evaluation approach based on two agents: one has access to a ground truth image, while the other tries to ask as few questions as possible to align its understanding with that ground truth.
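As a rough illustration of the clarification loop described in the abstract, the following minimal Python sketch shows how an agent might track a belief graph of entity attributes, ask about the least-confident ones, and fold the user's answers back into an enriched prompt. The class names, confidence threshold, and hard-coded beliefs are illustrative assumptions for this sketch, not the system presented in the talk.

```python
from dataclasses import dataclass, field

@dataclass
class BeliefNode:
    """One attribute in the agent's belief graph (hypothetical schema)."""
    entity: str          # e.g. "dog"
    attribute: str       # e.g. "breed"
    value: str | None = None
    confidence: float = 0.0

@dataclass
class ProactiveT2IAgent:
    """Toy agent that asks clarification questions for low-confidence beliefs."""
    threshold: float = 0.7
    beliefs: list[BeliefNode] = field(default_factory=list)

    def init_beliefs(self, prompt: str) -> None:
        # Placeholder: a real agent would extract entities and attributes from
        # the prompt with a language model; here we hard-code one
        # underspecified entity for illustration.
        self.beliefs = [
            BeliefNode("dog", "breed", None, 0.2),
            BeliefNode("dog", "setting", "park", 0.9),
        ]

    def next_question(self) -> BeliefNode | None:
        # Ask about the least-confident belief below the threshold, if any.
        uncertain = [b for b in self.beliefs if b.confidence < self.threshold]
        return min(uncertain, key=lambda b: b.confidence) if uncertain else None

    def incorporate_answer(self, node: BeliefNode, answer: str) -> None:
        # A user's explicit answer resolves that belief with full confidence.
        node.value, node.confidence = answer, 1.0

    def enriched_prompt(self, prompt: str) -> str:
        # Fold the resolved beliefs back into the prompt sent to the T2I model.
        details = ", ".join(f"{b.entity} {b.attribute}: {b.value}"
                            for b in self.beliefs if b.value)
        return f"{prompt} ({details})"

# Example interaction with a simulated user answering one clarification question.
agent = ProactiveT2IAgent()
agent.init_beliefs("a dog in a park")
while (q := agent.next_question()) is not None:
    print(f"Agent asks: what is the {q.attribute} of the {q.entity}?")
    agent.incorporate_answer(q, "golden retriever")   # simulated user reply
print(agent.enriched_prompt("a dog in a park"))
```

The same loop structure can stand in for the two-agent automated evaluation: the answering role is played by a second agent that consults a ground truth image, and the number of questions needed to align the belief graph serves as the evaluation signal.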
Speaker Bio:
Meera Hahn is a Research Scientist at Google DeepMind, working predominantly at the intersection of computer vision and natural language processing. She joined Google in 2022 after completing her PhD at Georgia Tech. Her research interests include embodied AI, text-based navigation and localization, text-to-image and video generation, and general multimodal AI tasks. To learn more about her research, visit her homepage at https://meerahahn.github.io/