Grounded Visual Generation
Video Link: https://www.youtube.com/watch?v=fLqF4isdWPg
Multi-modal data provides an exciting opportunity to train grounded generative models that synthesize images consistent with real-world phenomena. In this talk, I will share several of our recent efforts towards creating grounded visual generation models: (1) introducing user attention grounding for text-to-image synthesis, (2) improving text-to-image generation results with stronger language grounding, and (3) taking steps towards creating spatially grounded world models for embodied vision-and-language tasks.
Speaker: Jing Yu Koh, Google
MSR Deep Learning team: https://www.microsoft.com/en-us/research/group/deep-learning-group/