Building the Next Generation of Conversational AI

Video Link: https://www.youtube.com/watch?v=bTcpNQH8ViQ

Inside the Code: Ankit Kumar (Sesame) & Anjney Midha (a16z) on the Future of Voice AI

What goes into building a truly natural-sounding AI voice? In this episode, Sesame’s cofounder and CTO, Ankit Kumar, joins a16z’s Anjney Midha for a deep dive into the research and engineering behind their voice technology.

They discuss the technical challenges of real-time speech generation, the trade-offs in balancing personality with efficiency, and why the team is open-sourcing key components of their model. Ankit breaks down the complexities of multimodal AI, full-duplex conversation modeling, and the computational optimizations that enable low-latency interactions. They also explore the evolution of natural language as a user interface and its potential to redefine human-computer interaction.

Plus, they take audience questions on everything from scaling laws in speech synthesis to the role of in-context learning in making AI voices more expressive.

Key Takeaways:
How Sesame achieves natural voice interactions through real-time speech generation.
The impact of open-sourcing their speech model and what it means for AI research.
The role of full-duplex modeling in improving AI responsiveness.
How computational efficiency and system latency shape AI conversation quality.
The growing role of natural language as a user interface in AI-driven experiences.

For anyone interested in AI and voice technology, this episode offers an in-depth look at the latest advancements pushing the boundaries of human-computer interaction.

Follow everyone on X:
Ankit Kumar - https://x.com/_apkumar
Anjney Midha - https://x.com/anjneymidha

Check out everything a16z is doing with artificial intelligence, including articles, projects, and more podcasts here – https://a16z.com/ai/