Building the Next Generation of Conversational AI
Inside the Code: Ankit Kumar (Sesame) & Anjney Midha (a16z) on the Future of Voice AI
What goes into building a truly natural-sounding AI voice? In this episode, Sesame’s cofounder and CTO, Ankit Kumar, joins a16z’s Anjney Midha for a deep dive into the research and engineering behind their voice technology.
They discuss the technical challenges of real-time speech generation, the trade-offs in balancing personality with efficiency, and why the team is open-sourcing key components of their model. Ankit breaks down the complexities of multimodal AI, full-duplex conversation modeling, and the computational optimizations that enable low-latency interactions. They also explore the evolution of natural language as a user interface and its potential to redefine human-computer interaction.
Plus, we take audience questions on everything from scaling laws in speech synthesis to the role of in-context learning in making AI voices more expressive.
Key Takeaways:
How Sesame achieves natural voice interactions through real-time speech generation.
The impact of open-sourcing their speech model and what it means for AI research.
The role of full-duplex modeling in improving AI responsiveness.
How computational efficiency and system latency shape AI conversation quality.
The growing role of natural language as a user interface in AI-driven experiences.
For anyone interested in AI and voice technology, this episode offers an in-depth look at the latest advancements pushing the boundaries of human-computer interaction.
Follow everyone on X:
Ankit Kumar - https://x.com/_apkumar
Anjney Midha - https://x.com/anjneymidha
Check out everything a16z is doing with artificial intelligence, including articles, projects, and more podcasts here – https://a16z.com/ai/