Should Machines Emulate Human Speech Recognition?

Video Link: https://www.youtube.com/watch?v=AItXcykHjqQ



Duration: 1:26:57


Machine-based automatic speech recognition (ASR) systems decode the acoustic signal by associating each time frame with a set of phonetic-segment possibilities; word hypotheses are then formed from the resulting matrices of segment probabilities. This segment-based, serial time-frame approach has been standard practice in ASR for many years. Although ASR's reliability has improved dramatically in recent years, such advances have often relied on huge amounts of training material and an expert team of developers. Might there be a simpler, faster way to develop ASR applications, one that adapts quickly to novel linguistic situations and challenging acoustic environments? It is the thesis of this presentation that future-generation ASR should be based (in part) on strategies used by human listeners to decode the speech signal. A comprehensive theoretical framework will be described, one based on a variety of perceptual, statistical and machine-learning studies. This Multi-Tier framework focuses on the interaction across different levels of linguistic organization. Words are composed of more than segments, and utterances consist of (far) more than words. In Multi-Tier Theory, the syllable serves as the interface between sound (as well as vision) and meaning. Units smaller than the syllable (such as the segment and articulatory-acoustic features) combine with larger units (e.g., the lexeme and prosodic phrase) to provide a more balanced perspective than that afforded by the conventional word/segment framework used in ASR. The presentation will consider (in some detail) how the brain decodes consonants, and how such knowledge can be used to deduce the perceptual flow of phonetic processing. The presentation will conclude with a discussion of how human speech-decoding strategies can (realistically) be used to improve the performance of automatic speech recognition in machines.







Tags:
microsoft research