Speaker Diarization: Optimal Clustering and Learning Speaker Embeddings

Subscribers: 344,000
Video Link: https://www.youtube.com/watch?v=vcyB8xb1-ys
Duration: 1:06:53
Views: 4,240

Speaker diarization consists of automatically partitioning an input audio stream into homogeneous segments (segmentation) and grouping the segments that belong to the same speaker (speaker clustering). This process can enhance readability by structuring an audio document, or provide the speaker's true identity when used in conjunction with a speaker recognition system. In this seminar I will talk about two new methods: ILP clustering and speaker embeddings. In speaker clustering, a major problem with greedy hierarchical agglomerative clustering (HAC) is that it does not guarantee an optimal solution. I propose a new clustering model (called ILP clustering) that redefines the clustering problem as a linear program, i.e., an objective function subject to linear equality and/or inequality constraints. An Integer Linear Programming (ILP) solver can then search for the optimal solution over the whole problem. In the second part, I propose to learn a set of high-level feature representations through deep learning, referred to as speaker embeddings. Speaker embedding features are taken from the hidden-layer neuron activations of a Deep Neural Network (DNN) trained as a classifier to recognize a thousand speaker identities in a training set. Although learned through identification, the speaker embeddings are shown to be effective for speaker verification, in particular for recognizing speakers unseen in the training set. The experiments were conducted on the ETAPE corpus of French broadcast news, where these new methods based on ILP clustering and speaker embeddings decrease the DER by 4.79 points over the baseline diarization system based on HAC/GMM.
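To make the ILP reformulation concrete, here is a minimal sketch of one common way such a clustering problem can be posed: binary variables decide which segments act as cluster centers and which center each segment joins, and the objective trades off the number of clusters against the within-cluster distances. The distance matrix, threshold, and objective weighting below are illustrative assumptions, not the exact formulation from the talk; a real system would hand the program to an ILP solver, while this toy version enumerates the assignments exactly.

```python
from itertools import product

# Hypothetical pairwise distances between 4 speech segments:
# segments 0/1 are close to each other, as are segments 2/3.
D = [
    [0.0, 0.3, 2.0, 2.2],
    [0.3, 0.0, 2.1, 1.9],
    [2.0, 2.1, 0.0, 0.4],
    [2.2, 1.9, 0.4, 0.0],
]
DELTA = 1.0  # distance threshold: a segment may only join a center within DELTA

def ilp_cluster(D, delta):
    """Exact search over the binary assignment variables of the ILP:
        minimize  (#cluster centers) + (1/delta) * sum of assigned distances
        s.t.      each segment is assigned to exactly one center,
                  a segment can only join an 'open' center (a self-assigned one),
                  every assigned distance is at most delta.
    Brute force stands in for an ILP solver at this toy size."""
    n = len(D)
    best_cost, best_assign = float("inf"), None
    # assign[j] = k means segment j belongs to the cluster centered on segment k
    for assign in product(range(n), repeat=n):
        centers = set(assign)
        # a center must be assigned to itself
        if any(assign[k] != k for k in centers):
            continue
        # distance constraint
        if any(D[assign[j]][j] > delta for j in range(n)):
            continue
        cost = len(centers) + sum(D[assign[j]][j] for j in range(n)) / delta
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_assign

print(ilp_cluster(D, DELTA))  # segments {0, 1} and {2, 3} form two clusters
```

Because the whole feasible space is searched, the result is globally optimal for this objective, which is exactly the guarantee that greedy HAC lacks: HAC commits to each merge and can never undo an early mistake.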
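The speaker-embedding idea can likewise be sketched in a few lines: train a DNN to classify speaker identities, then discard the softmax output layer at test time and use the hidden-layer activations, pooled over a segment, as a fixed-size speaker representation. The network below uses random weights and illustrative layer sizes purely to show the extraction mechanics, assuming a single ReLU hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 20-dim acoustic features, 8 hidden units,
# 1000 speaker identities in the (hypothetical) training set.
N_FEAT, N_HID, N_SPK = 20, 8, 1000
W1, b1 = rng.standard_normal((N_HID, N_FEAT)), rng.standard_normal(N_HID)
W2, b2 = rng.standard_normal((N_SPK, N_HID)), rng.standard_normal(N_SPK)  # softmax layer, training only

def speaker_embedding(frames):
    """Segment-level embedding: hidden-layer activations of the speaker-ID
    classifier, averaged over the segment's frames. The output layer
    (W2, b2) is only needed to train the classifier; at test time the
    hidden representation itself is the embedding."""
    hidden = np.maximum(0.0, frames @ W1.T + b1)  # ReLU hidden activations, one row per frame
    return hidden.mean(axis=0)                    # pool over frames -> fixed-size vector

segment = rng.standard_normal((50, N_FEAT))  # 50 frames of acoustic features
emb = speaker_embedding(segment)
print(emb.shape)  # (8,): one vector per segment, regardless of segment length
```

Because the embedding does not depend on the identity labels at test time, two segments from a speaker never seen in training can still be compared (e.g. by cosine distance between their embeddings), which is what makes the representation usable for verification and for clustering in diarization.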




Other Videos By Microsoft Research


2016-06-13 What are the prospects for automatic theorem proving?
2016-06-13 Towards Understandable Neural Networks for High Level AI Tasks - Part 3
2016-06-13 Artist in Residence (formerly Studio99) Presents: Michael Gough and "Drawing as Literacy."
2016-06-13 Towards Cross-fertilization Between Propositional Satisfiability and Data Mining
2016-06-13 Making Objects Count: A Shape Analysis Framework for Proving Polynomial Time Termination
2016-06-13 Human factors of software updates
2016-06-13 Machine-Checked Correctness and Complexity of a Union-Find Implementation
2016-06-13 Applications of 3-Dimensional Spherical Transforms to Acoustics and Personalization of Head-related
2016-06-13 Network Protocols: Myths, Missteps, and Mysteries
2016-06-13 Optimal and Adaptive Online Learning
2016-06-13 Speaker Diarization: Optimal Clustering and Learning Speaker Embeddings
2016-06-13 Multi-rate neural networks for efficient acoustic modeling
2016-06-13 Unsupervised Latent Faults Detection in Data Centers
2016-06-13 System and Toolchain Support for Reliable Intermittent Computing
2016-06-13 Gates Foundation Presents: Crucial Areas of Fintech Innovation for the Bottom of the Pyramid
2016-06-13 Social Computing Symposium 2016: Harassment, Threats, Trolling Online, Diversity in Gaming is Vital
2016-06-13 Bringing Harmony Through AI and Economics
2016-06-13 Approximating Integer Programming Problems by Partial Resampling
2016-06-13 A Lasserre-Based (1+epsilon)-Approximation for Makespan Scheduling with Precedence Constraints
2016-06-13 Towards Understandable Neural Networks for High Level AI Tasks - Part 7
2016-06-13 Verasco, a formally verified C static analyzer



Tags:
microsoft research
speaker diarization
speaker clustering
ilp clustering
deep neural networks
natural language processing and speech