Exploiting comparable corpora

Published on: 2016-09-07
Video Link: https://www.youtube.com/watch?v=nKZcTugnPiU



Duration: 1:05:52


One of the major bottlenecks in the development of Statistical Machine Translation systems for most language pairs is the lack of bilingual parallel training data. Currently available parallel corpora span relatively few language pairs and very few domains; building new ones of sufficient size and quality is time-consuming and expensive. In this talk, I will present methods that enable the automatic creation of parallel corpora by exploiting a rich, diverse, and readily available resource: comparable corpora. Comparable corpora are bilingual texts that, while not parallel in the strict sense, are somewhat related and convey overlapping information. Such texts exist in large quantities on the Web; good examples are the multilingual news feeds produced by news agencies such as Agence France Presse, CNN, and BBC. I will present novel methods for extracting good-quality parallel data from such comparable collections. I will show how to detect parallelism at various levels of granularity, and thus find parallel documents (if there are any in the collection), parallel sentences, and parallel sub-sentential fragments. To demonstrate the validity of this approach, I apply these methods to extract data from large-scale comparable corpora for various language pairs, and show that the extracted data helps improve the end-to-end performance of a state-of-the-art machine translation system.







Tags:
microsoft research