Exploiting comparable corpora

Published on September 7, 2016 4:40:59 PM ● Video Link: https://www.youtube.com/watch?v=nKZcTugnPiU

Duration: 1:05:52

302 views

One of the major bottlenecks in the development of Statistical Machine Translation systems for most language pairs is the lack of bilingual parallel training data. Currently available parallel corpora span relatively few language pairs and very few domains; building new ones of sufficient size and quality is time-consuming and expensive. In this talk, I will present methods that enable the automatic creation of parallel corpora by exploiting a rich, diverse, and readily available resource: comparable corpora. Comparable corpora are bilingual texts that, while not parallel in the strict sense, are related and convey overlapping information. Such texts exist in large quantities on the Web; good examples are the multilingual news feeds produced by news agencies such as Agence France-Presse, CNN, and BBC. I will present novel methods for extracting good-quality parallel data from such comparable collections. I will show how to detect parallelism at various levels of granularity, and thus find parallel documents (if the collection contains any), parallel sentences, and parallel sub-sentential fragments. To demonstrate the validity of this approach, I apply these methods to large-scale comparable corpora for various language pairs and show that the extracted data helps improve the end-to-end performance of a state-of-the-art machine translation system.
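
As a rough illustration of the sentence-level step, and not the method presented in the talk, the sketch below scores every candidate sentence pair from two comparable texts by lexical overlap and keeps the pairs above a threshold. The toy French-English lexicon, the function names `overlap_score` and `extract_pairs`, and the 0.6 threshold are all assumptions made for this example.

```python
from itertools import product

# Hypothetical toy French-English lexicon (an assumption for illustration).
# A real system would typically use a dictionary learned from existing
# parallel data.
LEXICON = {
    "le": {"the"},
    "chat": {"cat"},
    "dort": {"sleeps", "sleeping"},
    "président": {"president"},
    "a": {"has"},
    "parlé": {"spoke", "spoken"},
}


def tokenize(sentence):
    """Lowercase and split on whitespace; deliberately simplistic."""
    return sentence.lower().split()


def overlap_score(src, tgt, lexicon=LEXICON):
    """Return a symmetric lexical-coverage score in [0, 1].

    The score is the smaller of two fractions: source tokens that have a
    known translation in the target sentence, and target tokens that are a
    known translation of some source token.
    """
    src_tokens = tokenize(src)
    tgt_tokens = tokenize(tgt)
    if not src_tokens or not tgt_tokens:
        return 0.0

    tgt_set = set(tgt_tokens)
    src_covered = sum(1 for w in src_tokens if lexicon.get(w, set()) & tgt_set)

    # Every translation the lexicon offers for any source token.
    translations = set()
    for w in src_tokens:
        translations |= lexicon.get(w, set())
    tgt_covered = sum(1 for w in tgt_tokens if w in translations)

    return min(src_covered / len(src_tokens), tgt_covered / len(tgt_tokens))


def extract_pairs(src_sentences, tgt_sentences, threshold=0.6):
    """Score all cross-language sentence pairs and keep the likely parallel ones."""
    pairs = []
    for s, t in product(src_sentences, tgt_sentences):
        score = overlap_score(s, t)
        if score >= threshold:
            pairs.append((s, t, score))
    return pairs
```

Taking the minimum of the two coverage fractions penalizes pairs in which one sentence carries substantial content that the other lacks, which is the typical situation in comparable (rather than parallel) text. Scoring all pairs is quadratic in the number of sentences, so a practical pipeline would first narrow the candidates, for example by matching documents before sentences.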

Other Videos By Microsoft Research

2016-09-07	Faster Decoding with Synchronous Grammars and n-gram Language Models
2016-09-07	Locality and Phases: Dynamic Structures in Large-Scale Program Behavior
2016-09-07	Inversion Transduction Grammar with Linguistic Constraints
2016-09-07	How scheduling theory, scenarios, model checking and slicing can help in the verification of RTS
2016-09-07	Innovention - the process of innovation and invention
2016-09-07	Security and Privacy in Radio Frequency Identification
2016-09-07	Conference XP - Automated Tracking of Student Behaviors
2016-09-07	From Models to Systems: Applications of Model-based Design to Modern Large-Scale Systems
2016-09-07	Splitting on Demand in Satisfiability Modulo Theories
2016-09-07	Making Semiconductors Ferromagnetic: Reasons, Challenges, and Opportunities
2016-09-07	Exploiting comparable corpora
2016-09-07	Invisible Engines: How Software Platforms Drive Innovation
2016-09-07	Towards Documenting and Automating Collateral Evolutions in Linux Device Driver
2016-09-07	Phonological Licensing of Grammatical Morphology in Early Speech
2016-09-07	Purpose: The Starting Point of Great Companies
2016-09-07	Location, Time and Context in Systems: Rover - An Example
2016-09-07	Exploring Tools and Techniques for Distributed Continuous Quality Assurance
2016-09-07	QuickSilver Scalable Multicast
2016-09-07	Splitting Interfaces: Making Trust Between Applications and Operating Systems Configurable
2016-09-07	Conference XP Project Update
2016-09-07	Relational Databases in the Social and Health Sciences: The View from Demography

Tags:

microsoft research
