DocEng 2011: An Efficient Language-Independent Method to Extract Content from News Webpages

Video Link: https://www.youtube.com/watch?v=4xc2VmaI4dI



Duration: 19:14


The 11th ACM Symposium on Document Engineering
Mountain View, California, USA
September 19-22, 2011

An Efficient Language-Independent Method to Extract Content from News Webpages
Eduardo Teixeira Cardoso, Iam Jabour, Eduardo Laber, Rogério Ferreira Rodrigues, Pedro Lazéra Cardoso
Presented by Eduardo Teixeira Cardoso.

ABSTRACT

We tackle the task of news webpage segmentation, specifically identifying the news title, publication date and story body. While there are very good results in the literature, most of them rely on webpage rendering, which is a very time-consuming step. We focus on scenarios with a high volume of documents, where performance is a must. The chosen approach extends our previous work in the area, combining structural properties with hints of visual presentation styles, computed with a quicker method than regular rendering, and machine learning algorithms. In our experiments, we paid special attention to some aspects that are often overlooked in the literature, such as processing time and the generalization of the extraction results to unseen domains. Our approach proved to be about an order of magnitude faster than an equivalent full rendering alternative while retaining good extraction quality.
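To make the idea of "structural properties with hints of visual presentation styles" computed without rendering more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the class `BlockFeatureParser`, the function `extract_features`, and the chosen features (nesting depth, character count, link membership, emphasis tags) are all hypothetical, picked only to illustrate how per-block features might be gathered in a single pass over the markup. A real system would feed such features to a trained classifier.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Hypothetical "visual hint": tags that usually render large or bold.
EMPHASIS_TAGS = {"h1", "h2", "h3", "b", "strong"}


class BlockFeatureParser(HTMLParser):
    """Collects illustrative features for each text run, without rendering."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.blocks = []  # one feature dict per non-empty text run

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        if "script" in self.stack or "style" in self.stack:
            return  # skip non-visible content
        self.blocks.append({
            "text": text,
            "depth": len(self.stack),          # structural property
            "n_chars": len(text),              # structural property
            "in_link": "a" in self.stack,      # navigation text is often linked
            "emphasized": any(t in EMPHASIS_TAGS for t in self.stack),
        })


def extract_features(html):
    """Return the list of feature dicts for an HTML string (assumed helper)."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Calling `extract_features(page_source)` returns one dictionary per text run; under the assumptions above, a title candidate would tend to be short and emphasized, while story-body text would tend to be long and outside links.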






