DocEng 2011: A Versatile Model for Web Page Representation

Video Link: https://www.youtube.com/watch?v=tPrg6U0yljs



Duration: 29:09


The 11th ACM Symposium on Document Engineering
Mountain View, California, USA
September 19-22, 2011

A Versatile Model for Web Page Representation, Information Extraction and Content Re-Packaging
Bernhard Krüpl-Sypien, Ruslan Fayzrakhmanov, Wolfgang Holzinger, Mathias Panzenböck, Robert Baumgartner
Presented by Bernhard Krüpl-Sypien.

ABSTRACT

On today's Web, designers take huge efforts to create visually rich websites that boast a multitude of interactive elements. In contrast, most web information extraction (WIE) algorithms are still based on attributed tree methods, which struggle to deal with this complexity. In this paper, we introduce a versatile model to represent web documents. The model is based on gestalt theory principles, trying to capture the most important aspects in a formally exact way. It (i) represents and unifies access to visual layout, content and functional aspects; (ii) is implemented with semantic web techniques that can be leveraged for, e.g., automatic reasoning. Considering the visual appearance of a web page, we view it as a collection of gestalt figures, based on gestalt primitives, each representing a specific design pattern, be it navigation menus or news articles. Based on this model, we introduce our WIE methodology, a re-engineering process involving design patterns, statistical distributions and text content properties. The complete framework consists of the UOM model, which formalizes the mentioned components, and the MANM layer that hints at structure and serialization, providing document re-packaging foundations. Finally, we discuss how we have applied and evaluated our model in the area of web accessibility.




Other Videos By Google TechTalks


2011-10-05 2011 Frontiers of Engineering: Additive Manufacturing is Changing Surgery
2011-10-05 2011 Frontiers of Engineering: Expanding Design Spaces
2011-10-05 2011 Frontiers of Engineering: Additive Manufacturing in Aerospace
2011-10-04 2011 Frontiers of Engineering: Challenges and Opportunities for Low-Carbon Buildings
2011-10-04 2011 Frontiers of Engineering: Multi-Scale Modeling of Sustainable Buildings
2011-10-04 2011 Frontiers of Engineering: Accelerating Green Building Market Transformation with IT
2011-10-04 2011 Frontiers of Engineering: Where Are the Emerging Frontiers in Research and Innovation?
2011-10-04 DocEng 2011: Reflowable Documents Composed from Pre-rendered Atomic Components
2011-10-04 DocEng 2011: Paginate Dynamic and Web Content
2011-10-04 DocEng 2011: Document Visual Similarity Measure For Document Search
2011-10-04 DocEng 2011: A Versatile Model for Web Page Representation
2011-10-03 DocEng 2011: Expressing Conditions in Tailored Brochures for Public Administration
2011-10-03 DocEng 2011: Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection
2011-10-03 DocEng 2011: A Study of the Interaction of Paper Substrates on Printed Forensic Imaging
2011-09-30 2011 Frontiers of Engineering: Automatic Text Understanding of Content and Text Quality
2011-09-29 DocEng 2011: Interoperable Metadata Semantics
2011-09-29 2011 Frontiers of Engineering: Large Scale Visual Semantic Extraction
2011-09-29 2011 Frontiers of Engineering: Advancing Natural Language Understanding
2011-09-28 2011 Frontiers of Engineering: Research at Google Lightning Talks
2011-09-28 DocEng 2011: An Efficient Language-Independent Method to Extract Content from News Webpages
2011-09-28 DocEng 2011: Dynamic Assistance to Adding Dimensions to Multi-structured Documents



Tags:
google tech talk
doceng 2011
document engineering