Collective extraction from heterogeneous web lists

Publication, Journal Article
Machanavajjhala, A; Iyer, A; Bohannon, P; Merugu, S
Published in: Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011
March 14, 2011

Automatic extraction of structured records from inconsistently formatted lists on the web is challenging: different lists present disparate sets of attributes with variations in the ordering of attributes; many lists contain additional attributes and noise that can confuse the extraction process; and formatting within a list may be inconsistent due to missing attributes or manual formatting on some sites. We present a novel solution to this extraction problem that is based on i) collective extraction from multiple lists simultaneously and ii) careful exploitation of a small database of seed entities. Our approach addresses the layout homogeneity within the individual lists, content redundancy across some snippets from different sources, and the noisy attribute rendering process. We experimentally evaluate variants of this algorithm on real world data sets and show that our approach is a promising direction for extraction from noisy lists, requiring mild and thus inexpensive supervision suitable for extraction from the tail of the web. Copyright 2011 ACM.

Published In

Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011

DOI

10.1145/1935826.1935894

Publication Date

March 14, 2011

Start / End Page

445 / 454
 

Citation

APA

Machanavajjhala, A., Iyer, A., Bohannon, P., & Merugu, S. (2011). Collective extraction from heterogeneous web lists. Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011, 445–454. https://doi.org/10.1145/1935826.1935894

Chicago

Machanavajjhala, A., A. Iyer, P. Bohannon, and S. Merugu. “Collective extraction from heterogeneous web lists.” Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011, March 14, 2011, 445–54. https://doi.org/10.1145/1935826.1935894.

ICMJE

Machanavajjhala A, Iyer A, Bohannon P, Merugu S. Collective extraction from heterogeneous web lists. Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011. 2011 Mar 14;445–54.

MLA

Machanavajjhala, A., et al. “Collective extraction from heterogeneous web lists.” Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011, Mar. 2011, pp. 445–54. Scopus, doi:10.1145/1935826.1935894.

NLM

Machanavajjhala A, Iyer A, Bohannon P, Merugu S. Collective extraction from heterogeneous web lists. Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011. 2011 Mar 14;445–454.
