Collective extraction from heterogeneous web lists

Publication, Journal Article

Machanavajjhala, A; Iyer, A; Bohannon, P; Merugu, S

Published in: Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011

March 14, 2011

Automatic extraction of structured records from inconsistently formatted lists on the web is challenging: different lists present disparate sets of attributes with variations in the ordering of attributes; many lists contain additional attributes and noise that can confuse the extraction process; and formatting within a list may be inconsistent due to missing attributes or manual formatting on some sites. We present a novel solution to this extraction problem that is based on i) collective extraction from multiple lists simultaneously and ii) careful exploitation of a small database of seed entities. Our approach addresses the layout homogeneity within the individual lists, content redundancy across some snippets from different sources, and the noisy attribute rendering process. We experimentally evaluate variants of this algorithm on real world data sets and show that our approach is a promising direction for extraction from noisy lists, requiring mild and thus inexpensive supervision suitable for extraction from the tail of the web. Copyright 2011 ACM.

Duke Scholars

Author: Ashwinkumar Venkatanaga Machanavajjhala, Computer Science

Published In

Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011

DOI

10.1145/1935826.1935894

Publication Date

March 14, 2011

Start / End Page

445 / 454

Citation

APA: Machanavajjhala, A., Iyer, A., Bohannon, P., & Merugu, S. (2011). Collective extraction from heterogeneous web lists. Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011, 445–454. https://doi.org/10.1145/1935826.1935894

Chicago: Machanavajjhala, A., A. Iyer, P. Bohannon, and S. Merugu. “Collective extraction from heterogeneous web lists.” Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011, March 14, 2011, 445–54. https://doi.org/10.1145/1935826.1935894.

ICMJE: Machanavajjhala A, Iyer A, Bohannon P, Merugu S. Collective extraction from heterogeneous web lists. Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011. 2011 Mar 14;445–54.

MLA: Machanavajjhala, A., et al. “Collective extraction from heterogeneous web lists.” Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011, Mar. 2011, pp. 445–54. Scopus, doi:10.1145/1935826.1935894.
