Scholars@Duke publication: Can we learn a template-independent wrapper for news article extraction from a single training site?

Can we learn a template-independent wrapper for news article extraction from a single training site?

Publication, Conference

Wang, J; Chen, C; Wang, C; Pei, J; Bu, J; Guan, Z; Zhang, WV

Published in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

November 16, 2009

Automatic news extraction from news pages is important in many Web applications such as news aggregation. However, the existing news extraction methods based on template-level wrapper induction have three serious limitations. First, the existing methods cannot correctly extract pages belonging to an unseen template. Second, it is costly to maintain up-to-date wrappers for a large number of news websites, because any change to a template may invalidate the corresponding wrapper. Last, the existing methods can extract only unformatted plain text, and thus are not user friendly. In this paper, we tackle the problem of template-independent Web news extraction in a user-friendly way. We formalize Web news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. Novel features dedicated to news titles and bodies are developed. Correlations between news titles and news bodies are exploited. Our template-independent wrapper can extract news pages from different sites regardless of templates. Moreover, our approach can extract not only text, but also images and animations within the news bodies, and the extracted news articles are in the same visual style as in the original pages. In our experiments, a wrapper learned from 40 pages from a single news site achieved an accuracy of 98.1% on 3,973 news pages from 12 news sites. Copyright 2009 ACM.

Duke Scholars

Author Jian Pei Computer Science

Published In

Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

DOI

10.1145/1557019.1557163

Publication Date

November 16, 2009

Start / End Page

1345 / 1353

Citation

APA: Wang, J., Chen, C., Wang, C., Pei, J., Bu, J., Guan, Z., & Zhang, W. V. (2009). Can we learn a template-independent wrapper for news article extraction from a single training site? In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1345–1353). https://doi.org/10.1145/1557019.1557163

Chicago: Wang, J., C. Chen, C. Wang, J. Pei, J. Bu, Z. Guan, and W. V. Zhang. “Can we learn a template-independent wrapper for news article extraction from a single training site?” In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1345–53, 2009. https://doi.org/10.1145/1557019.1557163.

ICMJE: Wang J, Chen C, Wang C, Pei J, Bu J, Guan Z, et al. Can we learn a template-independent wrapper for news article extraction from a single training site? In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2009. p. 1345–53.

MLA: Wang, J., et al. “Can we learn a template-independent wrapper for news article extraction from a single training site?” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 1345–53. Scopus, doi:10.1145/1557019.1557163.

NLM: Wang J, Chen C, Wang C, Pei J, Bu J, Guan Z, Zhang WV. Can we learn a template-independent wrapper for news article extraction from a single training site? Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2009. p. 1345–1353.
