News article extraction with template-independent wrapper

Publication, Conference
Wang, J; He, X; Wang, C; Pei, J; Bu, J; Chen, C; Guan, Z; Gang, L
Published in: WWW'09 - Proceedings of the 18th International World Wide Web Conference
December 1, 2009

We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious limitations. 1) It cannot correctly extract pages belonging to an unseen template until the wrapper for that template has been generated. 2) It is costly to maintain up-to-date wrappers for hundreds of websites, because any change to a template may invalidate the corresponding wrapper. In this paper we formalize news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. We develop novel features dedicated to news titles and to news bodies, and we exploit correlations between the news title and the news body. Our template-independent wrapper can extract news pages from different sites regardless of their templates. In experiments, a wrapper learned from 40 pages from a single news site achieved 98.1% accuracy over 3,973 news pages from 12 news sites.

Published In

WWW'09 - Proceedings of the 18th International World Wide Web Conference

DOI

10.1145/1526709.1526868

Publication Date

December 1, 2009

Start / End Page

1085 / 1086
 

Citation

APA

Wang, J., He, X., Wang, C., Pei, J., Bu, J., Chen, C., … Gang, L. (2009). News article extraction with template-independent wrapper. In WWW’09 - Proceedings of the 18th International World Wide Web Conference (pp. 1085–1086). https://doi.org/10.1145/1526709.1526868

Chicago

Wang, J., X. He, C. Wang, J. Pei, J. Bu, C. Chen, Z. Guan, and L. Gang. “News article extraction with template-independent wrapper.” In WWW’09 - Proceedings of the 18th International World Wide Web Conference, 1085–86, 2009. https://doi.org/10.1145/1526709.1526868.

ICMJE

Wang J, He X, Wang C, Pei J, Bu J, Chen C, et al. News article extraction with template-independent wrapper. In: WWW’09 - Proceedings of the 18th International World Wide Web Conference. 2009. p. 1085–6.

MLA

Wang, J., et al. “News article extraction with template-independent wrapper.” WWW’09 - Proceedings of the 18th International World Wide Web Conference, 2009, pp. 1085–86. Scopus, doi:10.1145/1526709.1526868.

NLM

Wang J, He X, Wang C, Pei J, Bu J, Chen C, Guan Z, Gang L. News article extraction with template-independent wrapper. WWW’09 - Proceedings of the 18th International World Wide Web Conference. 2009. p. 1085–1086.
