Highly efficient algorithms for structural clustering of large websites

Publication, Journal Article
Blanco, L; Dalvi, N; Machanavajjhala, A
Published in: Proceedings of the 20th International Conference on World Wide Web, WWW 2011
December 1, 2011

In this paper, we present a highly scalable algorithm for structurally clustering webpages for extraction. We show that, using only the URLs of the webpages and simple content features, it is possible to cluster webpages effectively and efficiently. At the heart of our techniques is a principled framework, grounded in information theory, that allows us to leverage the URLs effectively and combine them with content and structural properties. Using an extensive evaluation over several large full websites, we demonstrate the effectiveness of our techniques at a scale unattainable by previous techniques. Copyright © 2011 by the Association for Computing Machinery, Inc. (ACM).
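
As a rough illustration of what clustering pages by URL alone can look like, the sketch below groups URLs that share the same path shape. It is not the algorithm from the paper: the abstract does not spell out the method, so the wildcard heuristic and the names `url_signature` and `cluster_by_url` are assumptions made purely for illustration, and the information-theoretic framework the abstract mentions is not modeled here.

```python
from collections import defaultdict
from urllib.parse import urlparse


def url_signature(url: str) -> tuple:
    """Map a URL to a coarse structural signature (hypothetical heuristic)."""
    parsed = urlparse(url)
    tokens = [t for t in parsed.path.split("/") if t]
    # Assumption: path tokens containing a digit are instance identifiers,
    # so they are collapsed into a single wildcard.
    shape = tuple("*" if any(c.isdigit() for c in t) else t for t in tokens)
    return (parsed.netloc,) + shape


def cluster_by_url(urls):
    """Group URLs whose signatures match; one pass over the input."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_signature(url)].append(url)
    return dict(clusters)


if __name__ == "__main__":
    pages = [
        "http://example.com/product/123",
        "http://example.com/product/456",
        "http://example.com/about",
    ]
    for signature, members in cluster_by_url(pages).items():
        print(signature, len(members))
```

Grouping by an exact signature keeps the sketch linear in the number of URLs, but it cannot merge clusters whose URL shapes differ slightly. Deciding when to merge or split such clusters is where a principled cost model, such as the one the abstract alludes to, would come in.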

Published In

Proceedings of the 20th International Conference on World Wide Web, WWW 2011

DOI

10.1145/1963405.1963468

Publication Date

December 1, 2011

Start / End Page

437 / 446
 

Citation

APA: Blanco, L., Dalvi, N., & Machanavajjhala, A. (2011). Highly efficient algorithms for structural clustering of large websites. Proceedings of the 20th International Conference on World Wide Web, WWW 2011, 437–446. https://doi.org/10.1145/1963405.1963468
Chicago: Blanco, L., N. Dalvi, and A. Machanavajjhala. “Highly efficient algorithms for structural clustering of large websites.” Proceedings of the 20th International Conference on World Wide Web, WWW 2011, December 1, 2011, 437–46. https://doi.org/10.1145/1963405.1963468.
ICMJE: Blanco L, Dalvi N, Machanavajjhala A. Highly efficient algorithms for structural clustering of large websites. Proceedings of the 20th International Conference on World Wide Web, WWW 2011. 2011 Dec 1;437–46.
MLA: Blanco, L., et al. “Highly efficient algorithms for structural clustering of large websites.” Proceedings of the 20th International Conference on World Wide Web, WWW 2011, Dec. 2011, pp. 437–46. Scopus, doi:10.1145/1963405.1963468.
NLM: Blanco L, Dalvi N, Machanavajjhala A. Highly efficient algorithms for structural clustering of large websites. Proceedings of the 20th International Conference on World Wide Web, WWW 2011. 2011 Dec 1;437–446.
