Document clustering of scientific texts using citation contexts

Publication, Journal Article
Aljaber, B; Stokes, N; Bailey, J; Pei, J
Published in: Information Retrieval
April 1, 2010

Document clustering has many important applications in the area of data mining and information retrieval. Many existing document clustering techniques use the bag-of-words model to represent the content of a document. However, this representation is only effective for grouping related documents when these documents share a large proportion of lexically equivalent terms. In other words, instances of synonymy between related documents are ignored, which can reduce the effectiveness of applications using a standard full-text document representation. To address this problem, we present a new approach for clustering scientific documents, based on the utilization of citation contexts. A citation context is essentially the text surrounding the reference markers used to refer to other scientific works. We hypothesize that citation contexts will provide relevant synonymous and related vocabulary which will help increase the effectiveness of the bag-of-words representation. In this paper, we investigate the power of these citation-specific word features, and compare them with the original document's textual representation in a document clustering task on two collections of labeled scientific journal papers from two distinct domains: High Energy Physics and Genomics. We also compare these text-based clustering techniques with a link-based clustering algorithm which determines the similarity between documents based on the number of co-citations, that is, in-links represented by citing documents and out-links represented by cited documents. Our experimental results indicate that the use of citation contexts, when combined with the vocabulary in the full-text of the document, is a promising alternative means of capturing critical topics covered by journal articles. More specifically, this document representation strategy, when used by the clustering algorithm investigated in this paper, outperforms both the full-text clustering approach and the link-based clustering technique on both scientific journal datasets. © 2009 Springer Science+Business Media, LLC.
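
The idea of enriching a bag-of-words representation with citation contexts can be illustrated with a minimal sketch. This is not the authors' implementation: the bracketed marker format `[n]`, the fixed word window, the toy texts, and the function names `citation_contexts` and `cosine` are all assumptions chosen for illustration. Two documents that share no full-text vocabulary become comparable once the words surrounding their reference markers in a citing paper are added to their term counts.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z]+")      # assumed tokenizer: lowercase alphabetic runs
MARKER = re.compile(r"\[(\d+)\]")  # assumed reference marker format, e.g. "[3]"

def tokenize(text):
    return TOKEN.findall(text.lower())

def citation_contexts(citing_text, window=2):
    """Map each reference number to the words within `window`
    whitespace-separated tokens on either side of its marker."""
    contexts = {}
    parts = citing_text.split()
    for i, part in enumerate(parts):
        match = MARKER.search(part)
        if match:
            ref = int(match.group(1))
            nearby = parts[max(0, i - window):i] + parts[i + 1:i + 1 + window]
            contexts.setdefault(ref, []).extend(tokenize(" ".join(nearby)))
    return contexts

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy data: two related documents with no lexically equivalent terms.
full_text = {
    1: "tumor suppressor expression",
    2: "neoplasm inhibitor transcription",
}
citing = "Studies of cancer genes [1] and related cancer genes [2] agree"

contexts = citation_contexts(citing)

# Full-text bag-of-words only.
full = {ref: Counter(tokenize(text)) for ref, text in full_text.items()}
full_sim = cosine(full[1], full[2])

# Full text combined with the citation-context vocabulary.
combined = {ref: full[ref] + Counter(contexts.get(ref, [])) for ref in full}
combined_sim = cosine(combined[1], combined[2])

print(full_sim, combined_sim)  # the second value is greater than zero
```

In this toy case the full-text similarity is zero because the two documents use different words for related concepts, while the combined representation picks up the shared terms "cancer" and "genes" from the citing sentence. A real pipeline would also need stop-word removal, term weighting, and a clustering algorithm on top of the similarity measure.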

Published In

Information Retrieval

DOI

10.1007/s10791-009-9108-x

EISSN

1573-7659

ISSN

1386-4564

Publication Date

April 1, 2010

Volume

13

Issue

2

Start / End Page

101 / 131

Related Subject Headings

  • Information & Library Sciences
  • 46 Information and computing sciences
  • 0807 Library and Information Studies
  • 0806 Information Systems
  • 0804 Data Format
 

Citation

APA

Aljaber, B., Stokes, N., Bailey, J., & Pei, J. (2010). Document clustering of scientific texts using citation contexts. Information Retrieval, 13(2), 101–131. https://doi.org/10.1007/s10791-009-9108-x

Chicago

Aljaber, B., N. Stokes, J. Bailey, and J. Pei. “Document clustering of scientific texts using citation contexts.” Information Retrieval 13, no. 2 (April 1, 2010): 101–31. https://doi.org/10.1007/s10791-009-9108-x.

ICMJE

Aljaber B, Stokes N, Bailey J, Pei J. Document clustering of scientific texts using citation contexts. Information Retrieval. 2010 Apr 1;13(2):101–31.

MLA

Aljaber, B., et al. “Document clustering of scientific texts using citation contexts.” Information Retrieval, vol. 13, no. 2, Apr. 2010, pp. 101–31. Scopus, doi:10.1007/s10791-009-9108-x.

NLM

Aljaber B, Stokes N, Bailey J, Pei J. Document clustering of scientific texts using citation contexts. Information Retrieval. 2010 Apr 1;13(2):101–131.