
Clustering uncertain data based on probability distribution similarity

Publication, Journal Article
Jiang, B; Pei, J; Tao, Y; Lin, X
Published in: IEEE Transactions on Knowledge and Data Engineering
March 11, 2013

Clustering on uncertain data, one of the essential tasks in mining uncertain data, poses significant challenges in both modeling similarity between uncertain objects and developing efficient computational methods. Previous methods extend traditional partitioning clustering methods such as k-means and density-based clustering methods such as DBSCAN to uncertain data, and thus rely on geometric distances between objects. Such methods cannot handle uncertain objects that are geometrically indistinguishable, such as products with the same mean but very different variances in customer ratings. Surprisingly, probability distributions, which are essential characteristics of uncertain objects, have not been considered in measuring similarity between uncertain objects. In this paper, we systematically model uncertain objects in both continuous and discrete domains, where an uncertain object is modeled as a continuous or a discrete random variable, respectively. We use the well-known Kullback-Leibler (KL) divergence to measure similarity between uncertain objects in both the continuous and discrete cases, and integrate it into partitioning and density-based clustering methods to cluster uncertain objects. Nevertheless, a naïve implementation is very costly. In particular, computing the exact KL divergence in the continuous case is very costly or even infeasible. To tackle the problem, we estimate KL divergence in the continuous case by kernel density estimation and employ the fast Gauss transform technique to further speed up the computation. Our extensive experimental results verify the effectiveness, efficiency, and scalability of our approaches. © 2012 IEEE.
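To make the similarity measure in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. It shows KL divergence for discrete distributions and a simple kernel-density-based estimate for the continuous case. The function names (`kl_discrete`, `gaussian_kde`, `kl_kde_estimate`), the fixed bandwidth, the smoothing constants, and the sample sizes are all assumptions chosen for illustration. The density sums are computed naively in quadratic time; the fast Gauss transform mentioned in the abstract is not implemented here.

```python
import math
import random


def kl_discrete(p, q, eps=1e-12):
    """KL divergence D(p || q) for two discrete distributions.

    p and q are equal-length lists of probabilities. The eps floor on q
    is an illustrative assumption to avoid division by zero.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:  # terms with p_i = 0 contribute nothing
            total += pi * math.log(pi / max(qi, eps))
    return total


def gaussian_kde(sample, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from `sample`."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Naive O(n) sum per query point; no fast Gauss transform here.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density


def kl_kde_estimate(sample_p, sample_q, bandwidth=0.5, floor=1e-300):
    """Estimate D(P || Q) by averaging log(p_hat(x) / q_hat(x)) over sample_p.

    p_hat and q_hat are kernel density estimates. The bandwidth and the
    density floor are assumptions, not values taken from the paper.
    """
    p_hat = gaussian_kde(sample_p, bandwidth)
    q_hat = gaussian_kde(sample_q, bandwidth)
    log_ratios = [
        math.log(max(p_hat(x), floor) / max(q_hat(x), floor)) for x in sample_p
    ]
    return sum(log_ratios) / len(log_ratios)


# Hypothetical uncertain objects: same mean, different variances.
rng = random.Random(0)
narrow = [rng.gauss(0.0, 0.5) for _ in range(300)]
wide = [rng.gauss(0.0, 2.0) for _ in range(300)]
narrow_again = [rng.gauss(0.0, 0.5) for _ in range(300)]

d_diff = kl_kde_estimate(narrow, wide)          # different spread
d_same = kl_kde_estimate(narrow, narrow_again)  # same distribution
```

In this sketch `narrow` and `wide` share a mean, so a distance between means would not separate them, while the estimated divergence `d_diff` should come out clearly larger than `d_same`. KL divergence is asymmetric, so `kl_kde_estimate(a, b)` and `kl_kde_estimate(b, a)` generally differ.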


Published In

IEEE Transactions on Knowledge and Data Engineering

DOI

10.1109/TKDE.2011.221

ISSN

1041-4347

Publication Date

March 11, 2013

Volume

25

Issue

4

Start / End Page

751 / 763

Related Subject Headings

  • Information Systems
  • 46 Information and computing sciences
  • 08 Information and Computing Sciences
 

Citation

APA: Jiang, B., Pei, J., Tao, Y., & Lin, X. (2013). Clustering uncertain data based on probability distribution similarity. IEEE Transactions on Knowledge and Data Engineering, 25(4), 751–763. https://doi.org/10.1109/TKDE.2011.221

Chicago: Jiang, B., J. Pei, Y. Tao, and X. Lin. “Clustering uncertain data based on probability distribution similarity.” IEEE Transactions on Knowledge and Data Engineering 25, no. 4 (March 11, 2013): 751–63. https://doi.org/10.1109/TKDE.2011.221.

ICMJE: Jiang B, Pei J, Tao Y, Lin X. Clustering uncertain data based on probability distribution similarity. IEEE Transactions on Knowledge and Data Engineering. 2013 Mar 11;25(4):751–63.

MLA: Jiang, B., et al. “Clustering uncertain data based on probability distribution similarity.” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 4, Mar. 2013, pp. 751–63. Scopus, doi:10.1109/TKDE.2011.221.

NLM: Jiang B, Pei J, Tao Y, Lin X. Clustering uncertain data based on probability distribution similarity. IEEE Transactions on Knowledge and Data Engineering. 2013 Mar 11;25(4):751–763.
