Biomarker clustering to address correlations in proteomic data.


Journal Article

Correlated variables have been shown to confound statistical analyses in microarray experiments. The same effect applies to an even greater degree in proteomics, especially with the use of MS for parallel measurements. Biological effects such as PTM, fragmentation, and multimer formation can produce strongly correlated variables. The problem is compounded in some types of MS by technical effects such as incomplete chromatographic separation, binding to multiple surfaces, or multiple ionizations. Existing methods for dimension reduction, notably principal components analysis and related techniques, are not always satisfactory because they produce data that often lack clear biological interpretation. We propose a preprocessing algorithm that clusters highly correlated features, using the Bayes information criterion to select an optimal number of clusters. Statistical analysis of clusters, instead of individual features, benefits from lower noise, and reduces the difficulties associated with strongly correlated data. This preprocessing increases the statistical power of analyses using false discovery rate on simulated data. Strong correlations are often present in real data, and we find that clustering improves biomarker discovery in clinical SELDI-TOF-MS datasets of plasma from patients with Kawasaki disease, and bone-marrow cell extracts from patients with acute myeloid or acute lymphoblastic leukemia.
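The preprocessing idea in the abstract (group highly correlated features, then choose the number of clusters with an information criterion) can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm or the likelihood model, so everything below is an assumption made for illustration: average-linkage agglomerative merging on Pearson correlation, a model in which each standardized feature equals its cluster's mean profile plus Gaussian noise with one shared variance, and the hypothetical function names `standardize`, `mean_correlation`, `bic`, and `cluster_features`. The all-singletons partition is skipped because it fits perfectly and makes the Gaussian likelihood degenerate.

```python
import math
import random


def standardize(x):
    """Scale one feature to zero mean and unit variance (assumes it is not constant)."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / sd for v in x]


def mean_correlation(z, group_a, group_b):
    """Average Pearson correlation between standardized features of two groups."""
    n = len(z[0])
    pairs = [(i, j) for i in group_a for j in group_b]
    return sum(sum(u * v for u, v in zip(z[i], z[j])) / n for i, j in pairs) / len(pairs)


def bic(z, clusters):
    """BIC when each feature is modeled as its cluster mean profile plus Gaussian noise.

    Assumes the residual sum of squares is positive (no exact duplicate features).
    """
    n = len(z[0])
    n_obs = len(z) * n
    rss = 0.0
    for members in clusters:
        centroid = [sum(z[j][i] for j in members) / len(members) for i in range(n)]
        rss += sum((z[j][i] - centroid[i]) ** 2 for j in members for i in range(n))
    n_params = len(clusters) * n + 1  # one mean profile per cluster, one shared variance
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)


def cluster_features(features):
    """Merge the most correlated groups step by step; keep the partition with lowest BIC."""
    z = [standardize(f) for f in features]
    clusters = [[j] for j in range(len(z))]
    best_score, best_clusters = math.inf, None
    while len(clusters) > 1:
        # Find the pair of groups with the highest average correlation.
        best_pair, best_r = None, -math.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                r = mean_correlation(z, clusters[a], clusters[b])
                if r > best_r:
                    best_pair, best_r = (a, b), r
        merged = clusters[best_pair[0]] + clusters[best_pair[1]]
        clusters = [c for idx, c in enumerate(clusters) if idx not in best_pair] + [merged]
        # Score the partition obtained after this merge.
        score = bic(z, clusters)
        if score < best_score:
            best_score, best_clusters = score, [sorted(c) for c in clusters]
    return sorted(best_clusters)


# Simulated data (an assumption for the demo): two latent signals, each observed
# through three noisy features, mimicking e.g. several peaks from one protein.
rng = random.Random(0)
n_samples = 40
latent_a = [rng.gauss(0, 1) for _ in range(n_samples)]
latent_b = [rng.gauss(0, 1) for _ in range(n_samples)]
sources = (latent_a, latent_a, latent_a, latent_b, latent_b, latent_b)
features = [[v + rng.gauss(0, 0.3) for v in latent] for latent in sources]

clusters = cluster_features(features)
print(clusters)
```

On this simulated input the sketch groups features 0 to 2 and 3 to 5 into two clusters. Downstream tests would then be run on one summary value per cluster (for example, the mean standardized profile) rather than on each feature, which is the step the abstract credits with lower noise and higher statistical power. Using signed correlation means that anticorrelated features are not merged; handling them would require a sign flip before averaging.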

Cited Authors

  • Carlson, SM; Najmi, A; Cohen, HJ

Published Date

  • April 2007

Published In

Volume / Issue

  • 7 / 7

Start / End Page

  • 1037 - 1046

PubMed ID

  • 17390293

Electronic International Standard Serial Number (EISSN)

  • 1615-9861

International Standard Serial Number (ISSN)

  • 1615-9853

Digital Object Identifier (DOI)

  • 10.1002/pmic.200600514


Language

  • eng