On the use of structure and sequence-based features for protein classification and retrieval

Publication, Conference

Marsolo, K; Parthasarathy, S

Published in: Proceedings - IEEE International Conference on Data Mining, ICDM

December 1, 2006

The need to retrieve or classify protein molecules using structure- or sequence-based similarity measures underlies a wide range of biomedical applications. In drug discovery, researchers search for proteins that share specific chemical properties as possible sources of new treatments. In folding simulations, similar intermediate structures might indicate a common folding pathway. To derive any type of similarity, however, one must have an effective model of the protein that allows for easy comparison. In this work, we present two normalized, stand-alone representations of proteins that enable fast and efficient object retrieval based on sequence or structure. To create our sequence-based representation, we take the frequency and scoring matrices returned by the PSI-BLAST alignment algorithm and create a normalized summary using a discrete wavelet transform. Our structural descriptor is constructed using an algorithm we developed previously. First, we transform each 3D structure into a 2D distance matrix by calculating the pair-wise distance between the amino acids of a protein. We then normalize this matrix and apply a 2D wavelet decomposition to generate a set of approximation coefficients, which serve as our feature vector. We also concatenate the sequence and structural descriptors to create a hybrid representation. We evaluate the generality of our models by using them as database indices for nearest-neighbor and range-based retrieval experiments, as well as feature vectors for classification using support vector machines. We find that our methods provide excellent performance when compared with the current state-of-the-art techniques for each task. Our results show that the sequence-based representation is on par with, or outperforms, the structure-based representation. Moreover, we find that in the classification context, the hybrid strategy affords a significant improvement over sequence or structure alone. © 2006 IEEE.
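The structural descriptor outlined in the abstract (pair-wise distance matrix, normalization, 2D wavelet decomposition, approximation coefficients as the feature vector) can be illustrated with a minimal Python sketch. This is not the authors' implementation, and several choices are assumptions made only for illustration: the abstract does not specify the normalization step, so nearest-neighbor resampling to a fixed size stands in for it; the Haar wavelet, the matrix size, and the number of decomposition levels are likewise hypothetical; and all function names are invented here. Retrieval is shown as a plain linear scan with Euclidean distance rather than a database index.

```python
import math


def distance_matrix(coords):
    """Pair-wise Euclidean distances between residue coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def resize_nearest(matrix, size):
    """Resample a square matrix to size x size by nearest-neighbor lookup.

    Assumption: this stands in for the normalization step, which the
    abstract does not describe in detail.
    """
    n = len(matrix)
    idx = [min(n - 1, (i * n) // size) for i in range(size)]
    return [[matrix[r][c] for c in idx] for r in idx]


def haar_approx_2d(matrix):
    """One level of a 2D Haar transform, keeping only approximation coefficients.

    With the orthonormal Haar filter, each coefficient is the sum of a
    2x2 block divided by 2.
    """
    half = len(matrix) // 2
    return [
        [
            (matrix[2 * i][2 * j] + matrix[2 * i][2 * j + 1]
             + matrix[2 * i + 1][2 * j] + matrix[2 * i + 1][2 * j + 1]) / 2.0
            for j in range(half)
        ]
        for i in range(half)
    ]


def structure_descriptor(coords, size=8, levels=2):
    """Flattened approximation coefficients of the normalized distance matrix.

    size must be divisible by 2 ** levels (hypothetical parameter choices).
    """
    m = resize_nearest(distance_matrix(coords), size)
    for _ in range(levels):
        m = haar_approx_2d(m)
    return [value for row in m for value in row]


def nearest_neighbor(query, database):
    """Name of the database entry whose descriptor is closest to the query."""
    return min(database, key=lambda name: math.dist(query, database[name]))


# Toy coordinates, not real protein structures.
extended = [(3.8 * i, 0.0, 0.0) for i in range(10)]
zigzag = [(1.0 * i, 3.0 * (i % 2), 0.0) for i in range(10)]
database = {
    "extended": structure_descriptor(extended),
    "zigzag": structure_descriptor(zigzag),
}

# A translated copy of the extended chain has the same internal distances.
shifted = [(x + 5.0, y + 1.0, z + 2.0) for x, y, z in extended]
match = nearest_neighbor(structure_descriptor(shifted), database)
```

Because the descriptor is built from internal distances only, it does not change when a structure is translated or rotated, which is why the shifted chain is matched to its untranslated counterpart. Under the same assumptions, a hybrid descriptor would simply be the concatenation of this vector with a sequence-based one.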

Duke Scholars

Author: Keith Allen Marsolo, Population Health Sciences

Published In

Proceedings - IEEE International Conference on Data Mining, ICDM

DOI

10.1109/ICDM.2006.119

ISSN

1550-4786

ISBN

9780769527017

Publication Date

December 1, 2006

Start / End Page

394 / 403

Citation

APA

Marsolo, K., & Parthasarathy, S. (2006). On the use of structure and sequence-based features for protein classification and retrieval. In Proceedings - IEEE International Conference on Data Mining, ICDM (pp. 394–403). https://doi.org/10.1109/ICDM.2006.119

Chicago

Marsolo, K., and S. Parthasarathy. “On the use of structure and sequence-based features for protein classification and retrieval.” In Proceedings - IEEE International Conference on Data Mining, ICDM, 394–403, 2006. https://doi.org/10.1109/ICDM.2006.119.

ICMJE

Marsolo K, Parthasarathy S. On the use of structure and sequence-based features for protein classification and retrieval. In: Proceedings - IEEE International Conference on Data Mining, ICDM. 2006. p. 394–403.

MLA

Marsolo, K., and S. Parthasarathy. “On the use of structure and sequence-based features for protein classification and retrieval.” Proceedings - IEEE International Conference on Data Mining, ICDM, 2006, pp. 394–403. Scopus, doi:10.1109/ICDM.2006.119.

NLM

Marsolo K, Parthasarathy S. On the use of structure and sequence-based features for protein classification and retrieval. Proceedings - IEEE International Conference on Data Mining, ICDM. 2006. p. 394–403.
