Protein classification using summaries of profile-based frequency matrices
Detecting or predicting the structural class of a protein from its primary sequence has long been a major objective in bioinformatics. The prevailing belief in the community appears to be that support vector machines (SVMs) are the most effective solution for sequence-based structure prediction. The current state of the art involves SVMs whose kernel functions compute the similarity between proteins from profiles generated by the PSI-BLAST alignment algorithm. While effective for problems such as fold recognition and remote homology detection, these kernels are essentially a "black-box" solution to the structure prediction problem: they do not yield a representation that is independent of the SVM. This prevents the user from testing alternative classification algorithms or from using the features in other applications. For example, a researcher may want a compact representation of a protein sequence that can be used for range queries or nearest-neighbor retrieval. We describe such a representation in this work. Using the frequency scores returned by PSI-BLAST, we create a wavelet-based summary. This stand-alone, normalized feature vector drastically reduces the amount of information that must be stored for each protein. Although our results are preliminary, we find empirically that this representation performs well in fold recognition experiments and provides accuracy comparable to the state of the art for remote homology detection. We also find that it is effective for protein indexing and retrieval.
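To make the idea of a wavelet-based summary concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes the PSI-BLAST frequency scores are already available as an L x 20 array (one row per residue position, one column per amino acid), and it uses a Haar wavelet as a stand-in, since the abstract does not name the wavelet family. The names `haar_dwt`, `wavelet_summary`, `length`, and `keep` are hypothetical, as are the choices to resample each column to 64 points and to retain 8 coefficients per column.

```python
import numpy as np


def haar_dwt(signal):
    """Full orthonormal Haar decomposition of a signal whose length is a power of two.

    The coarsest (lowest-resolution) coefficients come first in the output.
    """
    out = np.asarray(signal, dtype=float).copy()
    n = out.size
    while n > 1:
        half = n // 2
        evens = out[0:n:2].copy()
        odds = out[1:n:2].copy()
        out[:half] = (evens + odds) / np.sqrt(2.0)   # running averages
        out[half:n] = (evens - odds) / np.sqrt(2.0)  # detail coefficients
        n = half
    return out


def wavelet_summary(freq, length=64, keep=8):
    """Summarize an L x 20 frequency matrix as a fixed-size, unit-norm vector.

    Assumptions (illustrative only): each amino-acid column is linearly
    resampled to `length` points (a power of two), transformed with the Haar
    wavelet, and truncated to its `keep` coarsest coefficients.
    """
    freq = np.asarray(freq, dtype=float)
    src = np.arange(freq.shape[0])
    positions = np.linspace(0, freq.shape[0] - 1, length)
    parts = []
    for column in freq.T:
        resampled = np.interp(positions, src, column)
        parts.append(haar_dwt(resampled)[:keep])
    vec = np.concatenate(parts)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def nearest(query, database):
    """Index of the database summary closest to `query` in Euclidean distance."""
    dists = np.linalg.norm(np.asarray(database) - query, axis=1)
    return int(np.argmin(dists))
```

Under these assumptions, every protein maps to a vector of 20 x 8 = 160 numbers regardless of its length, so the summaries can be passed to any classifier or stored in a standard spatial index for range queries and nearest-neighbor retrieval, rather than being tied to an SVM kernel.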