Mining frequent cross-graph quasi-cliques
Joint mining of multiple datasets can often discover interesting, novel, and reliable patterns which cannot be obtained solely from any single source. For example, in bioinformatics, jointly mining multiple gene expression datasets obtained by different labs or during various biological processes may overcome the heavy noise in the data. Moreover, by joint mining of gene expression data and protein-protein interaction data, we may discover clusters of genes which show coherent expression patterns and also produce interacting proteins. Such clusters may be potential pathways. In this article, we investigate a novel data mining problem, mining frequent cross-graph quasi-cliques, which is generalized from several interesting applications in bioinformatics, cross-market customer segmentation, social network analysis, and Web mining. In a graph, a set of vertices S is a -quasi-clique (0 < 1) if each vertex v in S directly connects to at least (S - 1) other vertices in S. Given a set of graphs G1, , Gn and parameter min-sup (0 < min-sup 1), a set of vertices S is a frequent cross-graph quasi-clique if S is a -quasi-clique in at least min-sup n graphs, and there does not exist a proper superset of S having the property. We build a general model, show why the complete set of frequent cross-graph quasi-cliques cannot be found by previous data mining methods, and study the complexity of the problem. While the problem is difficult, we develop practical algorithms which exploit several interesting and effective techniques and heuristics to efficaciously mine frequent cross-graph quasi-cliques. A systematic performance study is reported on both synthetic and real data sets. We demonstrate some interesting and meaningful frequent cross-graph quasi-cliques in bioinformatics. The experimental results also show that our algorithms are efficient and scalable. © 2009 ACM.
Duke Scholars
Published In
DOI
EISSN
ISSN
Publication Date
Volume
Issue
Related Subject Headings
- Artificial Intelligence & Image Processing
- 4606 Distributed computing and systems software
- 4605 Data management and data science
- 4604 Cybersecurity and privacy
- 0806 Information Systems
- 0801 Artificial Intelligence and Image Processing
Citation
Published In
DOI
EISSN
ISSN
Publication Date
Volume
Issue
Related Subject Headings
- Artificial Intelligence & Image Processing
- 4606 Distributed computing and systems software
- 4605 Data management and data science
- 4604 Cybersecurity and privacy
- 0806 Information Systems
- 0801 Artificial Intelligence and Image Processing