Selection of Informative Examples in Chemogenomic Datasets.
High-throughput and high-content screening campaigns have produced large chemogenomic matrices. These matrices form the training data used to build ligand-target interaction models for pharmacological and chemical biology research. While academic, government, and industrial efforts continuously add to the ligand-target data pairs available for modeling, major research efforts are devoted to improving machine learning techniques to cope with the sparseness, heterogeneity, and size of available datasets as well as their inherent noise and bias. This "arms race" has led to algorithms that generate highly complex models with high prediction performance at the cost of training efficiency and interpretability. In contrast, recent studies have challenged the necessity for "big data" in chemogenomic modeling and found that models built on larger numbers of examples do not necessarily yield better predictive ability. Automated adaptive selection of the training data (ligand-target instances) used for model creation can produce considerably smaller training sets that retain prediction performance on par with training on hundreds of thousands of data points. In this chapter, we describe the protocols for one such iterative chemogenomic selection technique, including model construction and update, as well as possible techniques for evaluating the constructed models and analyzing the iterative model construction.
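The abstract does not specify the chapter's actual protocol, so the following is only a minimal sketch of what an iterative, adaptive selection loop can look like in general. It assumes a generic uncertainty-based strategy: fit a model on the examples picked so far, score the remaining pool by how ambiguous each example is, add the most ambiguous one, and repeat. The nearest-centroid model, the margin score, the synthetic two-class pool, and all function names (`make_toy_pool`, `select_iteratively`, and so on) are illustrative assumptions, not the authors' method; in a real setting each row would be a descriptor vector for a ligand-target pair.

```python
import random

def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(X, y):
    """Stand-in model (assumption): one mean vector per class label."""
    return {c: centroid([x for x, lab in zip(X, y) if lab == c]) for c in set(y)}

def predict(model, x):
    return min(model, key=lambda c: sq_dist(model[c], x))

def margin(model, x):
    """Gap between the two nearest class centroids; small means ambiguous."""
    d = sorted(sq_dist(m, x) for m in model.values())
    return d[1] - d[0]

def select_iteratively(X_pool, y_pool, seed_idx, n_rounds):
    """Grow the training set one example per round, always taking the
    pool example the current model is least sure about.
    The seed must contain at least two classes so a margin is defined."""
    picked = list(seed_idx)
    seen = set(picked)
    remaining = [i for i in range(len(X_pool)) if i not in seen]
    for _ in range(n_rounds):
        if not remaining:
            break
        model = fit([X_pool[i] for i in picked], [y_pool[i] for i in picked])
        nxt = min(remaining, key=lambda i: margin(model, X_pool[i]))
        picked.append(nxt)
        remaining.remove(nxt)
    # Final model update on everything that was selected.
    model = fit([X_pool[i] for i in picked], [y_pool[i] for i in picked])
    return model, picked

def make_toy_pool(n, rng):
    """Synthetic data only: two Gaussian blobs with alternating labels."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        center = 3.0 * label
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y

rng = random.Random(0)
X, y = make_toy_pool(200, rng)
model, picked = select_iteratively(X, y, seed_idx=[0, 1], n_rounds=10)

# Evaluate on the pool examples that were never selected.
picked_set = set(picked)
rest = [i for i in range(len(X)) if i not in picked_set]
accuracy = sum(predict(model, X[i]) == y[i] for i in rest) / len(rest)
print(f"selected {len(picked)} of {len(X)} examples, accuracy on the rest: {accuracy:.2f}")
```

Recording `picked` in selection order is a deliberate choice in this sketch: it makes it possible to re-fit the model on each prefix of the list and trace how performance changes as the training set grows, which is one simple way to analyze an iterative construction.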