Drug discovery using very large numbers of patents: general strategy with extensive use of match and edit operations.
A patent database of 6.7 million compounds, generated by a very high performance computer (Blue Gene), requires new techniques for exploitation when chemical similarity is used extensively. Such exploitation includes the taxonomic classification of chemical themes and data mining to assess mutual information between themes and companies. Importantly, we also launch candidates that evolve by "natural selection", where selection favors failure to partially match the patent database together with the ability to bind appropriately to the protein target, as assessed by simulation on Blue Gene. An unusual feature of our method is that algorithms and workflows rely on dynamic interaction between match-and-edit instructions, which in practice are regular expressions. Similarity testing with these instructions uses SMILES strings and, less frequently, graph or connectivity representations. Examining how this performs in high throughput, we note that chemical similarity and novelty are human concepts whose meaning derives largely from their utility in specific contexts. For some purposes, mutual information involving chemical themes might be a better concept.
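The match-and-edit idea and the mutual information measure mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the theme patterns, the `edit_halogen` rule, the function names, and the toy records below are all hypothetical, chosen only to show how regular expressions over SMILES text could flag themes, apply an edit, and feed a mutual information estimate between a theme and a company.

```python
import math
import re
from collections import Counter

# Hypothetical toy "themes": regular expressions over SMILES text.
# Real themes would need far more careful patterns than these.
THEMES = {
    "amide": re.compile(r"C\(=O\)N"),
    "chloro": re.compile(r"Cl"),
}


def match_themes(smiles):
    """Match step: return the names of themes whose pattern occurs in the string."""
    return {name for name, pattern in THEMES.items() if pattern.search(smiles)}


def edit_halogen(smiles):
    """Edit step (illustrative assumption): swap every chlorine for fluorine."""
    return re.sub(r"Cl", "F", smiles)


def mutual_information(pairs):
    """Mutual information, in bits, between two discrete variables.

    `pairs` is a list of (x, y) observations; probabilities are plain
    relative frequencies, so this is only sensible for toy data.
    """
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (count / n) * math.log2((count / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), count in joint.items()
    )


# Made-up (SMILES, company) records, not taken from any patent database.
records = [
    ("CC(=O)Nc1ccccc1", "A"),
    ("CC(=O)NC", "A"),
    ("Clc1ccccc1", "B"),
    ("CCCl", "B"),
]

# Pair "does the compound carry the amide theme?" with the owning company.
amide_vs_company = [("amide" in match_themes(s), c) for s, c in records]
mi_bits = mutual_information(amide_vs_company)

# A candidate edited so that it no longer matches the "chloro" theme.
candidate = edit_halogen("Clc1ccccc1")
candidate_themes = match_themes(candidate)
```

In this toy data the amide theme appears only in company A's compounds, so the theme and the company are perfectly associated and `mi_bits` is 1 bit. One caveat on the design: purely textual matching depends on how each SMILES string happens to be written, since the same molecule can have several valid SMILES forms, which may be one reason graph or connectivity representations are sometimes needed.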
Related Subject Headings
- Small Molecule Libraries
- Pattern Recognition, Automated
- Patents as Topic
- Medicinal & Biomolecular Chemistry
- Information Storage and Retrieval
- Image Interpretation, Computer-Assisted
- Humans
- Drug Discovery
- Databases, Factual
- Data Interpretation, Statistical