Sequence features of DNA binding sites reveal structural class of associated transcription factor.

Published

Journal Article

MOTIVATION: A key goal in molecular biology is to understand the mechanisms by which a cell regulates the transcription of its genes. One important aspect of this transcriptional regulation is the binding of transcription factors (TFs) to their specific cis-regulatory counterparts on the DNA. TFs recognize and bind their DNA counterparts according to the structure of their DNA-binding domains (e.g. zinc finger, leucine zipper, homeodomain). The structure of these domains can be used as a basis for grouping TFs into classes. Although the structure of DNA-binding domains varies widely across TFs generally, the TFs within a particular class bind to DNA in a similar fashion, suggesting the existence of class-specific features in the DNA sequences bound by each class of TFs.

RESULTS: In this paper, we apply a sparse Bayesian learning algorithm to identify a small set of class-specific features in the DNA sequences bound by different classes of TFs; the algorithm simultaneously learns a true multi-class classifier that uses these features to predict the DNA-binding domain of the TF that recognizes a particular set of DNA sequences. We train our algorithm on the six largest classes in TRANSFAC, comprising a total of 587 TFs. We learn a six-class classifier for this training set that achieves 87% leave-one-out cross-validation accuracy. We also identify features within cis-regulatory sequences that are highly specific to each class of TF, which has significant implications for how TF binding sites should be modeled for the purpose of motif discovery.
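The evaluation protocol described in the results (sequence-derived features per TF, a multi-class classifier, leave-one-out cross-validation) can be illustrated with a minimal sketch. It is not the paper's method: the sparse Bayesian learner is replaced here by a plain nearest-centroid classifier, the features are simple dinucleotide frequencies, and the binding sites, class labels, and all function names (`kmer_features`, `nearest_centroid`, `loo_accuracy`) are made up for illustration.

```python
from collections import Counter
from itertools import product

K = 2  # hypothetical k-mer length, chosen only for illustration
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_features(sites):
    """Return the k-mer frequency vector pooled over one TF's binding sites."""
    counts = Counter()
    total = 0
    for site in sites:
        for i in range(len(site) - K + 1):
            counts[site[i:i + K]] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in KMERS]


def nearest_centroid(train, query):
    """Predict the class whose mean feature vector is closest to the query.

    This is a stand-in classifier, not the sparse Bayesian learner.
    """
    by_class = {}
    for vec, label in train:
        by_class.setdefault(label, []).append(vec)
    best_label, best_dist = None, float("inf")
    for label, vecs in by_class.items():
        centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
        dist = sum((a - b) ** 2 for a, b in zip(centroid, query))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(examples):
    """Leave-one-out CV: hold out each TF once and train on all the others."""
    correct = 0
    for i, (vec, label) in enumerate(examples):
        train = examples[:i] + examples[i + 1:]
        if nearest_centroid(train, vec) == label:
            correct += 1
    return correct / len(examples)


# Toy, invented data: each entry is (binding sites of one TF, its class).
TOY = [
    (["GCGCGC", "CGCGGC"], "class_A"),
    (["GGCGCG", "GCGCCG"], "class_A"),
    (["CGCGCG", "GCGGCG"], "class_A"),
    (["TATAAT", "ATATTA"], "class_B"),
    (["TTATAT", "ATATAA"], "class_B"),
    (["AATATA", "TATATT"], "class_B"),
]
examples = [(kmer_features(sites), label) for sites, label in TOY]

if __name__ == "__main__":
    print(f"LOO accuracy on toy data: {loo_accuracy(examples):.2f}")
```

Each held-out TF is classified by a model that never saw it, which is what makes the leave-one-out figure an estimate of performance on unseen TFs rather than a training fit.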

Cited Authors

  • Narlikar, L; Hartemink, AJ

Published Date

  • January 2006

Published In

  • Bioinformatics

Volume / Issue

  • 22 / 2

Start / End Page

  • 157 - 163

PubMed ID

  • 16267080

PubMed Central ID

  • 16267080

Electronic International Standard Serial Number (EISSN)

  • 1367-4811

International Standard Serial Number (ISSN)

  • 1367-4803

Digital Object Identifier (DOI)

  • 10.1093/bioinformatics/bti731

Language

  • eng