Prediction and functional characterization of transcriptional activation domains
Gene expression is induced by transcription factors (TFs) through their activation domains (ADs). However, ADs are poorly conserved, intrinsically disordered sequences that lack stable secondary structure, which makes these regions difficult to recognize and predict and limits our ability to identify TFs. Here, we address this challenge with a neural network approach that systematically predicts ADs. As input to the network, we used computed properties of amino acid (AA) side chains and secondary structure rather than the raw sequence. Moreover, to shed light on the features learned by the network and improve its interpretability, we identified the input properties that contribute most to accurate prediction. Our findings further highlight the importance of aromatic and negatively charged AAs and reveal the importance of previously unrecognized AA properties. Using these most important features, we applied an unsupervised learning approach to classify the ADs into 10 subclasses, which can be further explored for AA specificity and AD functionality. Overall, our pipeline, which combines supervised and unsupervised machine learning, sheds light on the non-linear properties of ADs.
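To make the idea of property-based input concrete, the following is a minimal sketch, not the authors' implementation, of encoding a protein sequence as per-residue physicochemical properties instead of raw letters. The choice of properties (Kyte-Doolittle hydropathy, side-chain charge at neutral pH, aromaticity), the function names `encode_sequence` and `summarize`, and the example sequence are illustrative assumptions; the actual feature set used in this work may differ.

```python
from typing import Dict, List, Tuple

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
HYDROPATHY: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Side-chain charge at neutral pH (assumption: histidine treated as neutral).
CHARGE: Dict[str, int] = {"D": -1, "E": -1, "K": 1, "R": 1}

# Aromatic residues (assumption: F, W, Y only).
AROMATIC = set("FWY")


def encode_sequence(seq: str) -> List[Tuple[float, int, int]]:
    """Map each residue to (hydropathy, charge, is_aromatic)."""
    features = []
    for aa in seq.upper():
        if aa not in HYDROPATHY:
            raise ValueError(f"Unknown amino acid: {aa!r}")
        features.append((HYDROPATHY[aa], CHARGE.get(aa, 0), int(aa in AROMATIC)))
    return features


def summarize(seq: str) -> Dict[str, float]:
    """Aggregate per-residue properties over a non-empty sequence."""
    feats = encode_sequence(seq)
    n = len(feats)
    return {
        "mean_hydropathy": sum(f[0] for f in feats) / n,
        "net_charge": sum(f[1] for f in feats),
        "aromatic_fraction": sum(f[2] for f in feats) / n,
    }


if __name__ == "__main__":
    # Hypothetical acidic, aromatic-rich fragment used only for illustration.
    print(summarize("DDEFW"))
```

A per-residue matrix of this kind could serve as input to a sequence model, while the aggregated summary is the sort of compact feature vector that a clustering method could group into subclasses.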