An application of Random Forests to a genome-wide association dataset: methodological considerations & new findings.

Publication, Journal Article
Goldstein, BA; Hubbard, AE; Cutler, A; Barcellos, LF
Published in: BMC Genet
June 14, 2010

BACKGROUND: As computational power improves, the application of more advanced machine learning techniques to the analysis of large genome-wide association (GWA) datasets becomes possible. While most traditional statistical methods can only elucidate main effects of genetic variants on risk for disease, certain machine learning approaches are particularly suited to discover higher order and non-linear effects. One such approach is the Random Forests (RF) algorithm. The use of RF for SNP discovery related to human disease has grown in recent years; however, most work has focused on small datasets or simulation studies which are limited.

RESULTS: Using a multiple sclerosis (MS) case-control dataset comprised of 300 K SNP genotypes across the genome, we outline an approach and some considerations for optimally tuning the RF algorithm based on the empirical dataset. Importantly, results show that typical default parameter values are not appropriate for large GWA datasets. Furthermore, gains can be made by sub-sampling the data, pruning based on linkage disequilibrium (LD), and removing strong effects from RF analyses. The new RF results are compared to findings from the original MS GWA study and demonstrate overlap. In addition, four new interesting candidate MS genes are identified, MPHOSPH9, CTNNA3, PHACTR2 and IL7, by RF analysis and warrant further follow-up in independent studies.

CONCLUSIONS: This study presents one of the first illustrations of successfully analyzing GWA data with a machine learning algorithm. It is shown that RF is computationally feasible for GWA data and the results obtained make biologic sense based on previous studies. More importantly, new genes were identified as potentially being associated with MS, suggesting new avenues of investigation for this complex disease.
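
As a rough illustration of the LD-pruning step mentioned in the results, the sketch below greedily drops any SNP that is highly correlated with a nearby SNP already kept. The abstract does not state the LD measure, threshold, or window the authors used, so the squared Pearson correlation, the `threshold=0.8` cutoff, the `window=50` span, and the function names `r_squared` and `ld_prune` are all assumptions made for this example, not the study's procedure.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Monomorphic SNP: correlation is undefined, so treat it as unlinked.
        return 0.0
    return cov * cov / (var_x * var_y)


def ld_prune(genotypes, threshold=0.8, window=50):
    """Return indices of SNPs kept after greedy LD pruning.

    `genotypes` is a list of SNP columns in genomic order, each a list of
    0/1/2 minor-allele counts per individual. A SNP is kept only if its r^2
    with every already-kept SNP within `window` positions is <= `threshold`.
    Both defaults are hypothetical values chosen for illustration.
    """
    kept = []
    for i, snp in enumerate(genotypes):
        neighbours = [j for j in kept if i - j <= window]
        if all(r_squared(snp, genotypes[j]) <= threshold for j in neighbours):
            kept.append(i)
    return kept


# Toy data: six individuals, three SNPs. SNP 1 duplicates SNP 0 (r^2 = 1),
# while SNP 2 is only weakly correlated with SNP 0 (r^2 = 0.25).
snps = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 0, 2],
    [2, 0, 1, 0, 2, 1],
]
kept = ld_prune(snps)  # SNP 1 is pruned; SNPs 0 and 2 remain
```

Pruning of this kind reduces the number of redundant predictors competing at each tree split, which is one plausible reason it could help an RF analysis of dense SNP data; established tools offer more efficient implementations for genome-scale input.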


Published In

BMC Genet

DOI

10.1186/1471-2156-11-49

EISSN

1471-2156

Publication Date

June 14, 2010

Volume

11

Start / End Page

49

Location

England

Related Subject Headings

  • Polymorphism, Single Nucleotide
  • Multiple Sclerosis
  • Humans
  • Genotype
  • Genome-Wide Association Study
  • Genetics & Heredity
  • Genetic Predisposition to Disease
  • Feasibility Studies
  • Computational Biology
  • Artificial Intelligence
 

Citation

APA
Goldstein, B. A., Hubbard, A. E., Cutler, A., & Barcellos, L. F. (2010). An application of Random Forests to a genome-wide association dataset: methodological considerations & new findings. BMC Genet, 11, 49. https://doi.org/10.1186/1471-2156-11-49

Chicago
Goldstein, Benjamin A., Alan E. Hubbard, Adele Cutler, and Lisa F. Barcellos. “An application of Random Forests to a genome-wide association dataset: methodological considerations & new findings.” BMC Genet 11 (June 14, 2010): 49. https://doi.org/10.1186/1471-2156-11-49.

ICMJE
Goldstein BA, Hubbard AE, Cutler A, Barcellos LF. An application of Random Forests to a genome-wide association dataset: methodological considerations & new findings. BMC Genet. 2010 Jun 14;11:49.

MLA
Goldstein, Benjamin A., et al. “An application of Random Forests to a genome-wide association dataset: methodological considerations & new findings.” BMC Genet, vol. 11, June 2010, p. 49. PubMed, doi:10.1186/1471-2156-11-49.

NLM
Goldstein BA, Hubbard AE, Cutler A, Barcellos LF. An application of Random Forests to a genome-wide association dataset: methodological considerations & new findings. BMC Genet. 2010 Jun 14;11:49.