fastJT: An R package for robust and efficient feature selection for machine learning and genome-wide association studies.

Publication, Journal Article
Lin, J; Sibley, A; Shterev, I; Nixon, A; Innocenti, F; Chan, C; Owzar, K
Published in: BMC Bioinformatics
June 13, 2019

BACKGROUND: Parametric feature selection methods for machine learning and association studies based on genetic data are not robust with respect to outliers or influential observations. While rank-based, distribution-free statistics offer a robust alternative to parametric methods, their practical utility can be limited, as they demand significant computational resources when analyzing high-dimensional data. For genetic studies that seek to identify variants, the hypothesis is constrained, since it is typically assumed that the effect of the genotype on the phenotype is monotone (e.g., an additive genetic effect). Similarly, predictors for machine learning applications may have natural ordering constraints. Cross-validation for feature selection in these high-dimensional contexts necessitates highly efficient computational algorithms for the robust evaluation of many features.

RESULTS: We have developed an R extension package, fastJT, for conducting genome-wide association studies and feature selection for machine learning using the Jonckheere-Terpstra statistic for constrained hypotheses. The kernel of the package features an efficient algorithm for calculating the statistics, replacing the pairwise comparison and counting processes with a data sorting and searching procedure, reducing computational complexity from O(n^2) to O(n log(n)). The computational efficiency is demonstrated through extensive benchmarking, and example applications to real data are presented.

CONCLUSIONS: fastJT is an open-source R extension package, applying the Jonckheere-Terpstra statistic for robust feature selection for machine learning and association studies. The package implements an efficient algorithm which leverages internal information among the samples to avoid unnecessary computations, and incorporates shared-memory parallel programming to further boost performance on multi-core machines.
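
To make the complexity claim concrete, the following Python sketch contrasts a pairwise-counting version of the Jonckheere-Terpstra statistic with a sort-and-search version. It is an illustration only and is not taken from the fastJT source code: the function names `jt_pairwise` and `jt_sorted`, the convention of counting ties as 0.5, and the use of binary search via the standard `bisect` module are all assumptions made for this example. Each element of `groups` holds the phenotype values for one ordered category (for example, genotype 0, 1, 2).

```python
from bisect import bisect_left, bisect_right


def jt_pairwise(groups):
    """Jonckheere-Terpstra statistic by comparing every cross-group pair.

    Runs in O(n^2) time for n total observations.
    """
    total = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            for x in groups[i]:
                for y in groups[j]:
                    if x < y:
                        total += 1.0
                    elif x == y:
                        total += 0.5  # assumed tie convention
    return total


def jt_sorted(groups):
    """Same statistic, using sorting plus binary search instead of pair loops.

    For a fixed number of groups this runs in O(n log n) time.
    """
    total = 0.0
    seen = []  # values from all groups that come earlier in the ordering
    for group in groups:
        seen.sort()
        for y in group:
            below = bisect_left(seen, y)           # earlier values strictly less than y
            tied = bisect_right(seen, y) - below   # earlier values equal to y
            total += below + 0.5 * tied
        seen.extend(group)
    return total
```

Under these assumptions, both functions return 11.0 for `[[1, 2], [2, 3], [3, 4]]`; the second one reaches that value without enumerating individual pairs, which is the kind of saving the abstract describes.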

Published In

BMC Bioinformatics

DOI

10.1186/s12859-019-2869-3

EISSN

1471-2105

Publication Date

June 13, 2019

Volume

20

Issue

1

Start / End Page

333

Location

England

Related Subject Headings

  • Quantitative Trait, Heritable
  • Polymorphism, Single Nucleotide
  • Machine Learning
  • Genome-Wide Association Study
  • Computer Simulation
  • Blood Proteins
  • Bioinformatics
  • Algorithms
  • 49 Mathematical sciences
  • 46 Information and computing sciences
 

Citation

APA: Lin, J., Sibley, A., Shterev, I., Nixon, A., Innocenti, F., Chan, C., & Owzar, K. (2019). fastJT: An R package for robust and efficient feature selection for machine learning and genome-wide association studies. BMC Bioinformatics, 20(1), 333. https://doi.org/10.1186/s12859-019-2869-3
Chicago: Lin, Jiaxing, Alexander Sibley, Ivo Shterev, Andrew Nixon, Federico Innocenti, Cliburn Chan, and Kouros Owzar. “fastJT: An R package for robust and efficient feature selection for machine learning and genome-wide association studies.” BMC Bioinformatics 20, no. 1 (June 13, 2019): 333. https://doi.org/10.1186/s12859-019-2869-3.
ICMJE: Lin J, Sibley A, Shterev I, Nixon A, Innocenti F, Chan C, et al. fastJT: An R package for robust and efficient feature selection for machine learning and genome-wide association studies. BMC Bioinformatics. 2019 Jun 13;20(1):333.
MLA: Lin, Jiaxing, et al. “fastJT: An R package for robust and efficient feature selection for machine learning and genome-wide association studies.” BMC Bioinformatics, vol. 20, no. 1, June 2019, p. 333. Pubmed, doi:10.1186/s12859-019-2869-3.
NLM: Lin J, Sibley A, Shterev I, Nixon A, Innocenti F, Chan C, Owzar K. fastJT: An R package for robust and efficient feature selection for machine learning and genome-wide association studies. BMC Bioinformatics. 2019 Jun 13;20(1):333.