SparkScore: Leveraging Apache Spark for distributed genomic inference
The efficient score statistic is used extensively for inference on high-throughput genomic data because it is computationally efficient and accommodates both simple and complex phenotypes. Inference based on these statistics can readily incorporate a priori knowledge from a vast collection of bioinformatics databases to further refine the analyses. The sampling distribution of the efficient score statistic is typically approximated using asymptotics. Because this approximation may be inappropriate when the study size is small or the variants are uncommon or rare, resampling methods are often used to approximate the exact sampling distribution. We propose SparkScore, a set of distributed computational algorithms implemented in Apache Spark, to leverage the embarrassingly parallel nature of genomic resampling inference based on efficient score statistics. We illustrate the application of this computational approach to the analysis of data from genome-wide association studies (GWAS). This computational approach also harnesses the fault-tolerant features of Spark and can be readily extended to the analysis of DNA and RNA sequencing data, including expression quantitative trait loci (eQTL) and phenotype association studies.
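To make the resampling idea concrete, the following is a minimal single-machine sketch, not the SparkScore implementation. It assumes the simplest case: a quantitative phenotype, no covariates, and a score statistic of the form U = sum_i g_i (y_i - mean(y)), with a permutation p-value obtained by shuffling the phenotype. The function names, the toy genotypes, and the variant labels are all hypothetical. Each variant is processed independently of the others, which is the embarrassingly parallel structure the abstract refers to.

```python
import random


def score_statistic(genotypes, phenotypes):
    """Score statistic for one variant under an assumed linear model
    with no covariates: U = sum_i g_i * (y_i - mean(y))."""
    y_bar = sum(phenotypes) / len(phenotypes)
    return sum(g * (y - y_bar) for g, y in zip(genotypes, phenotypes))


def permutation_pvalue(genotypes, phenotypes, n_perm=999, seed=0):
    """Two-sided permutation p-value: shuffle the phenotype, recompute U,
    and count how often |U*| is at least as large as the observed |U|."""
    rng = random.Random(seed)
    observed = abs(score_statistic(genotypes, phenotypes))
    shuffled = list(phenotypes)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(score_statistic(genotypes, shuffled)) >= observed:
            exceed += 1
    # The +1 terms count the observed data as one of the permutations,
    # so the p-value is never exactly zero.
    return (exceed + 1) / (n_perm + 1)


# Hypothetical toy data: 8 subjects, genotypes coded as minor-allele counts.
phenotypes = [1.2, 0.7, 3.1, 2.8, 0.9, 3.4, 1.1, 2.9]
variants = {
    "snp_assoc": [0, 0, 2, 2, 0, 2, 0, 2],  # tracks the phenotype closely
    "snp_other": [0, 1, 0, 1, 1, 0, 1, 0],  # arbitrary pattern
}

# Each variant's p-value depends only on that variant's genotypes and the
# shared phenotype vector, so this loop could be distributed across workers.
p_values = {
    name: permutation_pvalue(genotypes, phenotypes)
    for name, genotypes in variants.items()
}

for name, p in p_values.items():
    print(name, p)
```

In a Spark setting, the per-variant loop would plausibly become a `map` over a distributed collection of variants, with the phenotype vector shared across workers; that mapping is an assumption about how such a computation could be organized rather than a description of the published algorithms.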