SparkScore: Leveraging Apache Spark for distributed genomic inference

Conference Paper

The method of the efficient score statistic is used extensively to conduct inference for high throughput genomic data due to its computational efficiency and ability to accommodate simple and complex phenotypes. Inference based on these statistics can readily incorporate a priori knowledge from a vast collection of bioinformatics databases to further refine the analyses. The sampling distribution of the efficient score statistic is typically approximated using asymptotics. As this may be inappropriate in the context of small study size, or uncommon or rare variants, resampling methods are often used to approximate the exact sampling distribution. We propose SparkScore, a set of distributed computational algorithms implemented in Apache Spark, to leverage the embarrassingly parallel nature of genomic resampling inference on the basis of the efficient score statistics. We illustrate the application of this computational approach for the analysis of data from genome-wide analysis studies (GWAS). This computational approach also harnesses the fault-tolerant features of Spark and can be readily extended to analysis of DNA and RNA sequencing data, including expression quantitative trait loci (eQTL) and phenotype association studies.
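
To make the resampling idea in the abstract concrete, the following is a minimal single-machine sketch, not the SparkScore implementation. It assumes a quantitative phenotype and a linear model, for which the efficient score under the null hypothesis of no association reduces to U = sum_i g_i (y_i - mean(y)). The function names (`score_statistic`, `permutation_p_value`, `analyze_variants`) and the toy data layout are hypothetical, chosen only for illustration. Each variant is processed independently, which is the step a distributed engine such as Spark could spread across workers.

```python
import random


def score_statistic(genotypes, phenotypes):
    """Score statistic under the null of no association (linear model assumed):
    U = sum_i g_i * (y_i - mean(y))."""
    mean_y = sum(phenotypes) / len(phenotypes)
    return sum(g * (y - mean_y) for g, y in zip(genotypes, phenotypes))


def permutation_p_value(genotypes, phenotypes, n_perm=999, seed=0):
    """Two-sided permutation p-value for one variant.

    The phenotype labels are shuffled n_perm times and the score is recomputed
    each time, approximating the sampling distribution without asymptotics.
    """
    rng = random.Random(seed)
    observed = abs(score_statistic(genotypes, phenotypes))
    shuffled = list(phenotypes)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(score_statistic(genotypes, shuffled)) >= observed:
            exceed += 1
    # The +1 terms count the observed statistic itself, so the p-value is never 0.
    return (exceed + 1) / (n_perm + 1)


def analyze_variants(variants, phenotypes, n_perm=999, seed=0):
    """Compute a permutation p-value for every variant.

    Variants do not depend on each other, so this loop is the embarrassingly
    parallel part; here it runs serially for simplicity.
    """
    return {
        name: permutation_p_value(genotypes, phenotypes, n_perm, seed)
        for name, genotypes in variants.items()
    }


if __name__ == "__main__":
    # Toy data (hypothetical): minor-allele counts for 12 subjects at 2 variants.
    phenotypes = [1, 2, 1, 2, 1, 5, 6, 5, 9, 10, 9, 10]
    variants = {
        "snp_a": [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2],
        "snp_b": [1, 0, 2, 0, 1, 0, 2, 1, 0, 1, 0, 2],
    }
    for name, p in analyze_variants(variants, phenotypes).items():
        print(name, p)
```

Because every variant here reuses the same seed, all variants see the same sequence of phenotype permutations; that is a design choice of this sketch and not something the abstract specifies.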

Cited Authors

  • Bahmani, A; Sibley, AB; Parsian, M; Owzar, K; Mueller, F

Published Date

  • July 18, 2016

Published In

  • Proceedings 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Start / End Page

  • 435 - 442

International Standard Book Number 13 (ISBN-13)

  • 9781509021406

Digital Object Identifier (DOI)

  • 10.1109/IPDPSW.2016.6

Citation Source

  • Scopus