Scholars@Duke publication: SparkScore: Leveraging apache spark for distributed genomic inference

SparkScore: Leveraging apache spark for distributed genomic inference

Publication, Conference

Bahmani, A; Sibley, AB; Parsian, M; Owzar, K; Mueller, F

Published in: Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

July 18, 2016

The method of the efficient score statistic is used extensively to conduct inference for high throughput genomic data due to its computational efficiency and ability to accommodate simple and complex phenotypes. Inference based on these statistics can readily incorporate a priori knowledge from a vast collection of bioinformatics databases to further refine the analyses. The sampling distribution of the efficient score statistic is typically approximated using asymptotics. As this may be inappropriate in the context of small study size, or uncommon or rare variants, resampling methods are often used to approximate the exact sampling distribution. We propose SparkScore, a set of distributed computational algorithms implemented in Apache Spark, to leverage the embarrassingly parallel nature of genomic resampling inference on the basis of the efficient score statistics. We illustrate the application of this computational approach for the analysis of data from genome-wide analysis studies (GWAS). This computational approach also harnesses the fault-tolerant features of Spark and can be readily extended to analysis of DNA and RNA sequencing data, including expression quantitative trait loci (eQTL) and phenotype association studies.
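The resampling idea in the abstract can be illustrated with a small sketch. This is not the SparkScore implementation, which is written for Apache Spark. It is a plain-Python illustration under stated assumptions: a quantitative phenotype with no covariates, for which the score statistic for one variant reduces to U = sum_i g_i (y_i - mean(y)), and a permutation scheme for the resampling step. The function names, the toy genotypes, and the toy phenotype are all hypothetical.

```python
import random


def score_statistic(genotypes, phenotype):
    """Score statistic U = sum_i g_i * (y_i - mean(y)) for one variant.

    Assumes a quantitative phenotype and no covariates (an illustrative
    simplification, not necessarily the form used in the paper).
    """
    mean_y = sum(phenotype) / len(phenotype)
    return sum(g * (y - mean_y) for g, y in zip(genotypes, phenotype))


def permutation_p_value(genotypes, phenotype, n_resamples=999, seed=0):
    """Two-sided permutation p-value for a single variant.

    The phenotype is shuffled n_resamples times; the p-value is the share
    of resampled statistics at least as extreme as the observed one, with
    the usual +1 correction so the estimate is never exactly zero.
    """
    rng = random.Random(seed)
    observed = abs(score_statistic(genotypes, phenotype))
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(shuffled)
        if abs(score_statistic(genotypes, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


# Hypothetical toy data: minor-allele counts (0, 1, 2) for ten subjects.
variants = {
    "snp_a": [0, 0, 0, 0, 1, 1, 1, 2, 2, 2],
    "snp_b": [1, 0, 2, 0, 1, 2, 0, 1, 0, 2],
}
phenotype = [1.0, 1.2, 0.9, 1.1, 2.0, 2.1, 1.9, 3.0, 3.2, 2.9]

# Each variant is processed independently of the others, which is the
# "embarrassingly parallel" structure a distributed engine can exploit.
p_values = {
    name: permutation_p_value(g, phenotype) for name, g in variants.items()
}
```

Because no variant's computation depends on another's, the dictionary comprehension at the end could in principle be replaced by a distributed map over variants, which is the kind of structure the abstract describes Spark as exploiting.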

Duke Scholars

Author Kouros Owzar Biostatistics & Bioinformatics, Division of Integrative Geno ...

Published In

Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

DOI

10.1109/IPDPSW.2016.6

ISBN

9781509021406

Publication Date

July 18, 2016

Start / End Page

435 / 442

Citation

APA: Bahmani, A., Sibley, A. B., Parsian, M., Owzar, K., & Mueller, F. (2016). SparkScore: Leveraging apache spark for distributed genomic inference. In Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016 (pp. 435–442). https://doi.org/10.1109/IPDPSW.2016.6

Chicago: Bahmani, A., A. B. Sibley, M. Parsian, K. Owzar, and F. Mueller. “SparkScore: Leveraging apache spark for distributed genomic inference.” In Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016, 435–42, 2016. https://doi.org/10.1109/IPDPSW.2016.6.

ICMJE: Bahmani A, Sibley AB, Parsian M, Owzar K, Mueller F. SparkScore: Leveraging apache spark for distributed genomic inference. In: Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016. 2016. p. 435–42.

MLA: Bahmani, A., et al. “SparkScore: Leveraging apache spark for distributed genomic inference.” Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016, 2016, pp. 435–42. Scopus, doi:10.1109/IPDPSW.2016.6.

NLM: Bahmani A, Sibley AB, Parsian M, Owzar K, Mueller F. SparkScore: Leveraging apache spark for distributed genomic inference. Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016. 2016. p. 435–442.
