A divide-and-conquer method for sparse risk prediction and evaluation.

Publication, Journal Article
Hong, C; Wang, Y; Cai, T
Published in: Biostatistics
April 13, 2022

Divide-and-conquer (DAC) is a commonly used strategy to overcome the challenges of extraordinarily large data, by first breaking the dataset into a series of data blocks and then combining the results from the individual blocks to obtain a final estimate. Various DAC algorithms have been proposed to fit a sparse predictive regression model in the $L_1$ regularization setting. However, many existing DAC algorithms remain computationally intensive when the sample size and the number of candidate predictors are both large. In addition, no existing DAC procedure provides inference for quantifying the accuracy of risk prediction models. In this article, we propose a screening and one-step linearization infused DAC (SOLID) algorithm to fit sparse logistic regression to massive datasets, by integrating the DAC strategy with a screening step and sequences of linearization. This enables us to maximize the likelihood with only the selected covariates and to perform penalized estimation via a fast approximation to the likelihood. To assess the accuracy of a predictive regression model, we develop a modified cross-validation (MCV) procedure that reuses the by-products of SOLID, substantially reducing the computational burden. Among DAC methods, the MCV procedure is the first to provide inference on prediction accuracy. Extensive simulation studies suggest that the proposed SOLID and MCV procedures substantially outperform existing methods with respect to computational speed and achieve statistical efficiency similar to that of the full-sample estimator. We also demonstrate that the proposed inference procedure provides valid interval estimators. We apply the proposed SOLID procedure to develop and validate a classification model for disease diagnosis using narrative clinical notes based on electronic medical record data from Partners HealthCare.
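
To make the general idea concrete, the following is a minimal sketch of a generic divide-and-conquer, one-step estimator for logistic regression followed by an $L_1$-penalized fit to the resulting quadratic approximation. It is not the authors' SOLID algorithm: it omits the screening step and the MCV procedure, fits no intercept, and penalizes all coefficients. The function names, the block count, the tuning value `lam`, and the simulated data are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score_and_information(X, y, beta):
    """Gradient and Hessian of the summed negative log-likelihood on one block."""
    mu = sigmoid(X @ beta)
    grad = X.T @ (mu - y)
    hess = (X * (mu * (1.0 - mu))[:, None]).T @ X
    return grad, hess

def newton_logistic(X, y, n_iter=25):
    """Unpenalized logistic MLE by Newton-Raphson (no intercept, for brevity)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad, hess = score_and_information(X, y, beta)
        # Tiny ridge term only guards against a numerically singular Hessian.
        beta -= np.linalg.solve(hess + 1e-8 * np.eye(beta.size), grad)
    return beta

def dac_one_step(X, y, n_blocks=10):
    """Initial fit on one block, then a single Newton step that uses
    gradients and Hessians accumulated block by block (assumed scheme)."""
    n, p = X.shape
    blocks = np.array_split(np.arange(n), n_blocks)
    beta0 = newton_logistic(X[blocks[0]], y[blocks[0]])
    grad, hess = np.zeros(p), np.zeros((p, p))
    for idx in blocks:  # each block could be processed on a separate worker
        g, h = score_and_information(X[idx], y[idx], beta0)
        grad += g
        hess += h
    beta1 = beta0 - np.linalg.solve(hess, grad)
    return beta1, hess / n

def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_on_quadratic(beta1, A, lam, n_sweeps=200):
    """Coordinate descent for 0.5*(b-beta1)' A (b-beta1) + lam*||b||_1."""
    b = beta1.copy()
    for _ in range(n_sweeps):
        for j in range(b.size):
            r = A[j] @ (beta1 - b) + A[j, j] * b[j]
            b[j] = soft_threshold(r, lam) / A[j, j]
    return b

def simulate(n=50_000, p=10, seed=0):
    """Hypothetical test data: three nonzero coefficients, the rest zero."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    beta[:3] = [1.0, -1.0, 0.5]
    X = rng.standard_normal((n, p))
    y = rng.binomial(1, sigmoid(X @ beta)).astype(float)
    return X, y, beta
```

A typical call would be `X, y, _ = simulate()`, then `beta1, A = dac_one_step(X, y)` and `b = lasso_on_quadratic(beta1, A, lam=0.02)`. Because the penalized step works only with the $p \times p$ summary `A` and the one-step estimate, it never revisits the raw data, which is the kind of saving the abstract attributes to linearization.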

Published In

Biostatistics

DOI

10.1093/biostatistics/kxaa031

EISSN

1468-4357

Publication Date

April 13, 2022

Volume

23

Issue

2

Start / End Page

397 / 411

Location

England

Related Subject Headings

  • Statistics & Probability
  • Research Design
  • Logistic Models
  • Humans
  • Computer Simulation
  • Algorithms
  • 4905 Statistics
  • 0604 Genetics
  • 0104 Statistics
 

Citation

APA: Hong, C., Wang, Y., & Cai, T. (2022). A divide-and-conquer method for sparse risk prediction and evaluation. Biostatistics, 23(2), 397–411. https://doi.org/10.1093/biostatistics/kxaa031
Chicago: Hong, Chuan, Yan Wang, and Tianxi Cai. “A divide-and-conquer method for sparse risk prediction and evaluation.” Biostatistics 23, no. 2 (April 13, 2022): 397–411. https://doi.org/10.1093/biostatistics/kxaa031.
ICMJE: Hong C, Wang Y, Cai T. A divide-and-conquer method for sparse risk prediction and evaluation. Biostatistics. 2022 Apr 13;23(2):397–411.
MLA: Hong, Chuan, et al. “A divide-and-conquer method for sparse risk prediction and evaluation.” Biostatistics, vol. 23, no. 2, Apr. 2022, pp. 397–411. Pubmed, doi:10.1093/biostatistics/kxaa031.
NLM: Hong C, Wang Y, Cai T. A divide-and-conquer method for sparse risk prediction and evaluation. Biostatistics. 2022 Apr 13;23(2):397–411.