Supervised compression of big data

Publication, Journal Article
Joseph, VR; Mak, S
Published in: Statistical Analysis and Data Mining
June 1, 2021

The phenomenon of big data has become ubiquitous in nearly all disciplines, from science to engineering. A key challenge is the use of such data for fitting statistical and machine learning models, which can incur high computational and storage costs. One solution is to perform model fitting on a carefully selected subset of the data. Various data reduction methods have been proposed in the literature, ranging from random subsampling to optimal experimental design-based methods. However, when the goal is to learn the underlying input–output relationship, such reduction methods may not be ideal, since they do not make use of information contained in the output. To this end, we propose a supervised data compression method called supercompress, which integrates output information by sampling data from regions most important for modeling the desired input–output relationship. An advantage of supercompress is that it is nonparametric: the compression method does not rely on parametric modeling assumptions between inputs and output. As a result, the proposed method is robust to a wide range of modeling choices. We demonstrate the usefulness of supercompress over existing data reduction methods, in both simulations and a taxicab predictive modeling application.

Published In

Statistical Analysis and Data Mining

DOI

10.1002/sam.11508

EISSN

1932-1872

ISSN

1932-1864

Publication Date

June 1, 2021

Volume

14

Issue

3

Start / End Page

217 / 229

Related Subject Headings

  • 4905 Statistics
  • 4605 Data management and data science
  • 0104 Statistics
 

Citation

APA: Joseph, V. R., & Mak, S. (2021). Supervised compression of big data. Statistical Analysis and Data Mining, 14(3), 217–229. https://doi.org/10.1002/sam.11508
Chicago: Joseph, V. R., and S. Mak. “Supervised compression of big data.” Statistical Analysis and Data Mining 14, no. 3 (June 1, 2021): 217–29. https://doi.org/10.1002/sam.11508.
ICMJE: Joseph VR, Mak S. Supervised compression of big data. Statistical Analysis and Data Mining. 2021 Jun 1;14(3):217–29.
MLA: Joseph, V. R., and S. Mak. “Supervised compression of big data.” Statistical Analysis and Data Mining, vol. 14, no. 3, June 2021, pp. 217–29. Scopus, doi:10.1002/sam.11508.
NLM: Joseph VR, Mak S. Supervised compression of big data. Statistical Analysis and Data Mining. 2021 Jun 1;14(3):217–229.