CDA: Cost-Sensitive Data Acquisition for Incomplete Datasets

Publication, Conference
Li, K; Yu, X; Pei, J
Published in: Proceedings International Conference on Data Engineering
January 1, 2025

This paper introduces the novel concept of cost-sensitive data acquisition (CDA), a desirable addition to the data preparation process in a data science pipeline that focuses on strategically acquiring data from various priced sources, such as data markets, under budget constraints. CDA improves data quality by identifying the best set of values to acquire and integrating them into incomplete datasets, optimizing a particular objective defined in the resulting tables (data products). This paper focuses on CDA for a single relational table while also exploring possible extensions to multi-table contexts. First, we introduce an algorithm that utilizes conformal risk control to select rows likely to be included in the data product with probabilistic guarantees. We then investigate ways to acquire data to complete these rows under various CDA scenarios. We start with a scenario where data records are available on a row-wise basis, which proves to be an NP-hard problem. To solve this problem, we introduce an efficient row-wise greedy algorithm (RGreedy), which approaches an approximation ratio of 1. Subsequently, we explore a more generic scenario where each unit of data for acquisition may involve multiple records with a subset of the attributes. We propose a coverage minimum option selection (CMOS) algorithm for its solution, focusing on scalability. Through empirical evaluations on three real-world datasets and one synthetic dataset, we demonstrate that our methods yield performance improvements of 20% to 40% over applicable baselines.
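To make the row-wise, budget-constrained setting described in the abstract more concrete, here is a minimal sketch of a generic ratio-based greedy heuristic for choosing which rows to buy. It is not the paper's RGreedy algorithm, whose details the abstract does not give. The `Offer` type, its `price` and `gain` fields, and the `greedy_acquire` function are hypothetical names introduced only for illustration, and the sketch assumes each row's contribution to the objective is known in advance and additive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """Hypothetical purchasable row: not a structure defined in the paper."""
    row_id: str
    price: float  # cost of acquiring the row's missing values (assumed > 0)
    gain: float   # assumed, additive contribution to the objective


def greedy_acquire(offers, budget):
    """Pick offers in decreasing gain-per-price order while the budget allows.

    Generic knapsack-style heuristic; it carries no approximation guarantee
    in this form and is only an illustration of the problem setting.
    """
    if any(o.price <= 0 for o in offers):
        raise ValueError("every offer must have a positive price")

    # Rank by value density so cheap, useful rows are considered first.
    ranked = sorted(offers, key=lambda o: o.gain / o.price, reverse=True)

    chosen, spent = [], 0.0
    for offer in ranked:
        # Skip offers that no longer fit; later, cheaper ones may still fit.
        if spent + offer.price <= budget:
            chosen.append(offer)
            spent += offer.price
    return chosen, spent
```

Calling `greedy_acquire(offers, budget)` returns the selected offers and the total amount spent. A real CDA objective is defined on the resulting data product and need not be additive across rows, so this sketch should be read as a picture of the budget trade-off rather than as a solution method.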

Published In

Proceedings International Conference on Data Engineering

DOI

10.1109/ICDE65448.2025.00120

EISSN

2375-0286

ISSN

1084-4627

Publication Date

January 1, 2025

Start / End Page

1551 / 1564
 

Citation

APA: Li, K., Yu, X., & Pei, J. (2025). CDA: Cost-Sensitive Data Acquisition for Incomplete Datasets. In Proceedings International Conference on Data Engineering (pp. 1551–1564). https://doi.org/10.1109/ICDE65448.2025.00120
Chicago: Li, K., X. Yu, and J. Pei. “CDA: Cost-Sensitive Data Acquisition for Incomplete Datasets.” In Proceedings International Conference on Data Engineering, 1551–64, 2025. https://doi.org/10.1109/ICDE65448.2025.00120.
ICMJE: Li K, Yu X, Pei J. CDA: Cost-Sensitive Data Acquisition for Incomplete Datasets. In: Proceedings International Conference on Data Engineering. 2025. p. 1551–64.
MLA: Li, K., et al. “CDA: Cost-Sensitive Data Acquisition for Incomplete Datasets.” Proceedings International Conference on Data Engineering, 2025, pp. 1551–64. Scopus, doi:10.1109/ICDE65448.2025.00120.
NLM: Li K, Yu X, Pei J. CDA: Cost-Sensitive Data Acquisition for Incomplete Datasets. Proceedings International Conference on Data Engineering. 2025. p. 1551–1564.
