Scholars@Duke publication: CDA: Cost-Sensitive Data Acquisition for Incomplete Datasets

CDA: Cost-Sensitive Data Acquisition for Incomplete Datasets

Publication, Conference

Li, K; Yu, X; Pei, J

Published in: Proceedings International Conference on Data Engineering

January 1, 2025

This paper introduces the novel concept of cost-sensitive data acquisition (CDA), a desirable addition to the data preparation process in a data science pipeline that focuses on strategically acquiring data from various priced sources, such as data markets, under budget constraints. CDA improves data quality by identifying the best set of values to acquire and integrating them into incomplete datasets, optimizing a particular objective defined in the resulting tables (data products). This paper focuses on CDA for a single relational table while also exploring possible extensions to multi-table contexts. First, we introduce an algorithm that utilizes conformal risk control to select rows likely to be included in the data product with probabilistic guarantees. We then investigate ways to acquire data to complete these rows under various CDA scenarios. We start with a scenario where data records are available on a row-wise basis, which proves to be an NP-hard problem. To solve this problem, we introduce an efficient row-wise greedy algorithm (RGreedy), which approaches an approximation ratio of 1. Subsequently, we explore a more generic scenario where each unit of data for acquisition may involve multiple records with a subset of the attributes. We propose a coverage minimum option selection (CMOS) algorithm for its solution, focusing on scalability. Through empirical evaluations on three real-world datasets and one synthetic dataset, we demonstrate that our methods yield performance improvements of 20% to 40% over applicable baselines.
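To make the row-wise, budget-constrained setting concrete, the following is a minimal sketch of a generic gain-per-price greedy heuristic. It is not the paper's RGreedy algorithm, whose details are not given in the abstract. The names `RowOffer`, `price`, `gain`, and `greedy_acquire` are hypothetical and introduced only for illustration, and the sketch assumes every price is strictly positive and that each row's gain can be estimated independently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowOffer:
    """Hypothetical offer: complete one incomplete row for a fixed price."""
    row_id: str
    price: float  # assumed > 0: cost of buying the row's missing values
    gain: float   # assumed estimate of the row's contribution to the objective


def greedy_acquire(offers, budget):
    """Select rows in descending gain/price order while the budget allows.

    Returns the chosen row ids and the total amount spent. This is a
    generic knapsack-style heuristic, not the method from the paper.
    """
    chosen = []
    remaining = budget
    ranked = sorted(offers, key=lambda o: o.gain / o.price, reverse=True)
    for offer in ranked:
        if offer.price <= remaining:  # skip offers that no longer fit
            chosen.append(offer.row_id)
            remaining -= offer.price
    return chosen, budget - remaining


offers = [
    RowOffer("A", price=4.0, gain=8.0),
    RowOffer("B", price=3.0, gain=3.0),
    RowOffer("C", price=5.0, gain=9.0),
    RowOffer("D", price=2.0, gain=1.0),
]
chosen, spent = greedy_acquire(offers, budget=10.0)
```

With these made-up offers and a budget of 10, the heuristic picks rows A and C and spends 9, since neither remaining offer fits in the leftover budget of 1.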

Duke Scholars

Author: Jian Pei, Computer Science

Published In

Proceedings International Conference on Data Engineering

DOI

10.1109/ICDE65448.2025.00120

EISSN

2375-0286

ISSN

1084-4627

Publication Date

January 1, 2025

Start / End Page

1551 / 1564

Citation

APA: Li, K., Yu, X., & Pei, J. (2025). CDA: Cost-Sensitive Data Acquisition for Incomplete Datasets. In Proceedings International Conference on Data Engineering (pp. 1551–1564). https://doi.org/10.1109/ICDE65448.2025.00120

Chicago: Li, K., X. Yu, and J. Pei. “CDA: Cost-Sensitive Data Acquisition for Incomplete Datasets.” In Proceedings International Conference on Data Engineering, 1551–64, 2025. https://doi.org/10.1109/ICDE65448.2025.00120.

ICMJE: Li K, Yu X, Pei J. CDA: Cost-Sensitive Data Acquisition for Incomplete Datasets. In: Proceedings International Conference on Data Engineering. 2025. p. 1551–64.

MLA: Li, K., et al. “CDA: Cost-Sensitive Data Acquisition for Incomplete Datasets.” Proceedings International Conference on Data Engineering, 2025, pp. 1551–64. Scopus, doi:10.1109/ICDE65448.2025.00120.

NLM: Li K, Yu X, Pei J. CDA: Cost-Sensitive Data Acquisition for Incomplete Datasets. Proceedings International Conference on Data Engineering. 2025. p. 1551–1564.
