CDA: Cost-Sensitive Data Acquisition for Incomplete Datasets
This paper introduces the novel concept of cost-sensitive data acquisition (CDA), a desirable addition to the data preparation process in a data science pipeline that focuses on strategically acquiring data from various priced sources, such as data markets, under budget constraints. CDA improves data quality by identifying the best set of values to acquire and integrating them into incomplete datasets, optimizing a particular objective defined in the resulting tables (data products). This paper focuses on CDA for a single relational table while also exploring possible extensions to multi-table contexts. First, we introduce an algorithm that utilizes conformal risk control to select rows likely to be included in the data product with probabilistic guarantees. We then investigate ways to acquire data to complete these rows under various CDA scenarios. We start with a scenario where data records are available on a row-wise basis, which proves to be an NP-hard problem. To solve this problem, we introduce an efficient row-wise greedy algorithm (RGreedy), which approaches an approximation ratio of 1. Subsequently, we explore a more generic scenario where each unit of data for acquisition may involve multiple records with a subset of the attributes. We propose a coverage minimum option selection (CMOS) algorithm for its solution, focusing on scalability. Through empirical evaluations on three real-world datasets and one synthetic dataset, we demonstrate that our methods yield performance improvements of 20 % to 40 % over applicable baselines.