Scholars@Duke publication: Computing Data Distribution from Query Selectivities

Computing Data Distribution from Query Selectivities

Publication , Conference

Agarwal, PK; Raychaudhury, R; Sintos, S; Yang, J

Published in: Leibniz International Proceedings in Informatics Lipics

March 1, 2024

We are given a set Z = {(R1, s1), ..., (Rn, sn)}, where each Ri is a range in R^d, such as rectangle or ball, and si ∈ [0, 1] denotes its selectivity. The goal is to compute a small-size discrete data distribution D = {(q1, w1), ..., (qm, wm)}, where qj ∈ R^d and wj ∈ [0, 1] for each 1 ≤ j ≤ m, and ∑1≤j≤m wj = 1, such that D is the most consistent with Z, i.e., (Eqaution presented) is minimized. In a database setting, Z corresponds to a workload of range queries over some table, together with their observed selectivities (i.e., fraction of tuples returned), and D can be used as compact model for approximating the data distribution within the table without accessing the underlying contents. In this paper, we obtain both upper and lower bounds for this problem. In particular, we show that the problem of finding the best data distribution from selectivity queries is NP-complete. On the positive side, we describe a Monte Carlo algorithm that constructs, in time O((n+δ^−d)δ⁻² polylog n), a discrete distribution (Eqaution presented) of size O(δ⁻²), such that errp((Eqaution presented), Z) ≤ minD errp(D, Z) + δ (for p = 1, 2, ∞) where the minimum is taken over all discrete distributions. We also establish conditional lower bounds, which strongly indicate the infeasibility of relative approximations as well as removal of the exponential dependency on the dimension for additive approximations. This suggests that significant improvements to our algorithm are unlikely.

Duke Scholars

Author Jun Yang Computer Science

Author Pankaj K. Agarwal Computer Science

Published In

Leibniz International Proceedings in Informatics Lipics

DOI

10.4230/LIPIcs.ICDT.2024.18

ISSN

1868-8969

Publication Date

March 1, 2024

Volume

290

Related Subject Headings

46 Information and computing sciences

Citation

APA

Chicago

ICMJE

MLA

NLM

Agarwal, P. K., Raychaudhury, R., Sintos, S., & Yang, J. (2024). Computing Data Distribution from Query Selectivities. In Leibniz International Proceedings in Informatics Lipics (Vol. 290). https://doi.org/10.4230/LIPIcs.ICDT.2024.18

Agarwal, P. K., R. Raychaudhury, S. Sintos, and J. Yang. “Computing Data Distribution from Query Selectivities.” In Leibniz International Proceedings in Informatics Lipics, Vol. 290, 2024. https://doi.org/10.4230/LIPIcs.ICDT.2024.18.

Agarwal PK, Raychaudhury R, Sintos S, Yang J. Computing Data Distribution from Query Selectivities. In: Leibniz International Proceedings in Informatics Lipics. 2024.

Agarwal, P. K., et al. “Computing Data Distribution from Query Selectivities.” Leibniz International Proceedings in Informatics Lipics, vol. 290, 2024. Scopus, doi:10.4230/LIPIcs.ICDT.2024.18.

Agarwal PK, Raychaudhury R, Sintos S, Yang J. Computing Data Distribution from Query Selectivities. Leibniz International Proceedings in Informatics Lipics. 2024.

Published In

Leibniz International Proceedings in Informatics Lipics

DOI

10.4230/LIPIcs.ICDT.2024.18

ISSN

1868-8969

Publication Date

March 1, 2024

Volume

290

Related Subject Headings

46 Information and computing sciences