Conference · Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · August 3, 2025
Programmatic weak supervision (PWS) significantly reduces human effort for labeling data by combining the outputs of user-provided labeling functions (LFs) on unlabeled datapoints. However, the quality of the generated labels depends directly on the accura ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · June 22, 2025
Group-by-average SQL queries are a cornerstone of data analysis, often employed to uncover patterns and trends within datasets. However, interpreting the results of these queries can be challenging and time-intensive, particularly when working with large, ...
Journal Article · VLDB Journal · March 1, 2025
Differential privacy (DP) is the state-of-the-art and rigorous notion of privacy for answering aggregate database queries while preserving the privacy of sensitive information in the data. In today’s era of data analysis, however, it poses new challenges f ...
Journal Article · Proceedings of the VLDB Endowment · January 1, 2025
Datasets may include errors, and specifically violations of integrity constraints, for various reasons. Standard techniques for "minimal cost" database repairing resolve these violations by aiming for a minimum change in the data, and in the process, may s ...
Journal Article · Journal of Statistical Software · January 1, 2025
dame-flame is a Python package for performing matching for observational causal inference on datasets containing discrete covariates. This package implements the dynamic almost matching exactly (DAME) and fast, large-scale almost matching exactly (FLAME) a ...
Conference · Proceedings of Machine Learning Research · January 1, 2025
Estimating causal effects in social network data presents unique challenges due to the presence of spillover effects and network-induced confounding. While much of the existing literature addresses causal inference in social networks, many methods rely on ...
Conference · Proceedings of the VLDB Endowment · January 1, 2025
Query optimizers rely heavily on selectivity estimates to choose efficient execution plans, but inaccuracies in these estimates often result in poor query performance. We introduce Hint-QPT (Hints for Robust Query Performance Tuning), an interactive tool d ...
Conference · ACM International Conference Proceeding Series · July 2, 2024
Declarative querying is a cornerstone of the success and longevity of database systems, yet it is challenging for novice learners accustomed to different coding paradigms. The transition is further hampered by a lack of query debugging tools compared to th ...
Journal Article · Annals of the Entomological Society of America · May 29, 2024
We describe a system called Qr-Hint that, given a (correct) target query Q* and a (wrong) working query Q, both expressed in SQL, provides actionable hints for the user to fix the working query so that it becomes semantically equivalent to the t ...
Conference · Proceedings of the AAAI Conference on Artificial Intelligence · March 25, 2024
After a person is arrested and charged with a crime, they may be released on bail and required to participate in a community supervision program while awaiting trial. These 'pretrial programs' are common throughout the United States, but very little resear ...
Conference · Leibniz International Proceedings in Informatics (LIPIcs) · March 1, 2024
Data analytics skills have become an indispensable part of any education that seeks to prepare its students for the modern workforce. Essential in this skill set is the ability to work with structured relational data. Relational queries are based on logic ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · June 5, 2023
Example database instances can be very helpful in understanding complex queries. Different examples may illustrate alternative situations in which answers emerge in the query results and can be useful for testing. Examples can also help reveal semantic dif ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · June 5, 2023
HILDA brings together researchers and practitioners to exchange ideas and results on human-data interaction. It explores how data management and analysis can be made more effective when taking into account the people who design and build these processes as ...
Conference · Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics · January 1, 2023
Causal inference is a fundamental concept that goes beyond simple correlation and model-based prediction analysis, and is highly relevant in domains such as health, medicine, and the social sciences. Causal inference enables the estimation of the impact of ...
Conference · Proceedings International Conference on Data Engineering · January 1, 2023
What-if and How-to queries are fundamental data analysis questions that provide insights about the effects of a hypothetical update without actually making changes to the database. Traditional systems assume independence across different tuples and non-up ...
Conference · Proceedings of the VLDB Endowment · January 1, 2023
Employing Differential Privacy (DP), the state-of-the-art privacy standard, to answer aggregate database queries poses new challenges for users to understand the trends and anomalies observed in the query results: Is the unexpected answer due to the data i ...
Journal Article · Proceedings of the VLDB Endowment · January 1, 2023
Synthetic data generation methods, and in particular, private synthetic data generation methods, are gaining popularity as a means to make copies of sensitive databases that can be shared widely for research and data analysis. Some of the fundamental opera ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · June 11, 2022
This paper explores the use of machine learning for estimating the selectivity of range queries in database systems. Using classic learning theory for real-valued functions based on shattering dimension, we show that the selectivity function of a range spa ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · June 11, 2022
A powerful way to understand a complex query is by observing how it operates on data instances. However, specific database instances are not ideal for such observations: they often include large amounts of superfluous details that are not only irrelevant t ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · June 11, 2022
What-if (provisioning for an update to a database) and how-to (how to modify the database to achieve a goal) analyses provide insights to users who wish to examine hypothetical scenarios without making actual changes to a database and thereby help plan str ...
Conference · Proceedings of the VLDB Endowment · January 1, 2022
In this work, we demonstrate CaJaDE (Context-Aware Join-Augmented Deep Explanations), a system that explains query results by augmenting provenance with contextual information from other related tables in the database. Given two query results whose differ ...
Conference · Proceedings of the VLDB Endowment · January 1, 2022
We live in a world dominated by data, where users from different fields routinely collect, study, and make decisions supported by data. To aid these users, the current trend in data analysis is to design tools that allow large-scale analytics, sophisticate ...
Conference · Proceedings of the VLDB Endowment · January 1, 2022
Differential privacy (DP) is the state-of-the-art and rigorous notion of privacy for answering aggregate database queries while preserving the privacy of sensitive information in the data. In today’s era of data analysis, however, it poses new challenges f ...
Journal Article · Foundations and Trends in Databases · August 2, 2021
Humans reason about the world around them by seeking to understand why and how something occurs. The same principle extends to the technology that so many human activities increasingly rely on. Issues of trust, transparency, and understandability are cr ...
Journal Article · January 5, 2021
dame-flame is a Python package for performing matching for observational causal inference on datasets containing discrete covariates. This package implements the Dynamic Almost Matching Exactly (DAME) and Fast Large-Scale Almost Matching Exactly (FLAME) al ...
Journal Article · Journal of Machine Learning Research · January 1, 2021
A classical problem in causal inference is that of matching, where treatment units need to be matched to control units based on covariate information. In this work, we propose a method that computes high quality almost-exact matches for high-dimensional ca ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · January 1, 2021
In many data analysis applications there is a need to explain why a surprising or interesting result was produced by a query. Previous approaches to explaining results have directly or indirectly relied on data provenance, i.e., input tuples contributing t ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · January 1, 2021
How should we quantify the inconsistency of a database that violates integrity constraints? Proper measures are important for various tasks, such as progress indication and action prioritization in cleaning systems, and reliability estimation for new datas ...
Journal Article · SIGMOD Record · December 9, 2020
The Future of Work (FoW) is witnessing an evolution where AI systems are used to the benefit of humans. Work here refers to all forms of paid and unpaid labor in both physical and virtual workplaces and that is enabled by AI systems. This covers crowdsourc ...
Journal Article · Proceedings of the ACM SIGMOD International Conference on Management of Data · June 14, 2020
Local sensitivity of a query Q given a database instance D, i.e. how much the output Q(D) changes when a tuple is added to D or deleted from D, has many applications including query analysis, outlier detection, and differential privacy. However, it is NP-h ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · June 14, 2020
Causal inference is at the heart of empirical research in natural and social sciences and is critical for scientific discovery and informed decision making. The gold standard in causal inference is performing randomized controlled trials; unfortunately th ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · June 14, 2020
We study the problem of database repairs through a rule-based framework that we refer to as Delta Rules. Delta rules are highly expressive and allow specifying complex, cross-relations repair logic associated with Denial Constraints, Causal Rules, and allo ...
Journal Article · ACM Transactions on Database Systems · February 17, 2020
We investigate the complexity of computing an optimal repair of an inconsistent database, in the case where integrity constraints are Functional Dependencies (FDs). We focus on two types of repairs: an optimal subset repair (optimal S-repair), which is obta ...
Conference · Proceedings of Machine Learning Research · January 1, 2020
We propose a matching method that recovers direct treatment effects from randomized experiments where units are connected in an observed network, and units that share edges can potentially influence each others' outcomes. Traditional treatment effect estim ...
Conference · Proceedings of Machine Learning Research · January 1, 2020
We propose a matching method for observational data that matches units with others in unit-specific, hyper-box-shaped regions of the covariate space. These regions are large enough that many matches are created for each unit and small enough that the treat ...
Conference · Proceedings of the VLDB Endowment · January 1, 2020
We study the problem of efficiently estimating counts for queries involving complex filters, such as user-defined functions, or predicates involving self-joins and correlated subqueries. For such queries, traditional sampling techniques may not be applicab ...
Journal Article · Proceedings of the VLDB Endowment · January 1, 2020
We investigate the computational complexity of minimizing the source side-effect in order to remove a given number of tuples from the output of a conjunctive query. This is a variant of the well-studied deletion propagation problem, the difference being th ...
Journal Article · Proceedings of the VLDB Endowment · January 1, 2020
We demonstrate I-REX, a system designed to help users understand SQL query evaluation and debug SQL queries. I-REX lets users interactively “trace” the evaluation of complex SQL queries, including those with correlated subqueries. I-REX also ex ...
Journal Article · Proceedings of the VLDB Endowment · January 1, 2020
We propose to demonstrate MuSe, a system for Database repairs where constraints are expressed as Declarative Rules and can be interpreted in different ways by using four different semantics. Our framework may capture common, cross-relation, repair semantic ...
Conference · Proceedings. ACM-SIGMOD International Conference on Management of Data · June 2019
We present a system called RATEST, designed to help debug relational queries against reference queries and test database instances. In many applications, e.g., classroom learning and regression testing, we test the correctness of a user query Q by e ...
Conference · Proceedings. ACM-SIGMOD International Conference on Management of Data · June 2019
Resource interference caused by concurrent queries is one of the key reasons for unpredictable performance and missed workload SLAs in cluster computing systems. Analyzing these inter-query resource interactions is critical in order to answer time-sensiti ...
Conference · Proceedings. ACM-SIGMOD International Conference on Management of Data · June 2019
Provenance and intervention-based techniques have been used to explain surprisingly high or low outcomes of aggregation queries. However, such techniques may miss interesting explanations emerging from data that is not in the provenance. For instanc ...
Journal Article · Proceedings. ACM-SIGMOD International Conference on Management of Data · June 2019
For testing the correctness of SQL queries, e.g., evaluating student submissions in a database course, a standard practice is to execute the query in question on some test database instance and compare its result with that of the correct query. Given two q ...
Journal Article · Proceedings of Machine Learning Research · April 2019
Matching methods are heavily used in the social and health sciences due to their interpretability. We aim to create the highest possible quality of treatment-control matches for categorical data in the potential outcomes framework. The method proposed in t ...
Conference · 35th Conference on Uncertainty in Artificial Intelligence, UAI 2019 · January 1, 2019
Uncertainty in the estimation of the causal effect in observational studies is often due to unmeasured confounding, i.e., the presence of unobserved covariates linking treatments and outcomes. Instrumental Variables (IV) are commonly used to reduce the eff ...
Conference · Proceedings of Machine Learning Research · January 1, 2019
Uncertainty in the estimation of the causal effect in observational studies is often due to unmeasured confounding, i.e., the presence of unobserved covariates linking treatments and outcomes. Instrumental Variables (IV) are commonly used to reduce the eff ...
Conference · Proceedings of Machine Learning Research · January 1, 2019
Matching methods are heavily used in the social and health sciences due to their interpretability. We aim to create the highest possible quality of treatment-control matches for categorical data in the potential outcomes framework. The method proposed in ...
Journal Article · June 18, 2018
We aim to create the highest possible quality of treatment-control matches for categorical data in the potential outcomes framework. Matching methods are heavily used in the social sciences due to their interpretability, but most matching methods do not pa ...
Conference · Proceedings. ACM-SIGMOD International Conference on Management of Data · June 2018
Methods for summarizing and diversifying query results have drawn significant attention recently, because they help present query results with lots of tuples to users in more informative ways. We present QAGView (Quick AGgregate View), which provides a hol ...
Conference · Proceedings. ACM-SIGMOD International Conference on Management of Data · June 2018
Unpredictability in query runtimes can arise in a shared cluster as a result of resource contentions caused by inter-query interactions. iQCAR - inter Query Contention AnalyzeR is a system that formally models these inter ...
Journal Article · Proceedings of the ... ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems · June 2018
We investigate the complexity of computing an optimal repair of an inconsistent database, in the case where integrity constraints are Functional Dependencies (FDs). We focus on two types of repairs: an optimal subset repair (optimal S-repair) that is obtai ...
Conference · Proceedings of the VLDB Endowment · January 1, 2018
We present a system for summarization and interactive exploration of high-valued aggregate query answers to make a large set of possible answers more informative to the user. Our system outputs a set of clusters on the high-valued query answers showing the ...
Conference · Proceedings of the VLDB Endowment · January 1, 2018
AI/ML is becoming a horizontal technology: its application is expanding to more domains, and its integration touches more parts of the technology stack. Given the strong dependence of ML on data, this expansion creates a new space for applying data managem ...
Conference · Proceedings of the VLDB Endowment · January 1, 2018
In this demonstration we showcase Cape, a system that explains surprising aggregation outcomes. In contrast to previous work, which relies exclusively on provenance, Cape explains outliers in aggregation queries through related outliers in the opposite dir ...
Conference · Proceedings of the VLDB Endowment · January 1, 2018
In this demonstration, we will present LensXPlain, an interactive system to help users understand answers of aggregate queries by providing meaningful explanations. Given a SQL group-by query and a question from a user "why output o is high/low", or "why ...
Journal Article · Theory of Computing Systems · July 1, 2017
In this paper, we study the complexity of answering conjunctive queries (CQ) with inequalities (≠). In particular, we are interested in comparing the complexity of the query with and without inequalities. The main contribution of our work is a novel combin ...
Conference · Proceedings of the ACM SIGMOD International Conference on Management of Data · May 9, 2017
Iceberg queries, commonly used for decision support, find groups whose aggregate values are above or below a threshold. In practice, iceberg queries are often posed over complex joins that are expensive to evaluate. This paper proposes a framework for comb ...
Conference · ACM Transactions on Database Systems · February 1, 2017
We prove exponential lower bounds on the running time of the state-of-the-art exact model counting algorithms, that is, algorithms for exactly computing the number of satisfying assignments, or the satisfying probability, of Boolean formulas. These algorithms can be ...
Chapter · January 1, 2016
With the increased generation and availability of big data in different domains, there is an imminent requirement for data analysis tools that are able to 'explain' the trends and anomalies obtained from this data to a range of users with different backgro ...
Journal Article · ACM Transactions on Database Systems · December 30, 2014
We study the problems of max/top-k and clustering when the comparison operations may be performed by oracles whose answer may be erroneous. Comparisons may either be of type or of value: given two data elements, the answer to a type comparison is "yes" if ...
Conference · Proceedings of the ACM SIGACT SIGMOD SIGART Symposium on Principles of Database Systems · January 1, 2011
Scientific workflow systems increasingly store provenance information about the module executions used to produce a data item, as well as the parameter settings and intermediate data items passed between module executions. However, authors/owners of workfl ...
Journal Article · Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics · January 1, 2006
Model Based Development (MBD) using Mathworks tools like Simulink, Stateflow etc. is being pursued in Honeywell for the development of safety critical avionics software. Formal verification techniques are well-known to identify design errors of safety crit ...