Sudeepa Roy

Associate Professor of Computer Science
Computer Science
Campus Box 90129, 308 Research Drive, Durham, NC 27708
LSRC D325, 308 Research Drive, Durham, NC 27708

Selected Publications


What Teaching Databases Taught Us about Researching Databases: Extended Talk Abstract

Conference ACM International Conference Proceeding Series · June 9, 2024 Declarative querying is a cornerstone of the success and longevity of database systems, yet it is challenging for novice learners accustomed to different coding paradigms. The transition is further hampered by a lack of query debugging tools compared to th ... Full text Cite

Evaluating Pre-trial Programs Using Interpretable Machine Learning Matching Algorithms for Causal Inference

Conference Proceedings of the AAAI Conference on Artificial Intelligence · March 25, 2024 After a person is arrested and charged with a crime, they may be released on bail and required to participate in a community supervision program while awaiting trial. These 'pretrial programs' are common throughout the United States, but very little resear ... Full text Cite

How Database Theory Helps Teach Relational Queries in Database Education

Conference Leibniz International Proceedings in Informatics, LIPIcs · March 1, 2024 Data analytics skills have become an indispensable part of any education that seeks to prepare its students for the modern workforce. Essential in this skill set is the ability to work with structured relational data. Relational queries are based on logic ... Full text Cite

Characterizing and Verifying Queries Via CINSGEN

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · June 4, 2023 Example database instances can be very helpful in understanding complex queries. Different examples may illustrate alternative situations in which answers emerge in the query results and can be useful for testing. Examples can also help reveal semantic dif ... Full text Cite

Seventh Workshop on Human-In-the-Loop Data Analytics (HILDA)

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · June 4, 2023 HILDA brings together researchers and practitioners to exchange ideas and results on human-data interaction. It explores how data management and analysis can be made more effective when taking into account the people who design and build these processes as ... Full text Cite

Causal Inference in Data Analysis with Applications to Fairness and Explanations

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2023 Causal inference is a fundamental concept that goes beyond simple correlation and model-based prediction analysis, and is highly relevant in domains such as health, medicine, and the social sciences. Causal inference enables the estimation of the impact of ... Full text Cite

Causal What-If and How-To Analysis Using HypeR

Conference Proceedings - International Conference on Data Engineering · January 1, 2023 What-if and How-to queries are fundamental data analysis questions that provide insights about the effects of a hypothetical update without actually making changes to the database. Traditional systems assume independence across different tuples and non-up ... Full text Cite

Explaining Differentially Private Query Results With DPXPlain

Conference Proceedings of the VLDB Endowment · January 1, 2023 Employing Differential Privacy (DP), the state-of-the-art privacy standard, to answer aggregate database queries poses new challenges for users to understand the trends and anomalies observed in the query results: Is the unexpected answer due to the data i ... Full text Cite

DP-PQD: Privately Detecting Per-Query Gaps In Synthetic Data Generated By Black-Box Mechanisms

Journal Article Proceedings of the VLDB Endowment · January 1, 2023 Synthetic data generation methods, and in particular, private synthetic data generation methods, are gaining popularity as a means to make copies of sensitive databases that can be shared widely for research and data analysis. Some of the fundamental opera ... Full text Cite

Selectivity Functions of Range Queries are Learnable

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · June 10, 2022 This paper explores the use of machine learning for estimating the selectivity of range queries in database systems. Using classic learning theory for real-valued functions based on shattering dimension, we show that the selectivity function of a range spa ... Full text Cite

Understanding Queries by Conditional Instances

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · June 10, 2022 A powerful way to understand a complex query is by observing how it operates on data instances. However, specific database instances are not ideal for such observations: they often include large amounts of superfluous details that are not only irrelevant t ... Full text Cite

HypeR: Hypothetical Reasoning With What-If and How-To Queries Using a Probabilistic Causal Approach

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · June 10, 2022 What-if (provisioning for an update to a database) and how-to (how to modify the database to achieve a goal) analyses provide insights to users who wish to examine hypothetical scenarios without making actual changes to a database and thereby help plan str ... Full text Cite

CaJaDE: Explaining Query Results by Augmenting Provenance with Context

Conference Proceedings of the VLDB Endowment · January 1, 2022 In this work, we demonstrate CaJaDE (Context-Aware Join-Augmented Deep Explanations), a system that explains query results by augmenting provenance with contextual information from other related tables in the database. Given two query results whose differ ... Full text Cite

Toward Interpretable and Actionable Data Analysis with Explanations and Causality

Conference Proceedings of the VLDB Endowment · January 1, 2022 We live in a world dominated by data, where users from different fields routinely collect, study, and make decisions supported by data. To aid these users, the current trend in data analysis is to design tools that allow large-scale analytics, sophisticate ... Full text Cite

DPXPlain: Privately Explaining Aggregate Query Answers

Conference Proceedings of the VLDB Endowment · January 1, 2022 Differential privacy (DP) is the state-of-the-art and rigorous notion of privacy for answering aggregate database queries while preserving the privacy of sensitive information in the data. In today’s era of data analysis, however, it poses new challenges f ... Full text Cite

Trends in explanations: Understanding and debugging data-driven systems

Journal Article Foundations and Trends in Databases · August 2, 2021 Humans reason about the world around them by seeking to understand why and how something occurs. The same principle extends to the technology that so many of human activities increasingly rely on. Issues of trust, transparency, and understandability are cr ... Full text Cite

dame-flame: A Python Library Providing Fast Interpretable Matching for Causal Inference

Journal Article · January 5, 2021 dame-flame is a Python package for performing matching for observational causal inference on datasets containing discrete covariates. This package implements the Dynamic Almost Matching Exactly (DAME) and Fast Large-Scale Almost Matching Exactly (FLAME) al ... Open Access Link to item Cite

FLAME: A fast large-scale almost matching exactly approach to causal inference

Journal Article Journal of Machine Learning Research · January 1, 2021 A classical problem in causal inference is that of matching, where treatment units need to be matched to control units based on covariate information. In this work, we propose a method that computes high quality almost-exact matches for high-dimensional ca ... Open Access Cite

Putting Things into Context: Rich Explanations for Query Answers using Join Graphs

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · January 1, 2021 In many data analysis applications there is a need to explain why a surprising or interesting result was produced by a query. Previous approaches to explaining results have directly or indirectly relied on data provenance, i.e., input tuples contributing t ... Full text Cite

Properties of Inconsistency Measures for Databases

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · January 1, 2021 How should we quantify the inconsistency of a database that violates integrity constraints? Proper measures are important for various tasks, such as progress indication and action prioritization in cleaning systems, and reliability estimation for new datas ... Full text Cite

Making AI Machines Work for Humans in FoW

Journal Article SIGMOD Record · December 9, 2020 The Future of Work (FoW) is witnessing an evolution where AI systems are used to the benefit of humans. Work here refers to all forms of paid and unpaid labor in both physical and virtual workplaces and that is enabled by AI systems. This covers crowdsourc ... Full text Cite

Computing Local Sensitivities of Counting Queries with Joins

Journal Article Proceedings of the ACM SIGMOD International Conference on Management of Data · June 14, 2020 Local sensitivity of a query Q given a database instance D, i.e. how much the output Q(D) changes when a tuple is added to D or deleted from D, has many applications including query analysis, outlier detection, and differential privacy. However, it is NP-h ... Full text Cite

Causal Relational Learning

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · June 14, 2020 Causal inference is at the heart of empirical research in natural and social sciences and is critical for scientific discovery and informed decision making. The gold standard in causal inference is performing randomized controlled trials; unfortunately th ... Full text Cite

On Multiple Semantics for Declarative Database Repairs

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · June 14, 2020 We study the problem of database repairs through a rule-based framework that we refer to as Delta Rules. Delta rules are highly expressive and allow specifying complex, cross-relations repair logic associated with Denial Constraints, Causal Rules, and allo ... Full text Cite

Computing optimal repairs for functional dependencies

Journal Article ACM Transactions on Database Systems · February 17, 2020 We investigate the complexity of computing an optimal repair of an inconsistent database, in the case where integrity constraints are Functional Dependencies (FDs). We focus on two types of repairs: an optimal subset repair (optimal S-repair), which is obta ... Full text Cite

Adaptive Hyper-box Matching for Interpretable Individualized Treatment Effect Estimation

Journal Article CONFERENCE ON UNCERTAINTY IN ARTIFICIAL INTELLIGENCE (UAI 2020) · 2020 Open Access Cite

Almost-Matching-Exactly for Treatment Effect Estimation under Network Interference

Journal Article INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 108 · 2020 Cite

Almost-Matching-Exactly for Treatment Effect Estimation under Network Interference

Conference Proceedings of Machine Learning Research · January 1, 2020 We propose a matching method that recovers direct treatment effects from randomized experiments where units are connected in an observed network, and units that share edges can potentially influence each others' outcomes. Traditional treatment effect estim ... Cite

Adaptive Hyper-box Matching for Interpretable Individualized Treatment Effect Estimation

Conference Proceedings of Machine Learning Research · January 1, 2020 We propose a matching method for observational data that matches units with others in unit-specific, hyper-box-shaped regions of the covariate space. These regions are large enough that many matches are created for each unit and small enough that the treat ... Cite

Learning to sample: Counting with complex queries

Conference Proceedings of the VLDB Endowment · January 1, 2020 We study the problem of efficiently estimating counts for queries involving complex filters, such as user-defined functions, or predicates involving self-joins and correlated subqueries. For such queries, traditional sampling techniques may not be applicab ... Full text Cite

Aggregated deletion propagation for counting conjunctive query answers

Journal Article Proceedings of the VLDB Endowment · January 1, 2020 We investigate the computational complexity of minimizing the source side-effect in order to remove a given number of tuples from the output of a conjunctive query. This is a variant of the well-studied deletion propagation problem, the difference being th ... Full text Cite

I-Rex: An Interactive Relational Query Explainer for SQL

Journal Article Proceedings of the VLDB Endowment · January 1, 2020 We demonstrate I-REX, a system designed to help users understand SQL query evaluation and debug SQL queries. I-REX lets users interactively “trace” the evaluation of complex SQL queries, including those with correlated subqueries. I-REX also explains why ... Full text Cite

MuSe: Multiple Deletion Semantics for Data Repair

Journal Article Proceedings of the VLDB Endowment · January 1, 2020 We propose to demonstrate MuSe, a system for Database repairs where constraints are expressed as Declarative Rules and can be interpreted in different ways by using four different semantics. Our framework may capture common, cross-relation, repair semantic ... Full text Cite

RATest: Explaining Wrong Relational Queries Using Small Examples.

Conference Proceedings. ACM-SIGMOD International Conference on Management of Data · June 2019 We present a system called RATEST, designed to help debug relational queries against reference queries and test database instances. In many applications, e.g., classroom learning and regression testing, we test the correctness of a user query Q by e ... Full text Cite

iQCAR: inter-Query Contention Analyzer for Data Analytics Frameworks.

Conference Proceedings. ACM-SIGMOD International Conference on Management of Data · June 2019 Resource interferences caused by concurrent queries is one of the key reasons for unpredictable performance and missed workload SLAs in cluster computing systems. Analyzing these inter-query resource interactions is critical in order to answer time-sensiti ... Full text Cite

Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances.

Conference Proceedings. ACM-SIGMOD International Conference on Management of Data · June 2019 Provenance and intervention-based techniques have been used to explain surprisingly high or low outcomes of aggregation queries. However, such techniques may miss interesting explanations emerging from data that is not in the provenance. For instanc ... Full text Cite

Explaining Wrong Queries Using Small Examples.

Journal Article Proceedings. ACM-SIGMOD International Conference on Management of Data · June 2019 For testing the correctness of SQL queries, e.g., evaluating student submissions in a database course, a standard practice is to execute the query in question on some test database instance and compare its result with that of the correct query. Given two q ... Full text Cite

Interpretable Almost-Exact Matching for Causal Inference.

Journal Article Proceedings of machine learning research · April 2019 Matching methods are heavily used in the social and health sciences due to their interpretability. We aim to create the highest possible quality of treatment-control matches for categorical data in the potential outcomes framework. The method proposed in t ... Cite

Interpretable almost-matching-exactly with instrumental variables

Journal Article 35th Conference on Uncertainty in Artificial Intelligence, UAI 2019 · January 1, 2019 Uncertainty in the estimation of the causal effect in observational studies is often due to unmeasured confounding, i.e., the presence of unobserved covariates linki ... Cite

Interpretable almost-matching-exactly with instrumental variables

Conference 35th Conference on Uncertainty in Artificial Intelligence, UAI 2019 · January 1, 2019 Uncertainty in the estimation of the causal effect in observational studies is often due to unmeasured confounding, i.e., the presence of unobserved covariates linking treatments and outcomes. Instrumental Variables (IV) are commonly used to reduce the eff ... Cite

Interpretable Almost-Matching-Exactly With Instrumental Variables

Conference Proceedings of Machine Learning Research · January 1, 2019 Uncertainty in the estimation of the causal effect in observational studies is often due to unmeasured confounding, i.e., the presence of unobserved covariates linking treatments and outcomes. Instrumental Variables (IV) are commonly used to reduce the eff ... Cite

iQCAR

Conference Proceedings of the ACM Symposium on Cloud Computing · October 11, 2018 Full text Cite

Interpretable Almost Matching Exactly for Causal Inference

Journal Article · June 18, 2018 We aim to create the highest possible quality of treatment-control matches for categorical data in the potential outcomes framework. Matching methods are heavily used in the social sciences due to their interpretability, but most matching methods do not pa ... Link to item Cite

QAGView: Interactively Summarizing High-Valued Aggregate Query Answers.

Conference Proceedings. ACM-SIGMOD International Conference on Management of Data · June 2018 Methods for summarizing and diversifying query results have drawn significant attention recently, because they help present query results with lots of tuples to users in more informative ways. We present QAGView (Quick AGgregate View), which provides a hol ... Full text Cite

iQCAR: A Demonstration of an Inter-Query Contention Analyzer for Cluster Computing Frameworks.

Conference Proceedings. ACM-SIGMOD International Conference on Management of Data · June 2018 Unpredictability in query runtimes can arise in a shared cluster as a result of resource contentions caused by inter-query interactions. iQCAR - inter Query Contention AnalyzeR is a system that formally models these inter ... Full text Cite

Computing Optimal Repairs for Functional Dependencies.

Journal Article Proceedings of the ... ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems · June 2018 We investigate the complexity of computing an optimal repair of an inconsistent database, in the case where integrity constraints are Functional Dependencies (FDs). We focus on two types of repairs: an optimal subset repair (optimal S-repair) that is obtai ... Full text Cite

Interactive summarization and exploration of top aggregate query answers

Conference Proceedings of the VLDB Endowment · January 1, 2018 We present a system for summarization and interactive exploration of high-valued aggregate query answers to make a large set of possible answers more informative to the user. Our system outputs a set of clusters on the high-valued query answers showing the ... Full text Cite

Opportunities for data management research in the era of horizontal AI/ML

Conference Proceedings of the VLDB Endowment · January 1, 2018 AI/ML is becoming a horizontal technology: its application is expanding to more domains, and its integration touches more parts of the technology stack. Given the strong dependence of ML on data, this expansion creates a new space for applying data managem ... Full text Cite

CAPE: Explaining outliers by counterbalancing

Conference Proceedings of the VLDB Endowment · January 1, 2018 In this demonstration we showcase Cape, a system that explains surprising aggregation outcomes. In contrast to previous work, which relies exclusively on provenance, Cape explains outliers in aggregation queries through related outliers in the opposite dir ... Full text Cite

LensXPlain: Visualizing and explaining contributing subsets for aggregate query answers

Conference Proceedings of the VLDB Endowment · January 1, 2018 In this demonstration, we will present LensXPlain, an interactive system to help users understand answers of aggregate queries by providing meaningful explanations. Given a SQL group-by query and a question from a user "why output o is high/low", or "why ... Full text Cite

Answering Conjunctive Queries with Inequalities

Journal Article Theory of Computing Systems · July 1, 2017 In this paper, we study the complexity of answering conjunctive queries (CQ) with inequalities (≠). In particular, we are interested in comparing the complexity of the query with and without inequalities. The main contribution of our work is a novel combin ... Full text Cite

Optimizing iceberg queries with complex joins

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · May 9, 2017 Iceberg queries, commonly used for decision support, find groups whose aggregate values are above or below a threshold. In practice, iceberg queries are often posed over complex joins that are expensive to evaluate. This paper proposes a framework for comb ... Full text Cite

Exact model counting of query expressions: Limitations of propositional methods

Conference ACM Transactions on Database Systems · February 1, 2017 We prove exponential lower bounds on the running time of the state-of-the-art exact model counting algorithms (algorithms for exactly computing the number of satisfying assignments, or the satisfying probability, of Boolean formulas). These algorithms can be ... Full text Cite

Explaining query answers with explanation-ready databases

Chapter · January 1, 2016 With the increased generation and availability of big data in different domains, there is an imminent requirement for data analysis tools that are able to 'explain' the trends and anomalies obtained from this data to a range of users with different backgro ... Cite

Top-k and clustering with noisy comparisons

Journal Article ACM Transactions on Database Systems · December 30, 2014 We study the problems of max/top-k and clustering when the comparison operations may be performed by oracles whose answer may be erroneous. Comparisons may either be of type or of value: given two data elements, the answer to a type comparison is "yes" if ... Full text Cite

Provenance views for module privacy

Conference Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems · July 15, 2011 Scientific workflow systems increasingly store provenance information about the module executions used to produce a data item, as well as the parameter settings and intermediate data items passed between module executions. However, authors/owners of workfl ... Full text Cite

Hiding Data and Structure in Workflow Provenance

Other International Workshop on Databases in Networked Information Systems (DNIS) · 2011 Cite

On Provenance and Privacy

Other International Conference on Database Theory (ICDT) · 2011 Cite

Privacy Issues in Scientific Workflow Provenance

Other International Workshop on Workflow Approaches to New Data-centric Science (WANDS) · 2010 Cite

Tool for translating simulink models into input language of a model checker

Journal Article Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2006 Model Based Development (MBD) using Mathworks tools like Simulink, Stateflow etc. is being pursued in Honeywell for the development of safety critical avionics software. Formal verification techniques are well-known to identify design errors of safety crit ... Full text Cite

Detector concepts

Conference LCWS 2005 - 2005 International Linear Collider Workshop · January 1, 2005 Cite