Jun Yang

Bishop-MacDermott Family Professor
Computer Science
Box 90129, Durham, NC 27708-0129
D308 LSRC, Durham, NC 27708

Selected Publications


Development and validation of VaxConcerns: A taxonomy of vaccine concerns and misinformation with Crowdsource-Viability.

Journal Article Vaccine · April 2024 We present VaxConcerns, a taxonomy for vaccine concerns and misinformation. VaxConcerns is an easy-to-teach taxonomy of concerns and misinformation commonly found among online anti-vaccination media and is evaluated to produce high-quality data annotations ... Full text Cite

Computing Data Distribution from Query Selectivities

Conference Leibniz International Proceedings in Informatics, LIPIcs · March 1, 2024 We are given a set Z = {(R_1, s_1), ..., (R_n, s_n)}, where each R_i is a range in ℝ^d, such as rectangle or ball, and s_i ∈ [0, 1] denotes its selectivity. The goal is to compute a small-size discrete data distribution D = {(q_1, w_1), ..., (q_m, w_m)}, where q_j ∈ R ... Full text Cite

How Database Theory Helps Teach Relational Queries in Database Education

Conference Leibniz International Proceedings in Informatics, LIPIcs · March 1, 2024 Data analytics skills have become an indispensable part of any education that seeks to prepare its students for the modern workforce. Essential in this skill set is the ability to work with structured relational data. Relational queries are based on logic ... Full text Cite

Characterizing and Verifying Queries Via CINSGEN

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · June 4, 2023 Example database instances can be very helpful in understanding complex queries. Different examples may illustrate alternative situations in which answers emerge in the query results and can be useful for testing. Examples can also help reveal semantic dif ... Full text Cite

Interface Design for Crowdsourcing Hierarchical Multi-Label Text Annotations

Conference Conference on Human Factors in Computing Systems - Proceedings · April 19, 2023 Human data labeling is an important and expensive task at the heart of supervised learning systems. Hierarchies help humans understand and organize concepts. We ask whether and how concept hierarchies can inform the design of annotation interfaces to impro ... Full text Cite

Selectivity Functions of Range Queries are Learnable

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · June 10, 2022 This paper explores the use of machine learning for estimating the selectivity of range queries in database systems. Using classic learning theory for real-valued functions based on shattering dimension, we show that the selectivity function of a range spa ... Full text Cite

Computing Complex Temporal Join Queries Efficiently

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · June 10, 2022 This paper studies multi-way join queries over temporal data, where each tuple is associated with a valid time interval indicating when the tuple is valid. A temporal join requires that joining tuples' valid intervals intersect. Previous work on temporal j ... Full text Cite

Understanding Queries by Conditional Instances

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · June 10, 2022 A powerful way to understand a complex query is by observing how it operates on data instances. However, specific database instances are not ideal for such observations: they often include large amounts of superfluous details that are not only irrelevant t ... Full text Cite

Dynamic enumeration of similarity joins

Journal Article Leibniz International Proceedings in Informatics, LIPIcs · July 1, 2021 This paper considers enumerating answers to similarity-join queries under dynamic updates: Given two sets of n points A, B in ℝ^d, a metric φ(·), and a distance threshold r > 0, report all pairs of points (a, b) ∈ A × B with φ(a, b) ≤ r. Our goal is to store ... Full text Cite

Durable top-k instant-stamped temporal records with user-specified scoring functions

Journal Article Proceedings - International Conference on Data Engineering · April 1, 2021 A way of finding interesting or exceptional records from instant-stamped temporal data is to consider their "durability," or, intuitively speaking, how well they compare with other records that arrived earlier or later, and how long they retain their supre ... Full text Cite

Efficiently Answering Durability Prediction Queries

Journal Article Proceedings of the ACM SIGMOD International Conference on Management of Data · January 1, 2021 We consider a class of queries called durability prediction queries that arise commonly in predictive analytics, where we use a given predictive model to answer questions about possible futures to inform our decisions. Examples of durability prediction que ... Full text Cite

Efficiently Answering Durability Prediction Queries.

Conference SIGMOD Conference · 2021 Cite

Poirot: Private contact summary aggregation: Poster abstract

Conference SenSys 2020 - Proceedings of the 2020 18th ACM Conference on Embedded Networked Sensor Systems · November 16, 2020 Physical distancing between individuals is key to preventing the spread of a disease such as COVID-19. On the one hand, having access to information about physical interactions is critical for decision makers; on the other, this information is sensitive an ... Full text Cite

Selecting data to clean for fact checking: Minimizing uncertainty vs. maximizing surprise

Journal Article Proceedings of the VLDB Endowment · January 1, 2020 We study the optimization problem of selecting numerical quantities to clean in order to fact-check claims based on such data. Oftentimes, such claims are technically correct, but they can still mislead for two reasons. First, data may contain uncertainty ... Full text Cite

Learning to sample: Counting with complex queries

Conference Proceedings of the VLDB Endowment · January 1, 2020 We study the problem of efficiently estimating counts for queries involving complex filters, such as user-defined functions, or predicates involving self-joins and correlated subqueries. For such queries, traditional sampling techniques may not be applicab ... Full text Cite

I-Rex: An Interactive Relational Query Explainer for SQL

Journal Article Proceedings of the VLDB Endowment · January 1, 2020 We demonstrate I-REX, a system designed to help users understand SQL query evaluation and debug SQL queries. I-REX lets users interactively “trace” the evaluation of complex SQL queries, including those with correlated subqueries. I-REX also explains why ... Full text Cite

Special Issue of DASFAA 2019

Journal Article Data Science and Engineering · September 1, 2019 Full text Cite

RATest: Explaining Wrong Relational Queries Using Small Examples.

Conference Proceedings. ACM-SIGMOD International Conference on Management of Data · June 2019 We present a system called RATEST, designed to help debug relational queries against reference queries and test database instances. In many applications, e.g., classroom learning and regression testing, we test the correctness of a user query Q by e ... Full text Cite

Explaining Wrong Queries Using Small Examples.

Journal Article Proceedings. ACM-SIGMOD International Conference on Management of Data · June 2019 For testing the correctness of SQL queries, e.g., evaluating student submissions in a database course, a standard practice is to execute the query in question on some test database instance and compare its result with that of the correct query. Given two q ... Full text Cite

Introduction to the special issue on combating digital misinformation and disinformation

Journal Article Journal of Data and Information Quality · January 1, 2019 Full text Cite

QAGView: Interactively Summarizing High-Valued Aggregate Query Answers.

Conference Proceedings. ACM-SIGMOD International Conference on Management of Data · June 2018 Methods for summarizing and diversifying query results have drawn significant attention recently, because they help present query results with lots of tuples to users in more informative ways. We present QAGView (Quick AGgregate View), which provides a hol ... Full text Cite

Computing Optimal Repairs for Functional Dependencies.

Journal Article Proceedings of the ... ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems · June 2018 We investigate the complexity of computing an optimal repair of an inconsistent database, in the case where integrity constraints are Functional Dependencies (FDs). We focus on two types of repairs: an optimal subset repair (optimal S-repair) that is obtai ... Full text Cite

Interactive summarization and exploration of top aggregate query answers

Conference Proceedings of the VLDB Endowment · January 1, 2018 We present a system for summarization and interactive exploration of high-valued aggregate query answers to make a large set of possible answers more informative to the user. Our system outputs a set of clusters on the high-valued query answers showing the ... Full text Cite

Durable top-k queries on temporal data

Conference Proceedings of the VLDB Endowment · January 1, 2018 Many datasets have a temporal dimension and contain a wealth of historical information. When using such data to make decisions, we often want to examine not only the current snapshot of the data but also its history. For example, given a result object of a ... Full text Cite

Efficient knowledge graph accuracy evaluation

Journal Article Proceedings of the VLDB Endowment · January 1, 2018 Estimation of the accuracy of a large-scale knowledge graph (KG) often requires humans to annotate samples from the graph. How to obtain statistically meaningful estimates for accuracy evaluation while keeping human annotation costs low is a problem critic ... Full text Cite

On log-structured merge for solid-state drives

Conference Proceedings - International Conference on Data Engineering · May 16, 2017 Log-structured merge (LSM) is an increasingly prevalent approach to indexing, especially for modern write-heavy workloads. LSM organizes data in levels with geometrically increasing sizes. Records enter the top level; whenever a level fills up, it is merged ... Full text Cite

Optimizing iceberg queries with complex joins

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · May 9, 2017 Iceberg queries, commonly used for decision support, find groups whose aggregate values are above or below a threshold. In practice, iceberg queries are often posed over complex joins that are expensive to evaluate. This paper proposes a framework for comb ... Full text Cite

Data management in machine learning: Challenges, techniques, and systems

Conference Proceedings of the ACM SIGMOD International Conference on Management of Data · May 9, 2017 Large-scale data analytics using statistical machine learning (ML), popularly called advanced analytics, underpins many modern data-driven applications. The data management community has been working for over a decade on tackling data management-related ch ... Full text Cite

Computational fact checking through query perturbations

Journal Article ACM Transactions on Database Systems · January 1, 2017 Our media is saturated with claims of "facts" made from data. Database research has in the past focused on how to answer queries, but has not devoted much attention to discerning more subtle qualities of the resulting claims, for example, is a claim "cherry- ... Full text Cite

Finding diverse, high-value representatives on a surface of answers

Conference Proceedings of the VLDB Endowment · January 1, 2017 In many applications, the system needs to selectively present a small subset of answers to users. The set of all possible answers can be seen as an elevation surface over a domain, where the elevation measures the quality of each answer, and the dimensions ... Full text Cite

Cümülön-D: Data analytics in a dynamic spot market

Conference Proceedings of the VLDB Endowment · January 1, 2017 We present a system called Cümülön-D for matrix-based data analysis in a spot market of a public cloud. Prices in such markets fluctuate over time: while users can acquire machines usually at a very low bid price, the cloud can terminate these machines as ... Full text Cite

Top-k Preferences in High Dimensions

Journal Article IEEE Transactions on Knowledge and Data Engineering · February 1, 2016 Given a set of objects O, each with d numeric attributes, a top-k preference scores these objects using a linear combination of their attribute values, where the weight on each attribute reflects the interest in this attribute. Given a query preference q, ... Full text Cite

Cümülön: Matrix-based data analytics in the cloud with spot instances

Chapter · January 1, 2016 We describe Cümülön, a system aimed at helping users develop and deploy matrix-based data analysis programs in a public cloud. A key feature of Cümülön is its end-to-end support for the so-called spot instances, machines whose market price fluctuates over t ... Cite

Efficient evaluation of object-centric exploration queries for visualization

Chapter · January 1, 2015 The most effective way to explore data is through visualizing the results of exploration queries. For example, an exploration query could be an aggregate of some measures over time intervals, and a pattern or abnormality can be discovered through a time se ... Full text Cite

Perturbation analysis of database queries

Conference Proceedings of the VLDB Endowment · January 1, 2015 We present a system, Perada, for parallel perturbation analysis of database queries. Perturbation analysis considers the results of a query evaluated with (a typically large number of) different parameter settings, to help discover leads and evaluate claim ... Full text Cite

Toward computational fact-checking

Journal Article Proceedings of the VLDB Endowment · January 1, 2014 Our news are saturated with claims of "facts" made from data. Database research has in the past focused on how to answer queries, but has not devoted much attention to discerning more subtle qualities of the resulting claims, e.g., is a claim "cherry-picking ... Full text Cite

Top-k preferences in high dimensions

Journal Article Proceedings - International Conference on Data Engineering · January 1, 2014 Given a set of objects O, each with d numeric attributes, a top-k preference scores these objects using a linear combination of their attribute values, where the weight on each attribute reflects the interest in this attribute. Given a query preference q, ... Full text Cite

Incremental discovery of prominent situational facts

Journal Article Proceedings - International Conference on Data Engineering · January 1, 2014 We study the novel problem of finding new, prominent situational facts, which are emerging statements about objects that stand out within certain contexts. Many such facts are newsworthy - e.g., an athlete's outstanding performance in a game, or a viral vi ... Full text Cite

ICheck: Computationally combating "lies, d-ned lies, and statistics"

Journal Article Proceedings of the ACM SIGMOD International Conference on Management of Data · January 1, 2014 Are you fed up with "lies, d-ned lies, and statistics" made up from data in our media? For claims based on structured data, we present a system to automatically assess the quality of claims (beyond their correctness) and counter misleading claims that ch ... Full text Cite

Data in, fact out: Automated monitoring of facts by FactWatcher

Journal Article Proceedings of the VLDB Endowment · January 1, 2014 Towards computational journalism, we present FactWatcher, a system that helps journalists identify data-backed, attention-seizing facts which serve as leads to news stories. FactWatcher discovers three types of facts, including situational facts, one-of-th ... Full text Cite

Message from the program co-chairs

Conference ACM International Conference Proceeding Series · January 1, 2014 Cite

Cumulon: Optimizing statistical data analysis in the cloud

Journal Article Proceedings of the ACM SIGMOD International Conference on Management of Data · July 29, 2013 We present Cumulon, a system designed to help users rapidly develop and intelligently deploy matrix-based big-data analysis programs in the cloud. Cumulon features a flexible execution model and new operators especially suited for such workloads. We show h ... Full text Cite

Failure-aware cascaded suppression in wireless sensor networks

Journal Article IEEE Transactions on Knowledge and Data Engineering · April 8, 2013 Wireless sensor networks are widely used to continuously collect data from the environment. Because of energy constraints on battery-powered nodes, it is critical to minimize communication. Suppression has been proposed as a way to reduce communication by ... Full text Cite

Efficient external memory structures for range-aggregate queries

Journal Article Computational Geometry: Theory and Applications · April 1, 2013 We present external memory data structures for efficiently answering range-aggregate queries. The range-aggregate problem is defined as follows: Given a set of weighted points in ℝ^d, compute the aggregate of the weights of the points that lie inside a d-di ... Full text Cite

Big and useful: What's in the data for me? (panel description)

Journal Article Proceedings of the VLDB Endowment · January 1, 2013 Full text Cite

Permuting data on random-access block storage

Journal Article Proceedings of the VLDB Endowment · January 1, 2013 Permutation is a fundamental operator for array data, with applications in, for example, changing matrix layouts and reorganizing data cubes. We consider the problem of permuting large quantities of data stored on secondary storage that supports fast rando ... Full text Cite

A practical concurrent index for solid-state drives

Journal Article ACM International Conference Proceeding Series · December 19, 2012 Solid-state drives are becoming a viable alternative to magnetic disks in database systems, but their performance characteristics, particularly those caused by their erase-before-write behavior, make conventional database indexes a poor fit. There have bee ... Full text Cite

On "one of the few" objects

Journal Article Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining · September 14, 2012 Objects with multiple numeric attributes can be compared within any "subspace" (subset of attributes). In applications such as computational journalism, users are interested in claims of the form: Karl Malone is one of the only two players in NBA history w ... Full text Cite

Subscriber assignment for wide-area content-based publish/subscribe

Journal Article IEEE Transactions on Knowledge and Data Engineering · August 29, 2012 We study the problem of assigning subscribers to brokers in a wide-area content-based publish/subscribe system. A good assignment should consider both subscriber interests in the event space and subscriber locations in the network space, and balance multip ... Full text Cite

Processing and notifying range top-k subscriptions

Journal Article Proceedings - International Conference on Data Engineering · July 30, 2012 We consider how to support a large number of users over a wide-area network whose interests are characterised by range top-k continuous queries. Given an object update, we need to notify users whose top-k results are affected. Simple solutions include usin ... Full text Cite

Processing a large number of continuous preference top-k queries

Journal Article Proceedings of the ACM SIGMOD International Conference on Management of Data · June 28, 2012 Given a set of objects, each with multiple numeric attributes, a (preference) top-k query retrieves the k objects with the highest scores according to a user preference, defined as a linear combination of attribute values. We consider the problem of proces ... Full text Cite

Optimizing I/O for big array analytics

Journal Article Proceedings of the VLDB Endowment · January 1, 2012 Big array analytics is becoming indispensable in answering important scientific and business questions. Most analysis tasks consist of multiple steps, each making one or multiple passes over the arrays to be analyzed and generating intermediate results. In ... Full text Cite

Materialized views

Journal Article Foundations and Trends in Databases · December 1, 2011 Materialized views are queries whose results are stored and maintained in order to facilitate access to data in their underlying base tables. In the SQL setting, they are now considered a mature technology implemented by most commercial database systems an ... Full text Cite

Computational journalism: A call to arms to database researchers

Journal Article CIDR 2011 - 5th Biennial Conference on Innovative Data Systems Research, Conference Proceedings · October 11, 2011 Cite

Inferential ecosystem models, from network data to prediction.

Journal Article Ecological Applications: a publication of the Ecological Society of America · July 2011 Recent developments suggest that predictive modeling could begin to play a larger role not only for data analysis, but also for data collection. We address the example of efficient wireless sensor networks, where inferential ecosystem models can be used to ... Full text Cite

Subscriber assignment for wide-area content-based publish/subscribe

Journal Article Proceedings - International Conference on Data Engineering · June 6, 2011 We study the problem of assigning subscribers to brokers in a wide-area content-based publish/subscribe system. A good assignment should consider both subscriber interests in the event space and subscriber locations in the network space, and balance multip ... Full text Cite

Storing matrices on disk: Theory and practice revisited

Conference Proceedings of the VLDB Endowment · January 1, 2011 We consider the problem of storing arrays on disk to support scalable data analysis involving linear algebra. We propose Linearized Array B-tree, or LAB-tree, which supports flexible array layouts and automatically adapts to varying sparsity across parts o ... Full text Cite

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): Preface

Journal Article Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · August 3, 2010 Cite

I/O-efficient statistical computing with RIOT

Journal Article Proceedings - International Conference on Data Engineering · June 1, 2010 Statistical analysis of massive data is becoming indispensable to science, commerce, and society today. Such analysis requires efficient, flexible storage support and special optimization techniques. In this demo, we present RIOT (R with I/O Transparency), ... Full text Cite

Optimizing complex extraction programs over evolving text data

Journal Article SIGMOD-PODS'09 - Proceedings of the International Conference on Management of Data and 28th Symposium on Principles of Database Systems · December 4, 2009 Most information extraction (IE) approaches have considered only static text corpora, over which we apply IE only once. Many real-world text corpora however are dynamic. They evolve over time, and so to keep extracted information up to date we often must a ... Full text Cite

RIOT: I/O efficient numerical computing without SQL

Journal Article CIDR 2009 - 4th Biennial Conference on Innovative Data Systems Research · December 1, 2009 R is a numerical computing environment that is widely popular for statistical data analysis. Like many such environments, R performs poorly for large datasets whose sizes exceed that of physical memory. We present our vision of RIOT (R with I/O Transparenc ... Cite

Input-sensitive scalable continuous join query processing

Journal Article ACM Transactions on Database Systems · August 1, 2009 This article considers the problem of scalably processing a large number of continuous queries. Our approach, consisting of novel data structures and algorithms and a flexible processing framework, advances the state-of-the-art in several ways. First, our ... Full text Cite

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): Preface

Journal Article Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · July 15, 2009 Cite

Weighted proximity best-joins for information retrieval

Journal Article Proceedings - International Conference on Data Engineering · July 8, 2009 We consider the problem of efficiently computing weighted proximity best-joins over multiple lists, with applications in information retrieval and extraction. We are given a multi-term query, and for each query term, a list of all its matches with scores, ... Full text Cite

ProSem: Scalable wide-area publish/subscribe

Journal Article Proceedings of the ACM SIGMOD International Conference on Management of Data · December 10, 2008 We demonstrate ProSem, a scalable wide-area publish/subscribe system that supports complex, stateful subscriptions as well as simple ones. One unique feature of ProSem is its cost-based joint optimization of both subscription processing and notification di ... Full text Cite

Message from the DMSN'08 organizing committee

Journal Article 5th International Workshop on Data Management for Sensor Networks, DMSN'08, In Conjunction with the 34th International Conference on Very Large Data Bases · December 1, 2008 Cite

A sampling-based approach to information recovery

Journal Article Proceedings - International Conference on Data Engineering · October 1, 2008 There has been a recent resurgence of interest in research on noisy and incomplete data. Many applications require information to be recovered from such data. Ideally, an approach for information recovery should have the following features. First, it shoul ... Full text Cite

Efficient information extraction over evolving text data

Journal Article Proceedings - International Conference on Data Engineering · October 1, 2008 Most current information extraction (IE) approaches have considered only static text corpora, over which we typically have to apply IE only once. Many real-world text corpora however are dynamic. They evolve over time, and to keep extracted information up ... Full text Cite

End-to-End support for joins in large-scale publish/subscribe systems

Journal Article Proceedings of the VLDB Endowment · January 1, 2008 We address the problem of supporting a large number of select-join subscriptions for wide-area publish/subscribe. Subscriptions are joins over different tables, with varying interests expressed as range selection conditions over table attributes. Naive sch ... Full text Cite

Data-driven processing in sensor networks

Conference CIDR 2007 - 3rd Biennial Conference on Innovative Data Systems Research · December 1, 2007 Wireless sensor networks are poised to enable continuous data collection on unprecedented scales, in terms of area location and size, and frequency. This is a great boon to fields such as ecological modeling. We are collaborating with researchers to build ... Cite

Report on the fourth international workshop on Data Management for Sensor Networks (DMSN 2007)

Journal Article SIGMOD Record · December 1, 2007 A report on the Fourth International Workshop on Data Management for Sensor Networks (DMSN), which was held on September 24, 2007, is presented. The topics presented include a keystone address, three research sessions, panel discussion on the present and t ... Full text Cite

Query suspend and resume

Journal Article Proceedings of the ACM SIGMOD International Conference on Management of Data · October 30, 2007 Suppose a long-running analytical query is executing on a database server and has been allocated a large amount of physical memory. A high-priority task comes in and we need to run it immediately with all available resources. We have several choices. We co ... Full text Cite

BLINKS: Ranked keyword searches on graphs

Journal Article Proceedings of the ACM SIGMOD International Conference on Management of Data · October 30, 2007 Query processing over graph-structured data is enjoying a growing number of applications. A top-k keyword search query on a graph finds the top k answers according to some ranking criteria, where each answer is a substructure of the graph containing all qu ... Full text Cite

On suspending and resuming dataflows

Journal Article Proceedings - International Conference on Data Engineering · September 24, 2007 Full text Cite

Many-to-many aggregation for sensor networks

Journal Article Proceedings - International Conference on Data Engineering · September 24, 2007 Wireless sensor networks have enormous potential to aid data collection in a number of areas, such as environmental and wildlife research. In this paper, we address the challenges of supporting many-to-many aggregation in a sensor network. An application o ... Full text Cite

Suppression and failures in sensor networks: A Bayesian approach

Conference 33rd International Conference on Very Large Data Bases, VLDB 2007 - Conference Proceedings · January 1, 2007 Sensor networks allow continuous data collection on unprecedented scales. The primary limiting factor of such networks is energy, of which communication is the dominant consumer. The default strategy of nodes continually reporting their data to the root re ... Cite

Value-based notification conditions in large-scale publish/subscribe systems

Conference 33rd International Conference on Very Large Data Bases, VLDB 2007 - Conference Proceedings · January 1, 2007 We address the problem of providing scalable support for subscriptions with personalized value-based notification conditions in wide-area publish/subscribe systems. Notification conditions can be fine-tuned by subscribers, allowing precise and flexible con ... Cite

From data reverence to data relevance: Model-mediated wireless sensing of the physical environment

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2007 Wireless sensor networks can be viewed as the integration of three subsystems: a low-impact in situ data acquisition and collection system, a system for inference of process models from observed data and a priori information, and a system that controls the ... Full text Cite

On suspending and resuming dataflows

Conference 2007 IEEE 23RD INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3 · January 1, 2007 Link to item Cite

Many-to-many aggregation for sensor networks

Conference 2007 IEEE 23RD INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3 · January 1, 2007 Link to item Cite

Energy-efficient monitoring of extreme values in sensor networks

Journal Article Proceedings of the ACM SIGMOD International Conference on Management of Data · December 1, 2006 Monitoring extreme values (MAX or MIN) is a fundamental problem in wireless sensor networks (and in general, complex dynamic systems). This problem presents very different algorithmic challenges from aggregate and selection queries, in the sense that an in ... Full text Cite

Constraint chaining: On energy-efficient continuous monitoring in sensor networks

Journal Article Proceedings of the ACM SIGMOD International Conference on Management of Data · December 1, 2006 Wireless sensor networks have created new opportunities for data collection in a variety of scenarios, such as environmental and industrial, where we expect data to be temporally and spatially correlated. Researchers may want to continuously collect all se ... Full text Cite

On the database/network interface in large-scale publish/subscribe systems

Journal Article Proceedings of the ACM SIGMOD International Conference on Management of Data · December 1, 2006 The work performed by a publish/subscribe system can conceptually be divided into subscription processing and notification dissemination. Traditionally, research in the database and networking communities has focused on these aspects in isolation. The inte ... Full text Cite

A sampling-based approach to optimizing top-k queries in sensor networks

Journal Article Proceedings - International Conference on Data Engineering · October 17, 2006 Wireless sensor networks generate a vast amount of data. This data, however, must be sparingly extracted to conserve energy, usually the most precious resource in battery-powered sensors. When approximation is acceptable, a model-driven approach to query p ... Full text Cite

Dual labeling: Answering graph reachability queries in constant time

Journal Article Proceedings - International Conference on Data Engineering · October 17, 2006 Graph reachability is fundamental to a wide range of applications, including XML indexing, geographic navigation, Internet routing, ontology queries based on RDF/OWL, etc. Many applications involve huge graphs and require fast answering of reachability que ... Full text Cite

Energy-efficient continuous isoline queries in sensor networks

Journal Article Proceedings - International Conference on Data Engineering · October 17, 2006 Environmental monitoring is a promising application for sensor networks. Many scenarios produce geographically correlated readings, making them visually interesting and good targets for the isoline query. This query depicts boundaries showing how values ch ... Full text Cite

Distributed network querying with bounded approximate caching

Journal Article Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · July 7, 2006 As networks continue to grow in size and complexity, distributed network monitoring and resource querying are becoming increasingly difficult. Our aim is to design, build, and evaluate a scalable infrastructure for answering queries over distributed measur ... Full text Cite

Model-driven dynamic control of embedded wireless sensor networks

Journal Article Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2006 Next-generation wireless sensor networks may revolutionize understanding of environmental change by assimilating heterogeneous data, assessing the relative value and costs of data collection, and scheduling activities accordingly. Thus, they are dynamic, d ... Full text Cite

Scalable continuous query processing by tracking hotspots

Conference VLDB 2006 - Proceedings of the 32nd International Conference on Very Large Data Bases · January 1, 2006 This paper considers the problem of scalably processing a large number of continuous queries. We propose a flexible framework with novel data structures and algorithms for group-processing and indexing continuous queries by exploiting potential overlaps in ... Cite

Asymmetric batch incremental view maintenance

Journal Article Proceedings - International Conference on Data Engineering · December 12, 2005 Incremental view maintenance has found a growing number of applications recently, including data warehousing, continuous query processing, publish/subscribe systems, etc. Batch processing of base table modifications, when applicable, can be much more effic ... Full text Cite

BOXes: Efficient maintenance of order-based labeling for dynamic XML data

Journal Article Proceedings - International Conference on Data Engineering · December 12, 2005 Order-based element labeling for tree-structured XML data is an important technique in XML processing. It lies at the core of many fundamental XML operations such as containment join and twig matching. While labeling for static XML documents is well unders ... Full text Cite

Monitoring continuous band-join queries over dynamic data

Journal Article Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · December 1, 2005 A continuous query is a standing query over a dynamic data set whose query result needs to be constantly updated as new data arrive. We consider the problem of constructing a data structure on a set of continuous band-join queries over two data sets R and ... Full text Cite

On joining and caching stochastic streams

Journal Article Proceedings of the ACM SIGMOD International Conference on Management of Data · December 1, 2005 We consider the problem of joining data streams using limited cache memory, with the goal of producing as many result tuples as possible from the cache. Many cache replacement heuristics have been proposed in the past. Their performance often relies on imp ... Full text Cite

Online view maintenance under a response-time constraint

Journal Article Lecture Notes in Computer Science · January 1, 2005 A materialized view is a certain synopsis structure precomputed from one or more data sets (called base tables) in order to facilitate various queries on the data. When the underlying base tables change, the materialized view also needs to be updated accor ... Full text Cite

Compact reachability labeling for graph-structured data

Journal Article International Conference on Information and Knowledge Management, Proceedings · January 1, 2005 Testing reachability between nodes in a graph is a well-known problem with many important applications, including knowledge representation, program analysis, and more recently, biological and ontology databases inferencing as well as XML query processing. ... Full text Cite

AUTOBIB: Automatic extraction of bibliographic information on the Web

Journal Article Proceedings of the International Database Engineering and Applications Symposium, IDEAS · October 25, 2004 The Web has greatly facilitated access to information. However, information presented in HTML is mainly intended to be browsed by humans, and the problem of automatically extracting such information remains an important and challenging task. In this work, ... Full text Cite

Multiresolution indexing of XML for frequent queries

Journal Article Proceedings - International Conference on Data Engineering · June 1, 2004 XML and other types of semi-structured data are typically represented by a labeled directed graph. To speed up path expression queries over the graph, a variety of structural indexes have been proposed. They usually work by partitioning nodes in the data g ... Cite

NEXSORT: Sorting XML in external memory

Journal Article Proceedings - International Conference on Data Engineering · June 1, 2004 XML plays an important role in delivering data over the Internet, and the need to store and manipulate XML in its native format has become increasingly relevant. This growing need necessitates work on developing native XML operators, especially for one as ... Full text Cite

Incremental maintenance of XML structural indexes

Journal Article Proceedings of the ACM SIGMOD International Conference on Management of Data · January 1, 2004 Increasing popularity of XML in recent years has generated much interest in query processing over graph-structured data. To support efficient evaluation of path expressions, many structural indexes have been proposed. The most popular ones are the 1-index, ... Full text Cite

Efficient maintenance of materialized top-k views

Journal Article Proceedings - International Conference on Data Engineering · December 2, 2003 We tackle the problem of maintaining materialized top-k views in this paper. Top-k queries, including MIN and MAX as important special cases, occur frequently in common database workloads. A top-k view can be materialized to improve query performance, but ... Cite

Incremental computation and maintenance of temporal aggregates

Journal Article VLDB Journal · October 1, 2003 We consider the problems of computing aggregation queries in temporal databases and of maintaining materialized temporal aggregate views efficiently. The latter problem is particularly challenging since a single data update can cause aggregate results to c ... Full text Cite

I/O-efficient structures for orthogonal range-max and stabbing-max queries

Journal Article Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2003 We develop several linear or near-linear space and I/O-efficient dynamic data structures for orthogonal range-max queries and stabbing-max queries. Given a set of N weighted points in ℝ^d, the range-max problem asks for the maximum-weight point in a query h ... Full text Cite

Recent Progress on Selected Topics in Database Research - A Report by Nine Young Chinese Researchers Working in the United States

Journal Article Journal of Computer Science and Technology · January 1, 2003 The study on database technologies, or more generally, the technologies of data and information management, is an important and active research field. Recently, many exciting results have been reported. In this fast growing field, Chinese researchers play ... Full text Cite

TupleRank and implicit relationship discovery in relational databases

Journal Article Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2003 Google's successful PageRank brings to the Web an order that well reflects the relative importance of Web pages. Inspired by PageRank, we propose a similar scheme called TupleRank for ranking tuples in a relational database. Database tuples naturally relat ... Full text Cite

Incremental computation and maintenance of temporal aggregates

Journal Article Proceedings - International Conference on Data Engineering · January 1, 2001 We consider the problems of computing aggregation queries in temporal databases, and of maintaining materialized temporal aggregate views efficiently. The latter problem is particularly challenging since a single data update can cause aggregate results to ... Cite

Performance issues in incremental warehouse maintenance

Conference Proceedings of the 26th International Conference on Very Large Data Bases, VLDB'00 · December 1, 2000 A well-known challenge in data warehousing is the efficient incremental maintenance of warehouse data in the presence of source data updates. In this paper, we identify several critical data representation and algorithmic choices that must be made when dev ... Cite

Temporal view self-maintenance

Journal Article Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2000 View self-maintenance refers to maintaining materialized views without accessing base data. Self-maintenance is particularly useful in data warehousing settings, where base data comes from sources that may be inaccessible. Self-maintenance has been studied ... Full text Cite

TIP: A Temporal Extension to Informix

Journal Article SIGMOD Record · January 1, 2000 Commercial relational database systems today provide only limited temporal support. To address the needs of applications requiring rich temporal data and queries, we have built TIP (Temporal Information Processor), a temporal extension to the Informix data ... Full text Cite

Maintaining temporal views over non-temporal information sources for data warehousing

Journal Article Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 1998 An important use of data warehousing is to provide temporal views over the history of source data that may itself be non-temporal. While recent work in view maintenance is applicable to data warehousing, only non-temporal views have been considered. In thi ... Full text Cite