Aggregate queries on probabilistic record linkages
Record linkage analysis, which matches records that refer to the same real-world entities across different data sets, is an important task in data integration. Record linkages are often uncertain because the data are incomplete or ambiguous. Fortunately, state-of-the-art probabilistic record linkage methods can compute the probability that two records refer to the same entity. In this paper, we study a novel problem: aggregate queries on probabilistic record linkages, such as counting the number of matched records. We address several fundamental issues. First, we advocate that the answer to an aggregate query on probabilistic record linkages is a probability distribution over possible answers derived from possible worlds. Second, we identify the category of compatible linkages, the only category on which the answers to aggregate queries can be determined properly when the probabilities of individual linkages are available but the joint distributions of multiple linkages are unavailable. Third, we give a quadratic exact algorithm and two approximation algorithms to answer aggregate queries. © 2012 ACM.
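To make concrete the idea that a count query returns a distribution rather than a single number, the following is a minimal sketch. It is not the algorithm of this paper: it assumes, purely for illustration, that the linkages are mutually independent, which the abstract does not claim. The function name `count_distribution` and the parameter `link_probs` are hypothetical. Under that assumption, each possible world either includes or excludes each linkage, and a simple dynamic program accumulates the probability of every possible count in time quadratic in the number of linkages.

```python
from typing import List


def count_distribution(link_probs: List[float]) -> List[float]:
    """Return dist, where dist[k] is the probability that exactly k linkages hold.

    ASSUMPTION (illustrative only, not taken from the paper): the linkages are
    mutually independent, so each one holds with its own probability regardless
    of the others.
    """
    dist = [1.0]  # with no linkages considered, the count is 0 with probability 1
    for p in link_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # worlds in which this linkage does not hold
            nxt[k + 1] += mass * p      # worlds in which this linkage holds
        dist = nxt
    return dist


if __name__ == "__main__":
    # Three hypothetical linkages with made-up match probabilities.
    for k, prob in enumerate(count_distribution([0.9, 0.5, 0.2])):
        print(f"P(count = {k}) = {prob:.3f}")
```

When linkages share records and are therefore not independent, this sketch does not apply directly; that is the situation the notion of compatible linkages is meant to address.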