Entity resolution with empirically motivated priors

Publication, Journal Article
Steorts, RC
Published in: Bayesian Analysis
January 1, 2015

Databases often contain corrupted, degraded, and noisy data with duplicate entries across and within each database. Such problems arise in citations, medical databases, genetics, human rights databases, and a variety of other applied settings. The target of statistical inference can be viewed as an unsupervised problem of determining the edges of a bipartite graph that links the observed records to unobserved latent entities. Bayesian approaches are attractive because they naturally provide uncertainty quantification via posterior probabilities. We propose a novel record linkage approach based on empirical Bayesian principles. Specifically, the empirical Bayesian-type step consists of taking the empirical distribution function of the data as the prior for the latent entities. This approach improves on the earlier hierarchical Bayesian (HB) approach not only by avoiding the prior specification problem but also by allowing both categorical and string-valued variables. Our extension to string-valued variables also involves the proposal of a new probabilistic mechanism by which observed record values for string fields can deviate from the values of their associated latent entities. Categorical fields that deviate from their corresponding true value are simply drawn from the empirical distribution function. We apply our proposed methodology to a simulated data set of German names and an Italian household survey on income and wealth, showing that our method performs favorably compared to several standard methods in the literature. We also consider the robustness of our methods to changes in the hyperparameters.
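
As a rough illustration of the categorical part of the model described in the abstract, the following Python sketch simulates records for a single categorical field: latent entity values are drawn from the empirical distribution of an observed column, and a record either copies its linked entity's value or, if distorted, is redrawn from that same empirical distribution. The function names, the distortion probability `beta`, and the uniform assignment of records to entities are assumptions made for this example, not details taken from the paper, and the sketch covers only simulation from the model, not posterior inference or the string-field mechanism.

```python
import random
from collections import Counter


def empirical_distribution(values):
    """Return the empirical distribution of a column as {value: frequency}."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}


def sample_from(dist, rng):
    """Draw one value from a {value: probability} mapping."""
    values = list(dist)
    weights = [dist[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def simulate_records(observed_column, n_entities, n_records, beta, rng):
    """Simulate one categorical field (hypothetical helper for illustration).

    beta is an assumed per-record distortion probability.
    """
    dist = empirical_distribution(observed_column)
    # Prior step: latent entity values come from the empirical distribution.
    latent = [sample_from(dist, rng) for _ in range(n_entities)]
    # Assumption: each record links to an entity uniformly at random.
    links = [rng.randrange(n_entities) for _ in range(n_records)]
    records = []
    for j in links:
        if rng.random() < beta:
            # Distorted field: redrawn from the empirical distribution.
            records.append(sample_from(dist, rng))
        else:
            # Undistorted field: copies the latent entity's value.
            records.append(latent[j])
    return latent, links, records


if __name__ == "__main__":
    rng = random.Random(0)
    column = ["Rome", "Rome", "Milan", "Turin"]
    latent, links, records = simulate_records(column, 3, 8, 0.2, rng)
    print(latent, links, records)
```

With `beta` set to 0 every record reproduces its entity's value exactly, so duplicates are identical copies; larger values of `beta` make more records disagree with the entity they are linked to, which is the situation the linkage structure has to be inferred under.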

Published In

Bayesian Analysis

DOI

10.1214/15-BA965SI

EISSN

1931-6690

ISSN

1936-0975

Publication Date

January 1, 2015

Volume

10

Issue

4

Start / End Page

849 / 875

Related Subject Headings

  • Statistics & Probability
  • 4905 Statistics
  • 0104 Statistics
 

Citation

APA: Steorts, R. C. (2015). Entity resolution with empirically motivated priors. Bayesian Analysis, 10(4), 849–875. https://doi.org/10.1214/15-BA965SI
Chicago: Steorts, R. C. “Entity resolution with empirically motivated priors.” Bayesian Analysis 10, no. 4 (January 1, 2015): 849–75. https://doi.org/10.1214/15-BA965SI.
ICMJE: Steorts RC. Entity resolution with empirically motivated priors. Bayesian Analysis. 2015 Jan 1;10(4):849–75.
MLA: Steorts, R. C. “Entity resolution with empirically motivated priors.” Bayesian Analysis, vol. 10, no. 4, Jan. 2015, pp. 849–75. Scopus, doi:10.1214/15-BA965SI.
NLM: Steorts RC. Entity resolution with empirically motivated priors. Bayesian Analysis. 2015 Jan 1;10(4):849–875.
