A Bayesian Approach to Graphical Record Linkage and Deduplication

Publication, Journal Article
Steorts, RC; Hall, R; Fienberg, SE
Published in: Journal of the American Statistical Association
October 1, 2016

We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent them visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture–recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Supplementary materials for this article are available online.
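
The bipartite representation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only assumes that each posterior draw of the linkage structure is available as a mapping from record identifiers to latent-individual identifiers. The function name `pairwise_link_probabilities`, the record labels, and the three draws are hypothetical and chosen for illustration. Because two records are linked only through a shared latent individual, the resulting links are transitive by construction, and the pairwise linkage probability can be estimated as the fraction of draws in which two records share an individual.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_link_probabilities(samples):
    """Estimate how often each pair of records shares a latent individual.

    `samples` is a list of linkage structures; each one is a dict mapping
    a record id to the id of the latent individual it is linked to.
    (Hypothetical input format, assumed for this sketch.)
    """
    counts = defaultdict(int)
    for linkage in samples:
        # Group records by the latent individual they point to.
        clusters = defaultdict(list)
        for record, individual in linkage.items():
            clusters[individual].append(record)
        # Every pair inside a cluster is linked in this draw.
        for members in clusters.values():
            for a, b in combinations(sorted(members), 2):
                counts[(a, b)] += 1
    n = len(samples)
    return {pair: c / n for pair, c in counts.items()}


# Three made-up posterior draws over four records from two files (A and B).
samples = [
    {"A1": 0, "A2": 1, "B1": 0, "B2": 2},
    {"A1": 0, "A2": 1, "B1": 0, "B2": 1},
    {"A1": 0, "A2": 1, "B1": 2, "B2": 1},
]
probs = pairwise_link_probabilities(samples)
for pair, p in sorted(probs.items()):
    print(pair, round(p, 3))
```

Pairs that never share a latent individual in any draw are simply absent from the result, which keeps the output sparse when most records are unlinked.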

Published In

Journal of the American Statistical Association

DOI

10.1080/01621459.2015.1105807

EISSN

1537-274X

ISSN

0162-1459

Publication Date

October 1, 2016

Volume

111

Issue

516

Start / End Page

1660 / 1672

Related Subject Headings

  • Statistics & Probability
  • 4905 Statistics
  • 3802 Econometrics
  • 1603 Demography
  • 1403 Econometrics
  • 0104 Statistics
 

Citation

APA: Steorts, R. C., Hall, R., & Fienberg, S. E. (2016). A Bayesian Approach to Graphical Record Linkage and Deduplication. Journal of the American Statistical Association, 111(516), 1660–1672. https://doi.org/10.1080/01621459.2015.1105807
Chicago: Steorts, R. C., R. Hall, and S. E. Fienberg. “A Bayesian Approach to Graphical Record Linkage and Deduplication.” Journal of the American Statistical Association 111, no. 516 (October 1, 2016): 1660–72. https://doi.org/10.1080/01621459.2015.1105807.
ICMJE: Steorts RC, Hall R, Fienberg SE. A Bayesian Approach to Graphical Record Linkage and Deduplication. Journal of the American Statistical Association. 2016 Oct 1;111(516):1660–72.
MLA: Steorts, R. C., et al. “A Bayesian Approach to Graphical Record Linkage and Deduplication.” Journal of the American Statistical Association, vol. 111, no. 516, Oct. 2016, pp. 1660–72. Scopus, doi:10.1080/01621459.2015.1105807.
NLM: Steorts RC, Hall R, Fienberg SE. A Bayesian Approach to Graphical Record Linkage and Deduplication. Journal of the American Statistical Association. 2016 Oct 1;111(516):1660–1672.
