Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C).

Publication, Journal Article
Thomas, JA; Foraker, RE; Zamstein, N; Morrow, JD; Payne, PRO; Wilcox, AB; N3C Consortium
Published in: J Am Med Inform Assoc
July 12, 2022

OBJECTIVE: This study sought to evaluate whether synthetic data derived from a national coronavirus disease 2019 (COVID-19) dataset could be used for geospatial and temporal epidemic analyses.

MATERIALS AND METHODS: Using an original dataset (n = 1 854 968 severe acute respiratory syndrome coronavirus 2 tests) and its synthetic derivative, we compared key indicators of COVID-19 community spread through analysis of aggregate and zip code-level epidemic curves, patient characteristics and outcomes, distribution of tests by zip code, and indicator counts stratified by month and zip code. Similarity between the two datasets was evaluated statistically and qualitatively.

RESULTS: In general, synthetic data closely matched original data for epidemic curves, patient characteristics, and outcomes. Synthetic data suppressed labels of zip codes with few total tests (mean = 2.9 ± 2.4; max = 16 tests; 66% reduction of unique zip codes). Epidemic curves were similar between synthetic and original data in a random sample of the most tested zip codes (top 1%; n = 171), and monthly indicator counts were similar for all unsuppressed zip codes (n = 5819). In small sample sizes, synthetic data utility was notably decreased.

DISCUSSION: Analyses at the population level and of densely tested zip codes (which contained most of the data) were similar between the original and synthetically derived datasets. Analyses of sparsely tested populations were less similar and had more data suppression.

CONCLUSION: In general, synthetic data were successfully used to analyze geospatial and temporal trends. Analyses using small sample sizes or populations were limited, in part due to purposeful data label suppression, an attribute disclosure countermeasure. Users should consider data fitness for use in these cases.
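The abstract does not specify the statistical tests used, so the following is only a minimal illustrative sketch of the kind of comparison described: counting positive tests per zip code and month in each dataset, measuring how many zip codes disappear from the synthetic data, and averaging the absolute count difference over the zip codes present in both. The record layout `(zip_code, date, is_positive)`, the function names, and both summary metrics are assumptions made for illustration, not the authors' method.

```python
from collections import Counter

def monthly_positive_counts(records):
    """Count positive tests per (zip_code, 'YYYY-MM') cell.

    Each record is a hypothetical (zip_code, 'YYYY-MM-DD', is_positive) tuple.
    """
    counts = Counter()
    for zip_code, date, is_positive in records:
        if is_positive:
            counts[(zip_code, date[:7])] += 1
    return counts

def compare_datasets(original, synthetic):
    """Return two illustrative similarity summaries (assumed metrics).

    suppressed_zip_fraction: share of original zip codes absent from synthetic data.
    mean_abs_diff: mean absolute difference in monthly positive counts,
                   taken over zip codes present in both datasets.
    """
    orig_zips = {rec[0] for rec in original}
    synth_zips = {rec[0] for rec in synthetic}
    shared_zips = orig_zips & synth_zips

    suppressed_fraction = (
        len(orig_zips - synth_zips) / len(orig_zips) if orig_zips else 0.0
    )

    orig_counts = monthly_positive_counts(original)
    synth_counts = monthly_positive_counts(synthetic)
    # Compare every zip-month cell seen in either dataset for a shared zip code;
    # a Counter returns 0 for cells missing from one side.
    cells = {k for k in (orig_counts.keys() | synth_counts.keys()) if k[0] in shared_zips}
    mean_abs_diff = (
        sum(abs(orig_counts[c] - synth_counts[c]) for c in cells) / len(cells)
        if cells else 0.0
    )
    return {
        "suppressed_zip_fraction": suppressed_fraction,
        "mean_abs_diff": mean_abs_diff,
    }
```

Under these assumptions, a high `suppressed_zip_fraction` alongside a low `mean_abs_diff` would mirror the pattern the abstract reports: good agreement where zip codes survive, with sparsely tested zip codes dropped from the synthetic data.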

Published In

J Am Med Inform Assoc

DOI

10.1093/jamia/ocac045

EISSN

1527-974X

Publication Date

July 12, 2022

Volume

29

Issue

8

Start / End Page

1350 / 1365

Location

England

Related Subject Headings

  • United States
  • SARS-CoV-2
  • Medical Informatics
  • Humans
  • Cohort Studies
  • COVID-19
  • 46 Information and computing sciences
  • 42 Health sciences
  • 32 Biomedical and clinical sciences
  • 11 Medical and Health Sciences
 

Citation

APA: Thomas, J. A., Foraker, R. E., Zamstein, N., Morrow, J. D., Payne, P. R. O., Wilcox, A. B., & N3C Consortium. (2022). Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C). J Am Med Inform Assoc, 29(8), 1350–1365. https://doi.org/10.1093/jamia/ocac045
Chicago: Thomas, Jason A., Randi E. Foraker, Noa Zamstein, Jon D. Morrow, Philip R. O. Payne, Adam B. Wilcox, and N3C Consortium. “Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C).” J Am Med Inform Assoc 29, no. 8 (July 12, 2022): 1350–65. https://doi.org/10.1093/jamia/ocac045.