Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research.

Publication, Journal Article
Deleger, L; Lingren, T; Ni, Y; Kaiser, M; Stoutenborough, L; Marsolo, K; Kouril, M; Molnar, K; Solti, I
Published in: J Biomed Inform
August 2014

OBJECTIVE: The current study aims to fill the gap in available healthcare de-identification resources by creating a new sharable dataset with realistic Protected Health Information (PHI) without reducing the value of the data for de-identification research. By releasing the annotated gold standard corpus under a Data Use Agreement, we would like to encourage other computational linguists to experiment with our data and develop new machine learning models for de-identification. This paper describes: (1) the modifications required by the Institutional Review Board before sharing the de-identification gold standard corpus; (2) our efforts to keep the PHI as realistic as possible; and (3) the tests showing the effectiveness of these efforts in preserving the value of the modified data set for machine learning model development.

MATERIALS AND METHODS: In a previous study we built an original de-identification gold standard corpus annotated with true PHI from 3503 randomly selected clinical notes covering the 22 most frequent clinical note types of our institution. In the current study we modified the original gold standard corpus to make it suitable for external sharing by replacing HIPAA-specified PHI with newly generated realistic PHI. Finally, we evaluated the research value of this new dataset by comparing the performance of an existing published in-house de-identification system when trained on the new de-identification gold standard corpus with the performance of the same system when trained on the original corpus. We assessed the potential benefits of using the new de-identification gold standard corpus to identify PHI in the i2b2 and PhysioNet datasets that were released by other groups for de-identification research. We also measured the effectiveness of the i2b2 and PhysioNet de-identification gold standard corpora in identifying PHI in our original clinical notes.

RESULTS: Performance of the de-identification system using the new gold standard corpus as a training set was very close to that obtained when training on the original corpus (92.56 vs. 93.48 overall F-measures). The best i2b2/PhysioNet/CCHMC cross-training performance was obtained when training on the new shared CCHMC gold standard corpus, although performance was still lower than with corpus-specific training.

DISCUSSION AND CONCLUSION: We successfully modified a de-identification dataset for external sharing while preserving the de-identification research value of the modified gold standard corpus, with a limited drop in machine learning de-identification performance.
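
The F-measures reported in the results combine precision and recall into a single score. The following is a minimal sketch of that computation, assuming exact-match comparison of annotated spans and the balanced (F1) form of the measure; the abstract does not state the matching criterion used in the study, and the `Span` type, the `f_measure` function, and the sample spans are hypothetical illustrations rather than the authors' evaluation code.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start offset, end offset, PHI type).
Span = Tuple[int, int, str]


def f_measure(gold: Set[Span], predicted: Set[Span]) -> float:
    """Balanced F-measure (F1) over exact-match spans.

    Assumption: a predicted span counts as correct only if its offsets
    and PHI type both match a gold annotation exactly.
    """
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Made-up annotations for a single note (not data from the study).
gold = {(0, 8, "NAME"), (15, 25, "DATE"), (40, 52, "PHONE")}
predicted = {(0, 8, "NAME"), (15, 25, "DATE"), (60, 70, "NAME")}
score = f_measure(gold, predicted)
```

Under this definition, a system trained on the modified corpus and one trained on the original corpus would each be scored against the same gold annotations, so the two F-measures are directly comparable.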


Published In

J Biomed Inform

DOI

10.1016/j.jbi.2014.01.014

EISSN

1532-0480

Publication Date

August 2014

Volume

50

Start / End Page

173 / 183

Location

United States

Related Subject Headings

  • United States
  • Medical Informatics
  • Health Insurance Portability and Accountability Act
  • Electronic Health Records
  • Computer Security
  • Biomedical Engineering
  • 4601 Applied computing
  • 4203 Health services and systems
  • 11 Medical and Health Sciences
 

Citation

APA: Deleger, L., Lingren, T., Ni, Y., Kaiser, M., Stoutenborough, L., Marsolo, K., … Solti, I. (2014). Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research. J Biomed Inform, 50, 173–183. https://doi.org/10.1016/j.jbi.2014.01.014
Chicago: Deleger, Louise, Todd Lingren, Yizhao Ni, Megan Kaiser, Laura Stoutenborough, Keith Marsolo, Michal Kouril, Katalin Molnar, and Imre Solti. “Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research.” J Biomed Inform 50 (August 2014): 173–83. https://doi.org/10.1016/j.jbi.2014.01.014.
ICMJE: Deleger L, Lingren T, Ni Y, Kaiser M, Stoutenborough L, Marsolo K, et al. Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research. J Biomed Inform. 2014 Aug;50:173–83.
MLA: Deleger, Louise, et al. “Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research.” J Biomed Inform, vol. 50, Aug. 2014, pp. 173–83. Pubmed, doi:10.1016/j.jbi.2014.01.014.
NLM: Deleger L, Lingren T, Ni Y, Kaiser M, Stoutenborough L, Marsolo K, Kouril M, Molnar K, Solti I. Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research. J Biomed Inform. 2014 Aug;50:173–183.