Large-scale evaluation of automated clinical note de-identification and its impact on information extraction.

Publication, Journal Article
Deleger, L; Molnar, K; Savova, G; Xia, F; Lingren, T; Li, Q; Marsolo, K; Jegga, A; Kaiser, M; Stoutenborough, L; Solti, I
Published in: J Am Med Inform Assoc
January 1, 2013

OBJECTIVE: (1) To evaluate a state-of-the-art natural language processing (NLP)-based approach to automatically de-identify a large set of diverse clinical notes. (2) To measure the impact of de-identification on the performance of information extraction algorithms on the de-identified documents.

MATERIAL AND METHODS: A cross-sectional study that included 3503 stratified, randomly selected clinical notes (over 22 note types) from five million documents produced at one of the largest US pediatric hospitals. Sensitivity, precision, F value of two automated de-identification systems for removing all 18 HIPAA-defined protected health information elements were computed. Performance was assessed against a manually generated 'gold standard'. Statistical significance was tested. The automated de-identification performance was also compared with that of two humans on a 10% subsample of the gold standard. The effect of de-identification on the performance of subsequent medication extraction was measured.

RESULTS: The gold standard included 30 815 protected health information elements and more than one million tokens. The most accurate NLP method had 91.92% sensitivity (R) and 95.08% precision (P) overall. The performance of the system was indistinguishable from that of human annotators (annotators' performance was 92.15%(R)/93.95%(P) and 94.55%(R)/88.45%(P) overall while the best system obtained 92.91%(R)/95.73%(P) on same text). The impact of automated de-identification was minimal on the utility of the narrative notes for subsequent information extraction as measured by the sensitivity and precision of medication name extraction.

DISCUSSION AND CONCLUSION: NLP-based de-identification shows excellent performance that rivals the performance of human annotators. Furthermore, unlike manual de-identification, the automated approach scales up to millions of documents quickly and inexpensively.
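The abstract reports sensitivity (recall, R) and precision (P) but does not state how the F value is defined. As a minimal sketch, assuming the F value is the usual balanced F-measure (the harmonic mean of P and R), the metrics could be computed as below. The function names and the example counts are hypothetical and are not taken from the paper; only the 91.92% and 95.08% figures come from the abstract.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall (sensitivity) from true-positive,
    false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the balanced harmonic mean.
    Assumption: the abstract's 'F value' is this balanced form."""
    denom = beta**2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denom


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    p, r = precision_recall(tp=90, fp=10, fn=30)
    print(f"P={p:.4f} R={r:.4f} F={f_measure(p, r):.4f}")

    # Overall figures reported in the abstract for the best NLP method.
    print(f"F from reported P/R: {f_measure(0.9508, 0.9192):.4f}")
```

Under this assumption, the F value implied by the reported overall precision and sensitivity falls between the two, slightly closer to the lower figure, as expected for a harmonic mean.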

Published In

J Am Med Inform Assoc

DOI

10.1136/amiajnl-2012-001012

EISSN

1527-974X

Publication Date

January 1, 2013

Volume

20

Issue

1

Start / End Page

84 / 94

Location

England

Related Subject Headings

  • United States
  • Technology Assessment, Biomedical
  • Reproducibility of Results
  • Observer Variation
  • Natural Language Processing
  • Medical Informatics
  • Information Dissemination
  • Humans
  • Hospitals, Pediatric
  • Electronic Health Records
 

Citation

APA: Deleger, L., Molnar, K., Savova, G., Xia, F., Lingren, T., Li, Q., … Solti, I. (2013). Large-scale evaluation of automated clinical note de-identification and its impact on information extraction. J Am Med Inform Assoc, 20(1), 84–94. https://doi.org/10.1136/amiajnl-2012-001012
Chicago: Deleger, Louise, Katalin Molnar, Guergana Savova, Fei Xia, Todd Lingren, Qi Li, Keith Marsolo, et al. “Large-scale evaluation of automated clinical note de-identification and its impact on information extraction.” J Am Med Inform Assoc 20, no. 1 (January 1, 2013): 84–94. https://doi.org/10.1136/amiajnl-2012-001012.
ICMJE: Deleger L, Molnar K, Savova G, Xia F, Lingren T, Li Q, et al. Large-scale evaluation of automated clinical note de-identification and its impact on information extraction. J Am Med Inform Assoc. 2013 Jan 1;20(1):84–94.
MLA: Deleger, Louise, et al. “Large-scale evaluation of automated clinical note de-identification and its impact on information extraction.” J Am Med Inform Assoc, vol. 20, no. 1, Jan. 2013, pp. 84–94. PubMed, doi:10.1136/amiajnl-2012-001012.
NLM: Deleger L, Molnar K, Savova G, Xia F, Lingren T, Li Q, Marsolo K, Jegga A, Kaiser M, Stoutenborough L, Solti I. Large-scale evaluation of automated clinical note de-identification and its impact on information extraction. J Am Med Inform Assoc. 2013 Jan 1;20(1):84–94.