Provenance-based dictionary refinement in information extraction


Conference Paper

Dictionaries of terms and phrases (e.g. common person or organization names) are integral to information extraction systems that extract structured information from unstructured text. Using noisy or unrefined dictionaries may lead to many incorrect results even when highly precise and sophisticated extraction rules are used. In general, the results of the system are dependent on dictionary entries in arbitrary complex ways, and removal of a set of entries can remove both correct and incorrect results. Further, any such refinement critically requires laborious manual labeling of the results. In this paper, we study the dictionary refinement problem and address the above challenges. Using provenance of the outputs in terms of the dictionary entries, we formalize an optimization problem of maximizing the quality of the system with respect to the refined dictionaries, study complexity of this problem, and give efficient algorithms. We also propose solutions to address incomplete labeling of the results where we estimate the missing labels assuming a statistical model. We conclude with a detailed experimental evaluation using several real-world extractors and competition datasets to validate our solutions. Beyond information extraction, our provenance-based techniques and solutions may find applications in view-maintenance in general relational settings. Copyright © 2013 ACM.

Full Text

Duke Authors

Cited Authors

  • Roy, S; Chiticariu, L; Feldman, V; Reiss, FR; Zhu, H

Published Date

  • July 29, 2013

Published In

Start / End Page

  • 457 - 468

International Standard Serial Number (ISSN)

  • 0730-8078

International Standard Book Number 13 (ISBN-13)

  • 9781450320375

Digital Object Identifier (DOI)

  • 10.1145/2463676.2465284

Citation Source

  • Scopus