Scholars@Duke publication: Optimizing complex extraction programs over evolving text data

Optimizing complex extraction programs over evolving text data

Publication, Journal Article

Chen, F; Gao, BJ; Doan, AH; Yang, J; Ramakrishnan, R

Published in: SIGMOD-PODS'09 - Proceedings of the International Conference on Management of Data and 28th Symposium on Principles of Database Systems

December 4, 2009

Most information extraction (IE) approaches have considered only static text corpora, over which we apply IE only once. Many real-world text corpora, however, are dynamic. They evolve over time, and so to keep extracted information up to date we often must apply IE repeatedly, to consecutive corpus snapshots. Applying IE from scratch to each snapshot can take a lot of time. To avoid doing this, we have recently developed Cyclex, a system that recycles previous IE results to speed up IE over subsequent corpus snapshots. Cyclex clearly demonstrated the promise of the recycling idea. The work itself, however, is limited in that it considers only IE programs that contain a single IE "blackbox." In practice, many IE programs are far more complex, containing multiple IE blackboxes connected in a compositional "workflow." In this paper, we present Delex, a system that removes the above limitation. First, we identify many difficult challenges raised by Delex, including modeling complex IE programs for recycling purposes, implementing the recycling process efficiently, and searching for an optimal execution plan in a vast plan space with different recycling alternatives. Next, we describe our solutions to these challenges. Finally, we describe extensive experiments with both rule-based and learning-based IE programs over two real-world data sets, which demonstrate the utility of our approach. © 2009 ACM.

Duke Scholars

Jun Yang (Author), Computer Science

Published In

SIGMOD-PODS'09 - Proceedings of the International Conference on Management of Data and 28th Symposium on Principles of Database Systems

DOI

10.1145/1559845.1559881

Publication Date

December 4, 2009

Start / End Page

321 / 334

Citation

APA

Chen, F., Gao, B. J., Doan, A. H., Yang, J., & Ramakrishnan, R. (2009). Optimizing complex extraction programs over evolving text data. SIGMOD-PODS’09 - Proceedings of the International Conference on Management of Data and 28th Symposium on Principles of Database Systems, 321–334. https://doi.org/10.1145/1559845.1559881

Chicago

Chen, F., B. J. Gao, A. H. Doan, J. Yang, and R. Ramakrishnan. “Optimizing complex extraction programs over evolving text data.” SIGMOD-PODS’09 - Proceedings of the International Conference on Management of Data and 28th Symposium on Principles of Database Systems, December 4, 2009, 321–34. https://doi.org/10.1145/1559845.1559881.

ICMJE

Chen F, Gao BJ, Doan AH, Yang J, Ramakrishnan R. Optimizing complex extraction programs over evolving text data. SIGMOD-PODS’09 - Proceedings of the International Conference on Management of Data and 28th Symposium on Principles of Database Systems. 2009 Dec 4;321–34.

MLA

Chen, F., et al. “Optimizing complex extraction programs over evolving text data.” SIGMOD-PODS’09 - Proceedings of the International Conference on Management of Data and 28th Symposium on Principles of Database Systems, Dec. 2009, pp. 321–34. Scopus, doi:10.1145/1559845.1559881.
