Entropy-Based Approach to Efficient Cleaning of Big Data in Hierarchical Databases
When databases are at risk of containing erroneous, redundant, or obsolete data, a cleaning procedure is used to detect, correct, or remove such undesirable records. We propose a methodology for improving data cleaning efficiency in a large hierarchical database. The methodology relies on Shannon's information entropy to measure the amount of information stored in databases. This approach, which builds on previously gathered statistical data regarding the prevalence of errors in the database, enables the decision maker to determine which components of the database are likely to have undergone more information loss, and thus to prioritize those components for cleaning. In particular, in cases where the cleaning process is iterative (from the root node down), the entropic approach produces a scientifically motivated stopping rule that determines the optimal (i.e., minimally required) number of tiers in the hierarchical database that need to be examined. This stopping rule defines a more streamlined representation of the database, in which less informative tiers are eliminated.
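The abstract does not spell out the computation, but a minimal sketch can illustrate the core idea: compute the Shannon entropy of each tier's value distribution, walk the hierarchy from the root down, and stop once a tier's entropy falls below a threshold. The tier representation, the `threshold` parameter, and the per-tier entropy formulation below are illustrative assumptions, not the paper's exact method.

```python
# Sketch: entropy-based stopping rule for tier-by-tier cleaning.
# Assumes the hierarchical database is flattened into a list of tiers
# (root tier first), each tier being a list of record values.
from collections import Counter
from math import log2

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical value distribution."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def tiers_to_clean(tiers, threshold=0.5):
    """Walk tiers from the root down; stop once a tier's entropy drops
    below `threshold`, i.e. once deeper tiers carry too little
    information to justify further cleaning effort."""
    selected = []
    for depth, records in enumerate(tiers):
        h = shannon_entropy(records)
        if h < threshold:
            break  # stopping rule: this and deeper tiers are less informative
        selected.append((depth, h))
    return selected

# Example: three tiers, the deepest being nearly constant (low entropy).
tiers = [
    ["A", "B", "C", "A"],      # root tier: varied values (1.5 bits)
    ["x", "x", "y", "z"],      # middle tier (1.5 bits)
    ["same", "same", "same"],  # leaf tier: no information (0 bits)
]
for depth, h in tiers_to_clean(tiers):
    print(f"clean tier {depth}: entropy = {h:.2f} bits")
```

On this toy input the rule selects the first two tiers and discards the constant leaf tier, mirroring the paper's notion of a streamlined representation in which less informative tiers are eliminated.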