Cleaning crowdsourced labels using oracles for statistical classification

Publication, Conference

Dolatshah, M; Teoh, M; Wang, J; Pei, J

Published in: Proceedings of the VLDB Endowment

January 1, 2018

Nowadays, crowdsourcing is being widely used to collect training data for solving classification problems. However, crowdsourced labels are often noisy, and there is a performance gap between classification with noisy labels and classification with ground-truth labels. In this paper, we consider how to apply oracle-based label cleaning to reduce the gap. We propose TARS, a label-cleaning advisor that can provide two pieces of valuable advice for data scientists when they need to train or test a model using noisy labels. Firstly, in the model testing stage, given a test dataset with noisy labels, and a classification model, TARS can use the test data to estimate how well the model will perform w.r.t. ground-truth labels. Secondly, in the model training stage, given a training dataset with noisy labels, and a classification algorithm, TARS can determine which label should be sent to an oracle to clean such that the model can be improved the most. For the first advice, we propose an effective estimation technique, and study how to compute confidence intervals to bound its estimation error. For the second advice, we propose a novel cleaning strategy along with two optimization techniques, and illustrate that it is superior to the existing cleaning strategies. We evaluate TARS on both simulated and real-world datasets. The results show that (1) TARS can use noisy test data to accurately estimate a model's true performance for various evaluation metrics; and (2) TARS can improve the model accuracy by a larger margin than the existing cleaning strategies, for the same cleaning budget.

Duke Scholars

Author: Jian Pei, Computer Science

Published In

Proceedings of the VLDB Endowment

DOI

10.14778/3297753.3297758

EISSN

2150-8097

Publication Date

January 1, 2018

Volume

12

Issue

4

Start / End Page

376 / 389

Related Subject Headings

4605 Data management and data science
0807 Library and Information Studies
0806 Information Systems
0802 Computation Theory and Mathematics

Citation

APA

Dolatshah, M., Teoh, M., Wang, J., & Pei, J. (2018). Cleaning crowdsourced labels using oracles for statistical classification. In Proceedings of the VLDB Endowment (Vol. 12, pp. 376–389). https://doi.org/10.14778/3297753.3297758

Chicago

Dolatshah, M., M. Teoh, J. Wang, and J. Pei. “Cleaning crowdsourced labels using oracles for statistical classification.” In Proceedings of the VLDB Endowment, 12:376–89, 2018. https://doi.org/10.14778/3297753.3297758.

ICMJE

Dolatshah M, Teoh M, Wang J, Pei J. Cleaning crowdsourced labels using oracles for statistical classification. In: Proceedings of the VLDB Endowment. 2018. p. 376–89.

MLA

Dolatshah, M., et al. “Cleaning crowdsourced labels using oracles for statistical classification.” Proceedings of the VLDB Endowment, vol. 12, no. 4, 2018, pp. 376–89. Scopus, doi:10.14778/3297753.3297758.

NLM

Dolatshah M, Teoh M, Wang J, Pei J. Cleaning crowdsourced labels using oracles for statistical classification. Proceedings of the VLDB Endowment. 2018. p. 376–389.
