Scholars@Duke publication: Cleaning disguised missing data: A heuristic approach

Cleaning disguised missing data: A heuristic approach

Publication, Conference

Hua, M; Pei, J

Published in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

December 14, 2007

In some applications such as filling in a customer information form on the web, some missing values may not be explicitly represented as such, but instead appear as potentially valid data values. Such missing values are known as disguised missing data, which may impair the quality of data analysis severely, such as causing significant biases and misleading results in hypothesis tests, correlation analysis and regressions. The very limited previous studies on cleaning disguised missing data use outlier mining and distribution anomaly detection. They highly rely on domain background knowledge in specific applications and may not work well for the cases where the disguise values are inliers. To tackle the problem of cleaning disguised missing data, in this paper, we first model the distribution of disguised missing data, and propose the embedded unbiased sample heuristic. Then, we develop an effective and efficient method to identify the frequently used disguise values which capture the major body of the disguised missing data. Our method does not require any domain background knowledge to find the suspicious disguise values. We report an empirical evaluation using real data sets, which shows that our method is effective - the frequently used disguise values found by our method match the values identified by the domain experts nicely. Our method is also efficient and scalable for processing large data sets. © 2007 ACM.
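The bias described in the abstract can be illustrated with a small sketch. This is not the authors' method (the embedded unbiased sample heuristic is not specified here in enough detail to reproduce); it only shows how a default form value, treated as real data, can shift a simple statistic. The data, the field name, and the default value of 18 are all hypothetical.

```python
from statistics import mean

# Hypothetical survey column: None marks a respondent who skipped the age field.
true_ages = [34, None, 51, 28, None, 45, None, 39]

# Assumption for illustration: the web form stores a skipped field as the
# first dropdown option (18), so the gap looks like a valid, inlying value.
DEFAULT_AGE = 18
recorded_ages = [DEFAULT_AGE if age is None else age for age in true_ages]

# Mean over respondents who actually answered.
answered_mean = mean(age for age in true_ages if age is not None)

# Mean over the recorded column, where disguised values are counted as real.
recorded_mean = mean(recorded_ages)

# Share of the column taken up by the single most frequent value; an unusually
# frequent value is one naive hint that it may be a disguise.
most_common = max(set(recorded_ages), key=recorded_ages.count)
most_common_share = recorded_ages.count(most_common) / len(recorded_ages)

print(f"mean of answered values: {answered_mean}")
print(f"mean of recorded values: {recorded_mean}")
print(f"most frequent value: {most_common} ({most_common_share:.0%} of rows)")
```

Because 18 is a plausible age, an outlier detector would not flag it, which is the inlier case the abstract says earlier approaches handle poorly.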

Duke Scholars

Author: Jian Pei, Computer Science

Published In

Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

DOI

10.1145/1281192.1281294

Publication Date

December 14, 2007

Start / End Page

950 / 958

Citation

APA: Hua, M., & Pei, J. (2007). Cleaning disguised missing data: A heuristic approach. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 950–958). https://doi.org/10.1145/1281192.1281294

Chicago: Hua, M., and J. Pei. “Cleaning disguised missing data: A heuristic approach.” In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 950–58, 2007. https://doi.org/10.1145/1281192.1281294.

ICMJE: Hua M, Pei J. Cleaning disguised missing data: A heuristic approach. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2007. p. 950–8.

MLA: Hua, M., and J. Pei. “Cleaning disguised missing data: A heuristic approach.” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007, pp. 950–58. Scopus, doi:10.1145/1281192.1281294.

NLM: Hua M, Pei J. Cleaning disguised missing data: A heuristic approach. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2007. p. 950–958.
