Automating entity matching model development
This paper seeks to answer one important but unexplored question for Entity Matching (EM): can we automatically develop a good machine learning pipeline for the EM task? If so, to what extent can the process be automated? We find that a general-purpose AutoML tool cannot be directly applied to an EM problem, and thus propose AutoML-EM, an automated model pipeline development solution tailored for EM. In practice, however, another bottleneck for EM is insufficient labeled data. To mitigate this issue, active learning based solutions are widely adopted. Under this setting, we propose AutoML-EM-Active and investigate how to maximize the benefit of AutoML-EM with automatic data labeling. We provide fundamental insights into our solutions and conduct extensive experiments to examine their performance on benchmark datasets. The results suggest that AutoML-EM not only avoids human involvement in the model development process but also reaches or exceeds state-of-the-art EM performance, and that AutoML-EM-Active effectively improves model performance under the active learning setting.