Scholars@Duke publication: “What is Different Between These Datasets?” A Framework for Explaining Data Distribution Shifts

Publication, Journal Article

Babbar, V; Guo, Z; Rudin, C

Published in: Journal of Machine Learning Research

January 1, 2025

The performance of machine learning models relies heavily on the quality of input data, yet real-world applications often face significant data-related challenges. A common issue arises when curating training data or deploying models: two datasets from the same domain may exhibit differing distributions. While many techniques exist for detecting such distribution shifts, there is a lack of comprehensive methods to explain these differences in a human-understandable way beyond opaque quantitative metrics. To bridge this gap, we propose a versatile framework of interpretable methods for comparing datasets. Using a variety of case studies, we demonstrate the effectiveness of our approach across diverse data modalities (tabular data, text data, images, and time-series signals) in both low- and high-dimensional settings. These methods complement existing techniques by providing actionable and interpretable insights to better understand and address distribution shifts.