
“What is Different Between These Datasets?” A Framework for Explaining Data Distribution Shifts

Publication, Journal Article
Babbar, V; Guo, Z; Rudin, C
Published in: Journal of Machine Learning Research
January 1, 2025

The performance of machine learning models relies heavily on the quality of input data, yet real-world applications often face significant data-related challenges. A common issue arises when curating training data or deploying models: two datasets from the same domain may exhibit differing distributions. While many techniques exist for detecting such distribution shifts, there is a lack of comprehensive methods to explain these differences in a human-understandable way beyond opaque quantitative metrics. To bridge this gap, we propose a versatile framework of interpretable methods for comparing datasets. Using a variety of case studies, we demonstrate the effectiveness of our approach across diverse data modalities (tabular data, text data, images, and time-series signals) in both low- and high-dimensional settings. These methods complement existing techniques by providing actionable and interpretable insights to better understand and address distribution shifts.


Published In

Journal of Machine Learning Research

EISSN

1533-7928

ISSN

1532-4435

Publication Date

January 1, 2025

Volume

26

Related Subject Headings

  • Artificial Intelligence & Image Processing
  • 4905 Statistics
  • 4611 Machine learning
  • 17 Psychology and Cognitive Sciences
  • 08 Information and Computing Sciences
 

Citation

APA: Babbar, V., Guo, Z., & Rudin, C. (2025). “What is Different Between These Datasets?” A Framework for Explaining Data Distribution Shifts. Journal of Machine Learning Research, 26.
Chicago: Babbar, V., Z. Guo, and C. Rudin. “‘What is Different Between These Datasets?’ A Framework for Explaining Data Distribution Shifts.” Journal of Machine Learning Research 26 (January 1, 2025).
ICMJE: Babbar V, Guo Z, Rudin C. “What is Different Between These Datasets?” A Framework for Explaining Data Distribution Shifts. Journal of Machine Learning Research. 2025 Jan 1;26.
MLA: Babbar, V., et al. “‘What is Different Between These Datasets?’ A Framework for Explaining Data Distribution Shifts.” Journal of Machine Learning Research, vol. 26, Jan. 2025.
NLM: Babbar V, Guo Z, Rudin C. “What is Different Between These Datasets?” A Framework for Explaining Data Distribution Shifts. Journal of Machine Learning Research. 2025 Jan 1;26.
