Suppressing model overfitting in mining concept-drifting data streams
Mining data streams whose class distributions change over time is important for real-time business decision support. The stream classifier must evolve to reflect the current class distribution, which poses a serious challenge. On the one hand, relying on historical data increases the chance of learning obsolete models. On the other hand, learning only from the latest data may yield biased classifiers, because the latest data is often an unrepresentative sample of the current class distribution. The problem is particularly acute when classifying rare events: instances of the rare class may not appear at all in the most recent training data. In this paper, we use a stochastic model to describe concept-shifting patterns and formulate the task as an optimization problem: given the historical and current training data observed so far, find the most likely current distribution and learn a classifier from it. We derive an analytic solution and approximate it with an efficient algorithm that carefully calibrates the influence of historical data to produce an accurate classifier. We evaluate the algorithm on both synthetic and real-world datasets. The results show that it classifies accurately and efficiently. Copyright 2006 ACM.
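
The trade-off between historical and recent data can be illustrated with a toy estimate of the rare-class rate. The sketch below is not the analytic solution or the algorithm of this paper; it is a simple exponentially decayed count, and the function name `blended_class_rate`, the `decay` parameter, and the example counts are all assumptions made for illustration.

```python
def blended_class_rate(history, current, decay=0.5):
    """Estimate the current rare-class rate from (positives, total) counts.

    history: list of (positives, total) per past chunk, oldest first.
    current: (positives, total) for the most recent chunk.
    decay:   hypothetical weight in (0, 1]; each step back in time
             multiplies a chunk's influence by this factor.
    """
    pos, total = float(current[0]), float(current[1])
    weight = decay
    # Walk backwards in time so that older chunks receive smaller weights.
    for chunk_pos, chunk_total in reversed(history):
        pos += weight * chunk_pos
        total += weight * chunk_total
        weight *= decay
    return pos / total if total > 0 else 0.0


# Illustrative counts only (not data from the paper).
history = [(5, 100), (4, 100), (6, 100)]
current = (0, 20)  # the rare class is absent from the latest chunk

latest_only = current[0] / current[1]
blended = blended_class_rate(history, current, decay=0.5)
print(f"latest chunk only: {latest_only:.3f}")
print(f"decayed blend:     {blended:.3f}")
```

Using only the latest chunk, the estimated rare-class rate is zero, so a classifier trained on that chunk alone would never see the rare class. The decayed blend keeps a nonzero estimate, at the cost of a fixed `decay` that must be chosen by hand; the approach described in the abstract instead calibrates the influence of historical data from a stochastic model of concept shifts.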