Mining the most general multidimensional summarization of "probable groups" in data warehouses
Data summarization is an important data analysis task in data warehousing and online analytical processing. In this paper, we consider a novel type of summarization query, the probable group query, such as "What are the groups of patients whose chance of getting lung cancer is at least 50% higher than the average?" An aggregate cell satisfying the requirement is called a probable group. To make the answer succinct and effective, we propose that only the most general probable groups should be mined. For example, if both groups (smoking, drinking) and (smoking, *) are probable, then the former group should not be returned, since the latter already covers it. The problem of mining the most general probable groups is challenging since the probable groups can be widely scattered in the cube lattice and exhibit no monotonicity with respect to the group containment order. We extend the state-of-the-art BUC algorithm to tackle the problem, and develop techniques and heuristics to speed up the search. An extensive performance study is reported to illustrate the effectiveness of our approach.
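
To make the notion of a most general probable group concrete, the following is a minimal brute-force sketch in Python. It is not the BUC extension described above: it simply enumerates every cell of a tiny cube. The function names, the toy data in ROWS, and the reading of the query as "rate at least 1.5 times the overall average" are all assumptions made for illustration.

```python
from itertools import product

STAR = "*"  # wildcard: the dimension is aggregated away


def cell_rate(rows, cell):
    """Fraction of positive outcomes among rows matching `cell` (None if no row matches)."""
    matched = [outcome for dims, outcome in rows
               if all(c == STAR or c == d for c, d in zip(cell, dims))]
    return sum(matched) / len(matched) if matched else None


def is_ancestor(general, specific):
    """True if `general` strictly generalizes `specific` (same values or wildcards)."""
    return general != specific and all(
        g == STAR or g == s for g, s in zip(general, specific))


def most_general_probable_groups(rows, ratio=1.5):
    """Exhaustively find probable cells, then keep those with no probable ancestor.

    Assumption: a cell is "probable" when its rate is at least `ratio` times
    the overall average rate.
    """
    overall = sum(outcome for _, outcome in rows) / len(rows)
    if overall == 0:
        return []  # no positive outcomes, so nothing can exceed the average
    n_dims = len(rows[0][0])
    domains = [sorted({dims[i] for dims, _ in rows}) + [STAR] for i in range(n_dims)]
    probable = []
    for cell in product(*domains):
        rate = cell_rate(rows, cell)
        if rate is not None and rate >= ratio * overall:
            probable.append(cell)
    return [c for c in probable if not any(is_ancestor(p, c) for p in probable)]


# Hypothetical toy data: ((smoking, drinking), has_lung_cancer)
ROWS = [
    (("smoker", "drinker"), 1),
    (("smoker", "drinker"), 1),
    (("smoker", "non-drinker"), 1),
    (("smoker", "non-drinker"), 0),
    (("non-smoker", "drinker"), 0),
    (("non-smoker", "drinker"), 0),
    (("non-smoker", "non-drinker"), 0),
    (("non-smoker", "non-drinker"), 0),
]

print(most_general_probable_groups(ROWS))
```

On this toy data the overall rate is 3/8, so the threshold is 0.5625. Both (smoker, drinker) and (smoker, *) are probable, and only (smoker, *) is returned, mirroring the example above. Exhaustive enumeration grows exponentially with the number of dimensions, which is why a pruned search such as the one the paper builds on BUC is needed in practice.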