The IBP compound Dirichlet process and its application to focused topic modeling


Journal Article

The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric mixed membership model: each data point is modeled with a collection of components in different proportions. Though powerful, the HDP assumes that the probability of a component being exhibited by a data point is positively correlated with its proportion within that data point. This assumption can be undesirable. For example, in topic modeling, a topic (component) might be rare throughout the corpus but dominant within those documents (data points) where it occurs. We develop the IBP compound Dirichlet process (ICD), a Bayesian nonparametric prior that decouples across-data prevalence and within-data proportion in a mixed membership model. The ICD combines properties from the HDP and the Indian buffet process (IBP), a Bayesian nonparametric prior on binary matrices. The ICD assigns a subset of the shared mixture components to each data point. This subset, the data point's "focus", is determined independently of the amount that each of its components contributes. We develop an ICD mixture model for text, the focused topic model (FTM), and show superior performance over the HDP-based topic model. Copyright 2010 by the author(s)/owner(s).
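The decoupling described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation. It assumes a finite truncation to a fixed number of topics, a Beta-Bernoulli approximation to the IBP for the binary focus matrix, and Gamma-distributed relative topic weights. The function name, parameter names, and default values are hypothetical and chosen only for illustration.

```python
import numpy as np

def sample_icd_proportions(num_docs, num_topics, alpha=3.0, gamma=2.0, seed=0):
    """Draw per-document topic proportions with a separate on/off focus.

    Assumptions (not taken from the record above): finite truncation,
    Beta(alpha / K, 1) prevalences as a stand-in for the IBP, and
    Gamma(gamma, 1) relative weights.
    """
    rng = np.random.default_rng(seed)

    # Across-data prevalence: how likely each topic is to appear in a document.
    pi = rng.beta(alpha / num_topics, 1.0, size=num_topics)

    # Within-data relative weight, drawn independently of the prevalence.
    phi = rng.gamma(gamma, 1.0, size=num_topics)

    # Binary focus matrix: b[m, k] is True when document m exhibits topic k.
    b = rng.random((num_docs, num_topics)) < pi

    # Proportions: Dirichlet over the topics in each document's focus only.
    theta = np.zeros((num_docs, num_topics))
    for m in range(num_docs):
        active = np.flatnonzero(b[m])
        if active.size == 0:
            continue  # empty focus in this truncated sketch; row stays zero
        theta[m, active] = rng.dirichlet(phi[active])
    return pi, phi, b, theta

pi, phi, b, theta = sample_icd_proportions(num_docs=50, num_topics=20)
```

Because `pi` and `phi` are drawn independently in this sketch, a topic can have a low prevalence yet a large share of the documents that do select it, which is the behavior the abstract contrasts with the HDP.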

Cited Authors

  • Williamson, S; Wang, C; Heller, KA; Blei, DM

Published Date

  • September 17, 2010

Published In

  • ICML 2010 Proceedings, 27th International Conference on Machine Learning

Start / End Page

  • 1151 - 1158

Citation Source

  • Scopus