Augmented Latent Dirichlet Allocation (Lda) Topic Model with Gaussian Mixture Topics
Latent Dirichlet allocation (LDA) is a statistical model that is often used to discover topics or themes in a large collection of documents. In the LDA model, topics are modeled as discrete distributions over a finite vocabulary of words. The LDA is also a popular choice to model other datasets spanning a discrete domain, such as population genetics and social networks. However, in order to model data spanning a continuous domain with the LDA, discrete approximations of the data need to be made. These discrete approximations to continuous data can lead to loss of information and may not represent the true structure of the underlying data. We present an augmented version of the LDA topic model, where topics are represented using Gaussian mixture models (GMMs), which are multi-modal distributions spanning a continuous domain. This augmentation of the LDA topic model with Gaussian mixture topics is denoted by the GMM-LDA model. We use Gibbs sampling to infer model parameters. We demonstrate the utility of the GMM-LDA model by applying it to the problem of clustering sleep states in electroencephalography (EEG) data. Results are presented demonstrating superior clustering performance with our GMM-LDA algorithm compared to the standard LDA and other clustering algorithms.