Abstract
In studies of automatic text processing, probabilistic topic models are widely applied to infer word correlations through latent topic variables. Probabilistic latent semantic analysis (PLSA) is such a model: each word in a document is seen as a sample from a mixture model whose components are multinomial distributions. Although PLSA handles multiple topics, each topic model is quite simple and the word burstiness phenomenon is not taken into account. In this study, we present a new Bayesian topic mixture model (BTMM) to overcome the burstiness problem inherent in the multinomial distribution. Accordingly, we use the Dirichlet distribution to represent topic information beyond the document level. Conceptually, the documents in the same class are generated by the associated multinomial distribution. In experiments on the TREC text corpus, we report average precision and model perplexity to demonstrate the superiority of the proposed BTMM method.
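As a minimal sketch of the burstiness argument (not the paper's exact formulation): placing a Dirichlet prior on a multinomial word distribution and integrating it out yields the Dirichlet compound multinomial (Pólya) distribution, under which seeing a word once raises the probability of seeing it again. The function name and the toy counts below are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.special import gammaln

def dcm_log_likelihood(counts, alpha):
    """Log-likelihood of a word-count vector under the Dirichlet
    compound multinomial (Polya) distribution.

    Integrating the multinomial parameters against a Dirichlet(alpha)
    prior gives a distribution in which repeated occurrences of the
    same word are more likely -- the burstiness effect that a plain
    multinomial cannot capture.
    """
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    n = counts.sum()       # document length
    A = alpha.sum()        # Dirichlet concentration
    return (gammaln(A) - gammaln(A + n)
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

# With a small concentration (alpha << 1), a "bursty" document that
# repeats one word scores higher than one spreading the same total
# count over many words.
counts_bursty = np.array([5, 0, 0, 0])   # one word repeated five times
counts_spread = np.array([2, 1, 1, 1])   # same length, spread out
alpha = np.full(4, 0.1)
print(dcm_log_likelihood(counts_bursty, alpha))  # approx. -2.0
print(dcm_log_likelihood(counts_spread, alpha))  # approx. -12.1
```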
| Original language | English |
|---|---|
| Number of pages | 15 |
| State | Published - Sep 2007 |
| Event | 19th Conference on Computational Linguistics and Speech Processing, ROCLING 2007 - Taipei, Taiwan. Duration: 6 Sep 2007 → 7 Sep 2007 |
Conference
| Conference | 19th Conference on Computational Linguistics and Speech Processing, ROCLING 2007 |
|---|---|
| Country/Territory | Taiwan |
| City | Taipei |
| Period | 6/09/07 → 7/09/07 |
Keywords
- Bayesian model
- Dirichlet prior
- Graphical model
- Information retrieval
- PLSA