Document modeling is important for document retrieval and categorization. Probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) are popular document models in which word/document correlations are inferred through latent topics. Neither PLSA nor LDA explicitly represents unseen words and unseen documents at the same time, which constrains model generalization. This paper presents the Bayesian latent topic clustering (BLTC) model for document representation. The posterior distributions, formed by combining Dirichlet priors with multinomial distributions, are calculated not only at the document level but also at the word level, which addresses the modeling of unseen words and documents. An efficient variational inference method based on Gibbs sampling is presented to calculate the posterior probabilities of the complex variables. In experiments on TREC and Reuters-21578, the proposed BLTC outperforms PLSA and LDA in both model perplexity and classification accuracy.
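As a rough illustration of two ideas the abstract relies on, the sketch below shows how a Dirichlet prior combined with multinomial counts gives nonzero probability to an unseen word, and how perplexity is computed on held-out counts. It is not the BLTC model or its inference procedure. The function names, the symmetric prior value `alpha`, and the toy counts are all assumptions made for the example.

```python
import math


def dirichlet_posterior_mean(counts, alpha):
    """Posterior mean of multinomial word probabilities under a
    symmetric Dirichlet(alpha) prior (illustrative assumption)."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def perplexity(word_probs, test_counts):
    """exp(-average log-likelihood per token) on held-out counts."""
    n_tokens = sum(test_counts)
    log_lik = sum(
        c * math.log(p) for c, p in zip(test_counts, word_probs) if c > 0
    )
    return math.exp(-log_lik / n_tokens)


# Toy 4-word vocabulary; words 2 and 3 never occur in training.
train_counts = [3, 1, 0, 0]
# The held-out text contains word 2, which is unseen in training.
test_counts = [1, 0, 1, 0]

probs = dirichlet_posterior_mean(train_counts, alpha=0.5)
print(probs)
print(perplexity(probs, test_counts))
```

Because the prior adds pseudo-counts to every word, the unseen word keeps a small positive probability and the held-out perplexity stays finite. A maximum-likelihood estimate from the same counts would assign it zero probability.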
|Pages (from - to)||2162-2165|
|Journal||Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH|
|Publication status||Published - 1 December 2008|
|Event||INTERSPEECH 2008 - 9th Annual Conference of the International Speech Communication Association - Brisbane, QLD, Australia|
Duration: 22 September 2008 → 26 September 2008