A Statistical Approach for Optimal Topic Model Identification

Authors

Craig M. Lewis

Francesco Grossetti

Published

February 1, 2022


Abstract

Latent Dirichlet Allocation (LDA) is a popular machine-learning technique that identifies latent structures in a corpus of documents. This paper addresses the ongoing concern that no formal procedure exists for determining the optimal LDA configuration. It introduces a set of parametric tests that rely on the multinomial distribution assumed by the original LDA model. Our methodology defines a set of rigorous statistical procedures that identify and evaluate the optimal topic model. We use the U.S. Presidential Inaugural Address Corpus as a case study to illustrate the numerical results and find that 92 topics best describe the corpus. We further validate the method through a simulation study, which confirms that our approach outperforms standard heuristic metrics such as the perplexity index.

Citation

Lewis, C. M. & Grossetti, F. (2022). A Statistical Approach for Optimal Topic Model Identification. Journal of Machine Learning Research, 23(58), 1–20.

BibTeX

@article{lewis2022topic,
  title   = {A Statistical Approach for Optimal Topic Model Identification},
  author  = {Lewis, Craig M. and Grossetti, Francesco},
  journal = {Journal of Machine Learning Research},
  volume  = {23},
  number  = {58},
  pages   = {1--20},
  year    = {2022}
}