A Statistical Approach for Optimal Topic Model Identification
Abstract
Latent Dirichlet Allocation (LDA) is a popular machine-learning technique that identifies latent structures in a corpus of documents. No formal procedure currently exists for determining the optimal LDA configuration. This paper addresses that gap by introducing a set of parametric tests that rely on the multinomial distribution assumed in the original LDA model. Our methodology defines rigorous statistical procedures that identify and evaluate the optimal topic model. We use the U.S. Presidential Inaugural Address Corpus as a case study to illustrate the numerical results and find that 92 topics best describe the corpus. We further validate the method through a simulation study, which confirms that our approach outperforms standard heuristic metrics such as the perplexity index.
Citation
Lewis, C. M. & Grossetti, F. (2022). A Statistical Approach for Optimal Topic Model Identification. Journal of Machine Learning Research, 23(58), 1–20.
BibTeX
@article{lewis2022topic,
title = {A Statistical Approach for Optimal Topic Model Identification},
author = {Lewis, Craig M. and Grossetti, Francesco},
journal = {Journal of Machine Learning Research},
volume = {23},
number = {58},
pages = {1--20},
year = {2022}
}