Fit a topic model with a unified API across text2vec, topicmodels, seededlda, topicmodels.etm, and stm. The fitted object stores both the raw backend fit and, by default, cached document-topic-weights (DTW) and topic-term-weights (TWW) outputs following the convention of Lewis and Grossetti (2022).
Arguments
- x
A document-feature input. Supported classes are dgCMatrix-class, dfm, and DocumentTermMatrix.
- engine
Backend package. One of "text2vec", "topicmodels", "seededlda", "topicmodels.etm", or "stm".
- model
Model family within the selected backend. Supported combinations are:
  - engine = "text2vec" with model = "lda"
  - engine = "topicmodels" with model = "lda" or "ctm"
  - engine = "seededlda" with model = "lda", "seqlda", or "seededlda"
  - engine = "topicmodels.etm" with model = "etm"
  - engine = "stm" with model = "stm"
- k
Number of topics \(K\). Required for all supported models except engine = "seededlda", model = "seededlda".
- method
Fitting method within the selected model family.
  - topicmodels + lda: "VEM" (default) or "Gibbs"
  - topicmodels + ctm: "VEM" only
  - text2vec + lda: NULL only
  - seededlda: NULL only
  - topicmodels.etm + etm: NULL only
  - stm + stm: NULL only
- docvars
Should a compact document-variable table be stored alongside the fitted model? Defaults to TRUE. Stored docvars always include the fitted doc_id values and, when x is a dfm, any available document variables. This does not retain original text.
- doc_data
Optional sidecar document data to store for downstream enrichment. Accepted inputs are a corpus, data.frame, or data.table keyed by doc_id. Text can only be attached downstream when this sidecar contains text or when it is supplied as a corpus.
- return_dtw
Should document-topic-weights (DTW) be cached in the returned object? Defaults to TRUE.
- return_tww
Should topic-term-weights (TWW) be cached in the returned object? Defaults to TRUE.
- control
A named list of backend controls with optional model, fit, and optimizer entries. Use control$model for model-construction arguments, control$fit for fitting arguments, and control$optimizer for ETM optimizer arguments.
  - text2vec: control$model is forwarded to LDA$initialize() and control$fit is forwarded to LDA$fit_transform(); control$optimizer must be empty
  - topicmodels: control$model must be empty and control$fit is passed as backend control =; control$optimizer must be empty
  - seededlda: control$model must be empty and control$fit is spliced into the selected textmodel_*() call; control$optimizer must be empty
  - topicmodels.etm: control$model is forwarded to topicmodels.etm::ETM(), control$fit is forwarded to $fit(...), and control$optimizer is forwarded to torch::optim_adam(params = model$parameters, ...)
  - stm: control$fit is forwarded to stm::stm(), including prevalence, data, seed, max.em.its, init.type, and verbose; control$model and control$optimizer must be empty
- dictionary
Dictionary required for engine = "seededlda", model = "seededlda".
- seedwords
Optional seedwords argument forwarded only to engine = "topicmodels", model = "lda", method = "Gibbs".
- initial_model
Optional previously fitted model passed to backend model = arguments where supported.
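The engine and model pairings listed under the model argument can be written down as a small lookup table. The sketch below is illustrative only: supported_models and is_supported_combo() are hypothetical names and are not part of the package, which may validate its inputs differently.

```r
# Hypothetical lookup table mirroring the documented engine/model combinations.
supported_models <- list(
  text2vec        = "lda",
  topicmodels     = c("lda", "ctm"),
  seededlda       = c("lda", "seqlda", "seededlda"),
  topicmodels.etm = "etm",
  stm             = "stm"
)

# Hypothetical helper: TRUE when the pair is documented as supported.
is_supported_combo <- function(engine, model) {
  engine %in% names(supported_models) && model %in% supported_models[[engine]]
}

is_supported_combo("topicmodels", "ctm")
#> [1] TRUE
is_supported_combo("text2vec", "ctm")
#> [1] FALSE
```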
Value
An S3 object of class c("nlp_topic_fit", "list"). It is a named
list with these fields:
- engine: backend package used for estimation.
- model: model family requested.
- method: fitting method used, if applicable.
- model_object: raw backend fit.
- dtw: cached DTW matrix with doc_id rownames and Topic### columns, or NULL.
- tww: cached TWW matrix with Topic### rownames and term columns, or NULL.
- doc_ids: fitted document IDs in model order.
- vocab: fitted vocabulary in term order.
- docvars: compact stored docvars keyed by doc_id, or NULL.
- doc_data: stored sidecar document data, or NULL.
- hyperparameters: standardized topic-model hyperparameters. Use get_topic_hyperparameters() for a stable tabular accessor.
- backend_control: sanitized backend-native model, fit, and optimizer controls after defaults and package-level normalization are applied.
- call: matched function call.
Users access these components with $, for example fit$dtw or
fit$model_object.
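To make the field layout concrete without fitting anything, the sketch below builds a mock object with the documented class and a DTW matrix in the documented shape. mock_fit and its weights are made up for illustration; a real object comes from fit_topic_model() and carries all of the fields listed above.

```r
# Mock stand-in for a fitted object (illustrative values only).
mock_fit <- structure(
  list(
    engine = "text2vec",
    model = "lda",
    dtw = matrix(
      c(0.7, 0.3,
        0.2, 0.8),
      nrow = 2, byrow = TRUE,
      dimnames = list(c("doc1", "doc2"), c("Topic001", "Topic002"))
    ),
    doc_ids = c("doc1", "doc2")
  ),
  class = c("nlp_topic_fit", "list")
)

# Components are reached with `$`, as for any named list.
mock_fit$engine
rownames(mock_fit$dtw)

# Highest-weight topic per document, read off the DTW matrix.
colnames(mock_fit$dtw)[apply(mock_fit$dtw, 1, which.max)]
#> [1] "Topic001" "Topic002"
```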
Details
fit_topic_model() standardizes model fitting while preserving the original
backend object in model_object. That design avoids brittle inheritance
across R6, S4, and list-based classes while still providing a stable package
interface for downstream helpers such as get_dtw(), get_tww(),
predict_topic_model(), get_top_terms(), and plot_dtw().
The standardized DTW/TWW outputs always use topic identifiers of the form
Topic001, Topic002, and so on, regardless of backend-specific naming.
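Identifiers of this shape can be generated with zero-padded formatting. The snippet below is a minimal sketch that assumes a fixed three-digit pad, as suggested by the Topic### pattern; it is not necessarily how the package builds the labels internally.

```r
k <- 12

# Three-digit, zero-padded topic labels.
topic_ids <- sprintf("Topic%03d", seq_len(k))
head(topic_ids, 3)
#> [1] "Topic001" "Topic002" "Topic003"

# Zero padding keeps alphabetical order equal to numeric order.
identical(sort(topic_ids), topic_ids)
#> [1] TRUE
```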
Stored docvars and doc_data are used only for downstream alignment and
enrichment. They are never passed to the backend estimator itself.
ETM requires control$model$embeddings, supplied either as a single integer
embedding dimension or as a pretrained embedding matrix. When learned
embeddings are requested, vocab defaults to the input terms unless
explicitly supplied. When pretrained embeddings are supplied, the input
vocabulary is aligned to the embedding rownames; unmatched terms and any
documents that become empty after alignment are dropped with a warning while
preserving surviving doc_id, docvars, and doc_data alignment.
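The alignment rule for pretrained embeddings can be sketched in base R. This is an illustration of the described behavior on a dense toy matrix, not the package's internal code; the object names (aligned, shared, dropped_terms, keep_docs) are hypothetical.

```r
# Toy document-term counts; "term3" has no pretrained embedding.
dtm <- matrix(
  c(1, 0, 2,
    0, 0, 1,
    1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), paste0("term", 1:3))
)
embeddings <- matrix(
  seq_len(2 * 4), nrow = 2,
  dimnames = list(c("term1", "term2"), NULL)
)

# Keep only terms that have an embedding row.
shared <- intersect(colnames(dtm), rownames(embeddings))
dropped_terms <- setdiff(colnames(dtm), shared)
aligned <- dtm[, shared, drop = FALSE]

# Drop documents left with no tokens after alignment, and say so.
keep_docs <- rowSums(aligned) > 0
if (length(dropped_terms) > 0 || any(!keep_docs)) {
  warning(
    "Dropped ", length(dropped_terms), " term(s) and ",
    sum(!keep_docs), " empty document(s) during alignment.",
    call. = FALSE
  )
}
aligned <- aligned[keep_docs, , drop = FALSE]
embeddings <- embeddings[shared, , drop = FALSE]

rownames(aligned)
#> [1] "doc1" "doc3"
```

Here doc2 contains only term3, so it becomes empty once term3 is removed and is dropped along with it. Any docvars or doc_data rows would need the same keep_docs filter to stay aligned.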
Using engine = "topicmodels.etm" also requires both the topicmodels.etm
package and a working torch backend. Installing the R torch package is
not sufficient by itself on a clean machine; run torch::install_torch() and
confirm that torch::torch_is_installed() returns TRUE.
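A guard along the following lines checks all three requirements before an ETM fit is attempted. It uses only requireNamespace() and torch::torch_is_installed(); the variable name etm_ready is arbitrary.

```r
# TRUE only when both packages are installed and the torch backend is usable.
etm_ready <- requireNamespace("topicmodels.etm", quietly = TRUE) &&
  requireNamespace("torch", quietly = TRUE) &&
  torch::torch_is_installed()

if (!etm_ready) {
  message(
    "ETM backend unavailable: install topicmodels.etm and torch, ",
    "then run torch::install_torch()."
  )
}
```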
STM support covers prevalence covariates but not content
covariates. Pass prevalence formulas and metadata through
control$fit$prevalence and control$fit$data. If x is a
dfm with document variables, those docvars are used as STM
metadata when control$fit$data is omitted. STM content covariates are not
supported because they imply covariate-specific topic-word
distributions, while NLPstudio currently standardizes one TWW matrix per
fit. For STM interpretation after fitting, use get_stm_topic_labels(),
summarize_stm_topics(), and estimate_stm_topic_effects().
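As a sketch of how the prevalence inputs might be assembled, the snippet below builds a control list for the STM backend. The metadata frame and its group column are hypothetical, and the rows are assumed to be in the same order as the documents in x.

```r
# Hypothetical metadata: one row per document, in document order.
meta <- data.frame(group = factor(c("a", "a", "b", "b")))

stm_control <- list(
  fit = list(
    prevalence = ~ group,  # prevalence covariates only
    data       = meta,
    seed       = 1,
    max.em.its = 10,
    verbose    = FALSE
  )
)

# Every variable named in the prevalence formula should exist in the metadata.
all(all.vars(stm_control$fit$prevalence) %in% names(stm_control$fit$data))
#> [1] TRUE

# The fit itself would then be requested like this (not run here; needs stm):
# fit_topic_model(dtm, engine = "stm", model = "stm", k = 2, control = stm_control)
```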
The API currently covers these model families and fitting algorithms:
- LDA via text2vec, topicmodels, and seededlda
- CTM via topicmodels
- WarpLDA as the text2vec estimation algorithm for LDA
- Sequential LDA via seededlda
- Seeded LDA via seededlda
- Embedded Topic Models via topicmodels.etm
- Structural Topic Models with prevalence covariates via stm
References
Lewis, C. M., & Grossetti, F. (2022). A statistical approach for optimal topic model identification. Journal of Machine Learning Research, 23(58), 1-20.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
Blei, D. M., & Lafferty, J. D. (2006). Correlated topic models. Advances in Neural Information Processing Systems, 18, 147.
Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of Science. The Annals of Applied Statistics, 1(1), 17-35.
Chen, J., Li, K., Zhu, J., & Chen, W. (2016). WarpLDA: A Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation. Proceedings of the VLDB Endowment, 9(10), 744-755.
Dieng, A. B., Ruiz, F. J. R., & Blei, D. M. (2020). Topic Modeling in Embedding Spaces. Transactions of the Association for Computational Linguistics, 8, 439-453.
Du, L., Buntine, W. L., Jin, H., & Chen, C. (2012). Sequential latent Dirichlet allocation. Knowledge and Information Systems, 31(3), 475-503.
Lu, B., Ott, M., Cardie, C., & Tsou, B. K. (2011). Multi-aspect sentiment analysis with topic models. In 2011 IEEE 11th International Conference on Data Mining Workshops, 81-88.
Jagarlamudi, J., Daume III, H., & Udupa, R. (2012). Incorporating lexical priors into topic models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 204-213.
Watanabe, K., & Zhou, Y. (2022). Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches. Social Science Computer Review, 40(2), 346-366. DOI: 10.1177/0894439320907027.
Watanabe, K., & Baturo, A. (2024). Seeded Sequential LDA: A Semi-Supervised Algorithm for Topic-Specific Analysis of Sentences. Social Science Computer Review, 42(1), 224-248. DOI: 10.1177/08944393231178605.
Examples
# Toy 4-document, 3-term count matrix stored as a sparse dgCMatrix
dtm <- methods::as(
  Matrix::Matrix(
    matrix(
      c(1, 0, 1,
        1, 1, 0,
        0, 1, 1,
        1, 1, 1),
      nrow = 4,
      byrow = TRUE
    ),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:4)
colnames(dtm) <- paste0("term", 1:3)

# LDA via text2vec: priors go in control$model, fitting options in control$fit
fit <- fit_topic_model(
  dtm,
  engine = "text2vec",
  model = "lda",
  k = 2,
  control = list(
    model = list(doc_topic_prior = 0.1, topic_word_prior = 0.01),
    fit = list(n_iter = 25, progressbar = FALSE)
  )
)
class(fit)
#> [1] "nlp_topic_fit" "list"
names(fit)
#> [1] "engine" "model" "method" "model_object"
#> [5] "dtw" "tww" "doc_ids" "vocab"
#> [9] "docvars" "doc_data" "hyperparameters" "backend_control"
#> [13] "call"
# LDA and CTM via topicmodels; only the last fit in the block is printed
if (requireNamespace("topicmodels", quietly = TRUE)) {
  fit_topic_model(
    dtm,
    engine = "topicmodels",
    model = "lda",
    k = 2,
    control = list(
      fit = list(seed = 1, em = list(iter.max = 5), var = list(iter.max = 5))
    )
  )
  fit_topic_model(
    dtm,
    engine = "topicmodels",
    model = "ctm",
    k = 2,
    control = list(
      fit = list(seed = 1, em = list(iter.max = 5), var = list(iter.max = 5))
    )
  )
}
#> <nlp_topic_fit>
#> engine: topicmodels
#> model: ctm (VEM)
#> documents: 4
#> topics: 2
#> terms: 3
#> cached DTW: TRUE
#> cached TWW: TRUE
#> stored docvars: TRUE
#> stored doc_data: FALSE
# LDA, sequential LDA, and seeded LDA via seededlda
if (requireNamespace("seededlda", quietly = TRUE)) {
  fit_topic_model(
    dtm,
    engine = "seededlda",
    model = "lda",
    k = 2,
    control = list(fit = list(max_iter = 100, verbose = FALSE))
  )
  suppressWarnings(
    fit_topic_model(
      dtm,
      engine = "seededlda",
      model = "seqlda",
      k = 2,
      control = list(fit = list(max_iter = 100, verbose = FALSE))
    )
  )
  # Seeded LDA takes its topics from the dictionary, so k is omitted
  dict <- quanteda::dictionary(list(
    topic_a = c("term1", "term2"),
    topic_b = c("term3")
  ))
  fit_topic_model(
    dtm,
    engine = "seededlda",
    model = "seededlda",
    dictionary = dict,
    control = list(fit = list(max_iter = 100, verbose = FALSE))
  )
}
#> <nlp_topic_fit>
#> engine: seededlda
#> model: seededlda
#> documents: 4
#> topics: 2
#> terms: 3
#> cached DTW: TRUE
#> cached TWW: TRUE
#> stored docvars: TRUE
#> stored doc_data: FALSE
if (FALSE) { # requireNamespace("topicmodels.etm", quietly = TRUE) && requireNamespace("torch", quietly = TRUE) && torch::torch_is_installed()
  # ETM with learned embeddings: embeddings is a single integer dimension
  fit_topic_model(
    dtm,
    engine = "topicmodels.etm",
    model = "etm",
    k = 2,
    control = list(
      model = list(embeddings = 5),
      fit = list(epoch = 5, batch_size = 2, normalize = TRUE),
      optimizer = list(lr = 0.005, weight_decay = 1.2e-06)
    )
  )
  # ETM with pretrained embeddings: rownames must match the input vocabulary
  embeddings <- matrix(
    seq_len(ncol(dtm) * 4),
    nrow = ncol(dtm),
    ncol = 4,
    dimnames = list(colnames(dtm), NULL)
  )
  fit_topic_model(
    dtm,
    engine = "topicmodels.etm",
    model = "etm",
    k = 2,
    control = list(
      model = list(embeddings = embeddings),
      fit = list(epoch = 5, batch_size = 2)
    )
  )
}
