Fit a Topic Model Via a Unified API

Fit a topic model through a single interface to the text2vec, topicmodels, seededlda, topicmodels.etm, and stm backends. The fitted object stores the raw backend fit and, by default, cached document-topic-weights (DTW) and topic-term-weights (TWW) matrices, following the convention of Lewis and Grossetti (2022).

Usage

fit_topic_model(
  x,
  engine,
  model,
  k = NULL,
  method = NULL,
  docvars = TRUE,
  doc_data = NULL,
  return_dtw = TRUE,
  return_tww = TRUE,
  keep_backend_data = FALSE,
  control = list(model = list(), fit = list(), optimizer = list()),
  dictionary = NULL,
  seedwords = NULL,
  initial_model = NULL
)

Arguments

x

A document-feature input. Supported classes are dgCMatrix, dfm, and DocumentTermMatrix.

engine

Backend package. One of "text2vec", "topicmodels", "seededlda", "topicmodels.etm", or "stm".

model

Model family within the selected backend. Supported combinations are:

engine = "text2vec" with model = "lda"
engine = "topicmodels" with model = "lda" or "ctm"
engine = "seededlda" with model = "lda", "seqlda", or "seededlda"
engine = "topicmodels.etm" with model = "etm"
engine = "stm" with model = "stm"
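
The supported combinations listed above can be pictured as a small lookup table. The sketch below is illustrative only: supported_models and is_supported_combo() are hypothetical names that are not part of the package, and the package's own validation may work differently.

```r
# Hypothetical lookup of the engine/model pairs listed above.
supported_models <- list(
  text2vec          = "lda",
  topicmodels       = c("lda", "ctm"),
  seededlda         = c("lda", "seqlda", "seededlda"),
  `topicmodels.etm` = "etm",
  stm               = "stm"
)

# TRUE when `model` is listed for `engine`; unknown engines give FALSE
# because `[[` matches names exactly and returns NULL otherwise.
is_supported_combo <- function(engine, model) {
  model %in% supported_models[[engine]]
}

is_supported_combo("topicmodels", "ctm")
#> [1] TRUE
is_supported_combo("text2vec", "ctm")
#> [1] FALSE
```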

k

Number of topics $K$. Required for all supported models except engine = "seededlda", model = "seededlda".

method

Fitting method within the selected model family.

topicmodels + lda: "VEM" (default) or "Gibbs"
topicmodels + ctm: "VEM" only
text2vec + lda: NULL only
seededlda: NULL only
topicmodels.etm + etm: NULL only
stm + stm: NULL only

docvars

Should a compact document-variable table be stored alongside the fitted model? Defaults to TRUE. Stored docvars always include the fitted doc_id values and, when x is a dfm, any available document variables. This does not retain original text.

doc_data

Optional sidecar document data to store for downstream enrichment. Accepted inputs are a corpus, data.frame, or data.table keyed by doc_id. Text can only be attached downstream when this sidecar contains text or when it is supplied as a corpus.
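
Keying by doc_id means sidecar rows can be matched to the fitted documents regardless of the order in which they were supplied. The base R sketch below shows the general idea with made-up data; how the package performs the alignment internally is not specified here and is an assumption.

```r
# Fitted document order (made-up IDs for illustration).
doc_ids <- c("doc2", "doc1", "doc3")

# Sidecar data in a different order, with one extra document.
doc_data <- data.frame(
  doc_id = c("doc1", "doc2", "doc3", "doc4"),
  text   = c("first text", "second text", "third text", "unused text"),
  stringsAsFactors = FALSE
)

# Reorder the sidecar rows to follow the fitted document order.
aligned <- doc_data[match(doc_ids, doc_data$doc_id), , drop = FALSE]
rownames(aligned) <- NULL
aligned$doc_id
#> [1] "doc2" "doc1" "doc3"
```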

return_dtw

Should document-topic-weights (DTW) be cached in the returned object? Defaults to TRUE.

return_tww

Should topic-term-weights (TWW) be cached in the returned object? Defaults to TRUE.

keep_backend_data

Should backend model objects that retain the full input document-feature matrix keep it? Defaults to FALSE. This currently affects only engine = "seededlda", whose textmodel_*() fits store the entire input dfm in model_object$data; by default that slot is replaced with a zero-count dfm of identical dimensions, dimnames, and docvars, so prediction, printing, and every extraction path keep working while the fit no longer carries a full copy of the corpus. Set to TRUE to retain the original counts (e.g. if you post-process the raw backend object with seededlda functions that read them). See the Memory profile of a fit section.
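
The idea of a zero-count stand-in with identical dimensions and dimnames can be sketched on a plain sparse matrix. This is an illustration using the Matrix package, not the package's actual code: empty_like() is a hypothetical helper, and a real dfm also carries docvars that this sketch does not handle.

```r
library(Matrix)

# A small sparse count matrix standing in for a document-feature matrix.
counts <- sparseMatrix(
  i = c(1, 2, 2), j = c(1, 1, 3), x = c(2, 1, 4),
  dims = c(2, 3),
  dimnames = list(c("doc1", "doc2"), c("a", "b", "c"))
)

# Hypothetical helper: same shape and names, but no stored counts.
empty_like <- function(m) {
  sparseMatrix(
    i = integer(0), j = integer(0), x = numeric(0),
    dims = dim(m), dimnames = dimnames(m)
  )
}

stub <- empty_like(counts)
dim(stub)
sum(stub)
```

Code that only needs the shape or the document and term names keeps working on the stub, while the counts themselves are no longer stored.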

control

A named list of backend controls with optional model, fit, and optimizer entries. Use control$model for model-construction arguments, control$fit for fitting arguments, and control$optimizer for ETM optimizer arguments.

text2vec: control$model is forwarded to LDA$initialize() and control$fit is forwarded to LDA$fit_transform(); control$optimizer must be empty
topicmodels: control$model must be empty and control$fit is passed as the backend's control = argument; control$optimizer must be empty
seededlda: control$model must be empty and control$fit is spliced into the selected textmodel_*() call; control$optimizer must be empty
topicmodels.etm: control$model is forwarded to topicmodels.etm::ETM(), control$fit is forwarded to $fit(...), and control$optimizer is forwarded to torch::optim_adam(params = model$parameters, ...)
stm: control$fit is forwarded to stm::stm(), including prevalence, data, seed, max.em.its, init.type, and verbose; control$model and control$optimizer must be empty
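
The three-entry shape of control can be illustrated with a small helper that fills in missing entries with empty lists and rejects unknown ones. complete_control() is a hypothetical name, and whether the package fills in omitted entries in exactly this way is an assumption.

```r
# Hypothetical helper: return a control list that always has the
# model, fit, and optimizer entries, in that order.
complete_control <- function(control = list()) {
  defaults <- list(model = list(), fit = list(), optimizer = list())
  unknown <- setdiff(names(control), names(defaults))
  if (length(unknown) > 0) {
    stop("Unknown control entries: ", paste(unknown, collapse = ", "))
  }
  utils::modifyList(defaults, control)
}

# An ETM-style control that uses all three entries.
etm_control <- complete_control(list(
  model     = list(embeddings = 50),
  fit       = list(epoch = 5, batch_size = 2),
  optimizer = list(lr = 0.005)
))

# A control that sets only fitting arguments still has all three entries.
names(complete_control(list(fit = list(n_iter = 25))))
#> [1] "model"     "fit"       "optimizer"
```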

dictionary

Dictionary required for engine = "seededlda", model = "seededlda".

seedwords

Optional seedwords argument forwarded only to engine = "topicmodels", model = "lda", method = "Gibbs".

initial_model

Optional previously fitted model passed to backend model = arguments where supported.

Value

An S3 object of class c("nlp_topic_fit", "list"). It is a named list with these fields:

engine: backend package used for estimation.
model: model family requested.
method: fitting method used, if applicable.
model_object: raw backend fit.
dtw: cached DTW matrix with doc_id rownames and Topic### columns, or NULL.
tww: cached TWW matrix with Topic### rownames and term columns, or NULL.
doc_ids: fitted document IDs in model order.
vocab: fitted vocabulary in term order.
docvars: compact stored docvars keyed by doc_id, or NULL.
doc_data: stored sidecar document data, or NULL.
hyperparameters: standardized topic-model hyperparameters. Use get_topic_hyperparameters() for a stable tabular accessor.
backend_control: sanitized backend-native model, fit, and optimizer controls after defaults and package-level normalization are applied.
call: matched function call.

Users access these components with $, for example fit$dtw or fit$model_object.

Details

fit_topic_model() standardizes model fitting while preserving the original backend object in model_object. That design avoids brittle inheritance across R6, S4, and list-based classes while still providing a stable package interface for downstream helpers such as get_dtw(), get_tww(), predict_topic_model(), get_top_terms(), and plot_dtw().

The standardized DTW/TWW outputs always use topic identifiers of the form Topic001, Topic002, and so on, regardless of backend-specific naming.
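
Identifiers of this form can be generated with zero-padded formatting. The helper name topic_ids() below is hypothetical, and the three-digit padding is inferred from the Topic001 pattern rather than taken from the package source.

```r
# Zero-padded topic identifiers: Topic001, Topic002, ...
topic_ids <- function(k) sprintf("Topic%03d", seq_len(k))

topic_ids(3)
#> [1] "Topic001" "Topic002" "Topic003"

# Relabel the columns of a made-up document-topic matrix.
dtw <- matrix(c(0.7, 0.3, 0.2, 0.8), nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), c("1", "2")))
colnames(dtw) <- topic_ids(ncol(dtw))
colnames(dtw)
#> [1] "Topic001" "Topic002"
```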

Stored docvars and doc_data are used only for downstream alignment and enrichment. They are never passed to the backend estimator itself.

Memory profile of a fit

An nlp_topic_fit can hold up to three representations of the model's weight matrices: the backend model_object (whose gamma/theta and beta/phi live inside it for the topicmodels, seededlda, and stm engines), the standardized dense dtw (documents x topics), and the standardized dense tww (topics x vocabulary). The tww cache dominates for large or n-gram vocabularies: it holds 8 * k * length(vocab) bytes.

Setting return_tww = FALSE or return_dtw = FALSE skips the corresponding standardized cache. get_tww(), get_dtw(), get_top_terms(), evaluate_topic_model(), and summarize_topics() then reconstruct the matrix on demand from model_object, at most once per call, so lean fits evaluate at the same speed as cached ones and the flags are purely a memory saving. The exception is DTW for text2vec and ETM fits, which cannot be reconstructed after the fact; keep return_dtw = TRUE for those engines.
keep_backend_data = FALSE (default) additionally stops seededlda fits from carrying a full copy of the input dfm inside model_object$data.
The corpus itself is never required by a fitted object: training is passed explicitly to evaluate_topic_model() when needed.
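
As a worked example of the tww size formula given above, the sketch below computes the cache size for a made-up configuration of 100 topics and 500,000 terms. tww_cache_bytes() is a hypothetical helper; the factor 8 is the size of one double-precision value.

```r
# Bytes held by a dense k x length(vocab) matrix of doubles.
tww_cache_bytes <- function(k, n_terms) 8 * k * n_terms

# 100 topics over a 500,000-term n-gram vocabulary, in megabytes.
tww_cache_bytes(100, 5e5) / 1e6
#> [1] 400

# A real dense matrix is slightly larger because of its object header.
as.numeric(object.size(matrix(0, nrow = 10, ncol = 1000))) >=
  tww_cache_bytes(10, 1000)
#> [1] TRUE
```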

ETM requires control$model$embeddings, supplied either as a single integer embedding dimension or as a pretrained embedding matrix. When learned embeddings are requested, vocab defaults to the input terms unless explicitly supplied. When pretrained embeddings are supplied, the input vocabulary is aligned to the embedding rownames; unmatched terms and any documents that become empty after alignment are dropped with a warning while preserving surviving doc_id, docvars, and doc_data alignment. Using engine = "topicmodels.etm" also requires both the topicmodels.etm package and a working torch backend. Installing the R torch package is not sufficient by itself on a clean machine; run torch::install_torch() and confirm that torch::torch_is_installed() returns TRUE.
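
The alignment to pretrained embeddings described above can be sketched in base R on a dense toy matrix. align_to_embeddings() is a hypothetical helper written for illustration; the package's own alignment, its warning text, and its handling of docvars and doc_data may differ.

```r
# Keep only terms that have an embedding row, then drop documents
# that have no remaining counts, warning about anything removed.
align_to_embeddings <- function(dtm, embeddings) {
  keep_terms <- intersect(colnames(dtm), rownames(embeddings))
  aligned <- dtm[, keep_terms, drop = FALSE]
  keep_docs <- rowSums(aligned) > 0
  n_terms_dropped <- ncol(dtm) - length(keep_terms)
  n_docs_dropped <- sum(!keep_docs)
  if (n_terms_dropped > 0 || n_docs_dropped > 0) {
    warning(sprintf("Dropped %d unmatched term(s) and %d empty document(s).",
                    n_terms_dropped, n_docs_dropped))
  }
  list(
    dtm = aligned[keep_docs, , drop = FALSE],
    embeddings = embeddings[keep_terms, , drop = FALSE]
  )
}

counts <- matrix(c(1, 0, 2,
                   0, 0, 3,
                   1, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3), c("alpha", "beta", "gamma")))
emb <- matrix(seq_len(8) / 10, nrow = 2,
              dimnames = list(c("alpha", "beta"), NULL))

# "gamma" has no embedding, and doc2 contains only "gamma",
# so one term and one document are dropped with a warning.
res <- align_to_embeddings(counts, emb)
rownames(res$dtm)
#> [1] "doc1" "doc3"
```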

STM support covers prevalence covariates but not content covariates. Pass prevalence formulas and metadata through control$fit$prevalence and control$fit$data. If x is a dfm with document variables, those docvars are used as STM metadata when control$fit$data is omitted. STM content covariates are not supported because they imply covariate-specific topic-word distributions, while NLPstudio currently standardizes one TWW matrix per fit. For STM interpretation after fitting, use get_stm_topic_labels(), summarize_stm_topics(), and estimate_stm_topic_effects().

The API currently covers these model families and fitting algorithms:

LDA via text2vec, topicmodels, and seededlda
CTM via topicmodels
WarpLDA as the text2vec estimation algorithm for LDA
Sequential LDA via seededlda
Seeded LDA via seededlda
Embedded Topic Models via topicmodels.etm
Structural Topic Models with prevalence covariates via stm

References

Lewis, C. M., & Grossetti, F. (2022). A statistical approach for optimal topic model identification. Journal of Machine Learning Research, 23(58), 1-20.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.

Blei, D. M., & Lafferty, J. D. (2006). Correlated topic models. Advances in Neural Information Processing Systems, 18, 147.

Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of Science. The Annals of Applied Statistics, 1(1), 17-35. doi:10.1214/07-AOAS114

Chen, J., Li, K., Zhu, J., & Chen, W. (2016). WarpLDA: A Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation. Proceedings of the VLDB Endowment, 9(10), 744-755.

Dieng, A. B., Ruiz, F. J. R., & Blei, D. M. (2020). Topic Modeling in Embedding Spaces. Transactions of the Association for Computational Linguistics, 8, 439-453.

Du, L., Buntine, W. L., Jin, H., & Chen, C. (2012). Sequential latent Dirichlet allocation. Knowledge and Information Systems, 31(3), 475-503. doi:10.1007/s10115-011-0425-1

Lu, B., Ott, M., Cardie, C., & Tsou, B. K. (2011). Multi-aspect sentiment analysis with topic models. In 2011 IEEE 11th International Conference on Data Mining Workshops, 81-88. doi:10.1109/ICDMW.2011.125

Jagarlamudi, J., Daume III, H., & Udupa, R. (2012). Incorporating lexical priors into topic models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 204-213.

Watanabe, K., & Zhou, Y. (2022). Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches. Social Science Computer Review, 40(2), 346-366. doi:10.1177/0894439320907027

Watanabe, K., & Baturo, A. (2024). Seeded Sequential LDA: A Semi-Supervised Algorithm for Topic-Specific Analysis of Sentences. Social Science Computer Review, 42(1), 224-248. doi:10.1177/08944393231178605

Examples

dtm <- methods::as(
  Matrix::Matrix(
    matrix(
      c(1, 0, 1,
        1, 1, 0,
        0, 1, 1,
        1, 1, 1),
      nrow = 4,
      byrow = TRUE
    ),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:4)
colnames(dtm) <- paste0("term", 1:3)

fit <- fit_topic_model(
  dtm,
  engine = "text2vec",
  model = "lda",
  k = 2,
  control = list(
    model = list(doc_topic_prior = 0.1, topic_word_prior = 0.01),
    fit = list(n_iter = 25, progressbar = FALSE)
  )
)

class(fit)
#> [1] "nlp_topic_fit" "list"         
names(fit)
#>  [1] "engine"          "model"           "method"          "model_object"   
#>  [5] "dtw"             "tww"             "doc_ids"         "vocab"          
#>  [9] "docvars"         "doc_data"        "hyperparameters" "backend_control"
#> [13] "call"           

if (requireNamespace("topicmodels", quietly = TRUE)) {
  fit_topic_model(
    dtm,
    engine = "topicmodels",
    model = "lda",
    k = 2,
    control = list(
      fit = list(seed = 1, em = list(iter.max = 5), var = list(iter.max = 5))
    )
  )

  fit_topic_model(
    dtm,
    engine = "topicmodels",
    model = "ctm",
    k = 2,
    control = list(
      fit = list(seed = 1, em = list(iter.max = 5), var = list(iter.max = 5))
    )
  )
}
#> <nlp_topic_fit>
#>   engine: topicmodels
#>   model: ctm (VEM)
#>   documents: 4
#>   topics: 2
#>   terms: 3
#>   cached DTW: TRUE
#>   cached TWW: TRUE
#>   stored docvars: TRUE
#>   stored doc_data: FALSE

if (requireNamespace("seededlda", quietly = TRUE)) {
  fit_topic_model(
    dtm,
    engine = "seededlda",
    model = "lda",
    k = 2,
    control = list(fit = list(max_iter = 100, verbose = FALSE))
  )

  suppressWarnings(
    fit_topic_model(
      dtm,
      engine = "seededlda",
      model = "seqlda",
      k = 2,
      control = list(fit = list(max_iter = 100, verbose = FALSE))
    )
  )

  dict <- quanteda::dictionary(list(
    topic_a = c("term1", "term2"),
    topic_b = c("term3")
  ))

  fit_topic_model(
    dtm,
    engine = "seededlda",
    model = "seededlda",
    dictionary = dict,
    control = list(fit = list(max_iter = 100, verbose = FALSE))
  )
}
#> <nlp_topic_fit>
#>   engine: seededlda
#>   model: seededlda
#>   documents: 4
#>   topics: 2
#>   terms: 3
#>   cached DTW: TRUE
#>   cached TWW: TRUE
#>   stored docvars: TRUE
#>   stored doc_data: FALSE
if (requireNamespace("topicmodels.etm", quietly = TRUE) &&
    requireNamespace("torch", quietly = TRUE) &&
    torch::torch_is_installed()) {
fit_topic_model(
  dtm,
  engine = "topicmodels.etm",
  model = "etm",
  k = 2,
  control = list(
    model = list(embeddings = 5),
    fit = list(epoch = 5, batch_size = 2, normalize = TRUE),
    optimizer = list(lr = 0.005, weight_decay = 1.2e-06)
  )
)

embeddings <- matrix(
  seq_len(ncol(dtm) * 4),
  nrow = ncol(dtm),
  ncol = 4,
  dimnames = list(colnames(dtm), NULL)
)

fit_topic_model(
  dtm,
  engine = "topicmodels.etm",
  model = "etm",
  k = 2,
  control = list(
    model = list(embeddings = embeddings),
    fit = list(epoch = 5, batch_size = 2)
  )
)
}