
Fit a topic model with a unified API across text2vec, topicmodels, seededlda, topicmodels.etm, and stm. The fitted object stores both the raw backend fit and, by default, cached document-topic-weight (DTW) and topic-term-weight (TWW) matrices following the convention of Lewis and Grossetti (2022).

Usage

fit_topic_model(
  x,
  engine,
  model,
  k = NULL,
  method = NULL,
  docvars = TRUE,
  doc_data = NULL,
  return_dtw = TRUE,
  return_tww = TRUE,
  control = list(model = list(), fit = list(), optimizer = list()),
  dictionary = NULL,
  seedwords = NULL,
  initial_model = NULL
)

Arguments

x

A document-feature input. Supported classes are dgCMatrix, dfm, and DocumentTermMatrix.

engine

Backend package. One of "text2vec", "topicmodels", "seededlda", "topicmodels.etm", or "stm".

model

Model family within the selected backend. Supported combinations are:

  • engine = "text2vec" with model = "lda"

  • engine = "topicmodels" with model = "lda" or "ctm"

  • engine = "seededlda" with model = "lda", "seqlda", or "seededlda"

  • engine = "topicmodels.etm" with model = "etm"

  • engine = "stm" with model = "stm"
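
The supported pairs above can be written down as a small lookup table. The sketch below is illustrative only: supported_models and is_supported() are hypothetical names, not part of the package, and the package's own validation may work differently.

```r
# Hypothetical lookup of the engine/model pairs listed above.
supported_models <- list(
  text2vec        = "lda",
  topicmodels     = c("lda", "ctm"),
  seededlda       = c("lda", "seqlda", "seededlda"),
  topicmodels.etm = "etm",
  stm             = "stm"
)

# Hypothetical helper: TRUE when the engine/model pair appears in the table.
is_supported <- function(engine, model) {
  engine %in% names(supported_models) &&
    model %in% supported_models[[engine]]
}

is_supported("topicmodels", "ctm")  # listed pair
is_supported("text2vec", "ctm")     # not a listed pair
```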

k

Number of topics \(K\). Required for all supported models except engine = "seededlda", model = "seededlda".

method

Fitting method within the selected model family.

  • topicmodels + lda: "VEM" (default) or "Gibbs"

  • topicmodels + ctm: "VEM" only

  • text2vec + lda: NULL only

  • seededlda: NULL only

  • topicmodels.etm + etm: NULL only

  • stm + stm: NULL only

docvars

Should a compact document-variable table be stored alongside the fitted model? Defaults to TRUE. Stored docvars always include the fitted doc_id values and, when x is a dfm, any available document variables. This does not retain original text.

doc_data

Optional sidecar document data to store for downstream enrichment. Accepted inputs are a corpus, data.frame, or data.table keyed by doc_id. Text can only be attached downstream when this sidecar contains text or when it is supplied as a corpus.
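
As an illustration of what "keyed by doc_id" implies, the base R sketch below reorders a sidecar table to match a vector of fitted document IDs. The object names and values are made up for the example, and this is not necessarily how the package performs the alignment internally.

```r
# Sidecar table in arbitrary order, keyed by doc_id (toy data).
sidecar <- data.frame(
  doc_id = c("doc3", "doc1", "doc2"),
  text   = c("third text", "first text", "second text"),
  stringsAsFactors = FALSE
)

# Document IDs in the order the model saw them (toy values).
fitted_ids <- c("doc1", "doc2", "doc3")

# match() gives, for each fitted ID, its row position in the sidecar.
aligned <- sidecar[match(fitted_ids, sidecar$doc_id), , drop = FALSE]
rownames(aligned) <- NULL
aligned$text
```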

return_dtw

Should document-topic-weights (DTW) be cached in the returned object? Defaults to TRUE.

return_tww

Should topic-term-weights (TWW) be cached in the returned object? Defaults to TRUE.

control

A named list of backend controls with optional model, fit, and optimizer entries. Use control$model for model-construction arguments, control$fit for fitting arguments, and control$optimizer for ETM optimizer arguments.

  • text2vec: control$model is forwarded to LDA$initialize() and control$fit is forwarded to LDA$fit_transform(); control$optimizer must be empty

  • topicmodels: control$model must be empty and control$fit is passed as backend control =; control$optimizer must be empty

  • seededlda: control$model must be empty and control$fit is spliced into the selected textmodel_*() call; control$optimizer must be empty

  • topicmodels.etm: control$model is forwarded to topicmodels.etm::ETM(), control$fit is forwarded to $fit(...), and control$optimizer is forwarded to torch::optim_adam(params = model$parameters, ...)

  • stm: control$fit is forwarded to stm::stm(), including prevalence, data, seed, max.em.its, init.type, and verbose; control$model and control$optimizer must be empty
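
Because every entry of control is optional, a partial list is a common input. The sketch below shows one way such a list could be completed to the three-entry shape; complete_control() is a hypothetical helper written for this example, not a function exported by the package.

```r
# Hypothetical helper: fill in any missing entries with empty lists so the
# result always has model, fit, and optimizer components.
complete_control <- function(control = list()) {
  template <- list(model = list(), fit = list(), optimizer = list())
  utils::modifyList(template, control)
}

# A topicmodels-style control: only `fit` is populated.
ctrl <- complete_control(list(fit = list(seed = 1)))
names(ctrl)
lengths(ctrl)
```

With this shape, the per-engine rules above ("must be empty") reduce to checks such as length(ctrl$optimizer) == 0.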

dictionary

Dictionary required for engine = "seededlda", model = "seededlda".

seedwords

Optional seedwords argument forwarded only to engine = "topicmodels", model = "lda", method = "Gibbs".

initial_model

Optional previously fitted model passed to backend model = arguments where supported.

Value

An S3 object of class c("nlp_topic_fit", "list"). It is a named list with these fields:

  • engine: backend package used for estimation.

  • model: model family requested.

  • method: fitting method used, if applicable.

  • model_object: raw backend fit.

  • dtw: cached DTW matrix with doc_id rownames and Topic### columns, or NULL.

  • tww: cached TWW matrix with Topic### rownames and term columns, or NULL.

  • doc_ids: fitted document IDs in model order.

  • vocab: fitted vocabulary in term order.

  • docvars: compact stored docvars keyed by doc_id, or NULL.

  • doc_data: stored sidecar document data, or NULL.

  • hyperparameters: standardized topic-model hyperparameters. Use get_topic_hyperparameters() for a stable tabular accessor.

  • backend_control: sanitized backend-native model, fit, and optimizer controls after defaults and package-level normalization are applied.

  • call: matched function call.

Users access these components with $, for example fit$dtw or fit$model_object.
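
To show how the documented matrix layout can be used, the sketch below builds a hand-made stand-in list with dtw and tww fields in the described shape and reads from it with $. The stand-in is not produced by fit_topic_model(), and all numbers are invented.

```r
# Hand-made stand-in mimicking the documented dtw/tww layout (toy values).
mock_fit <- list(
  dtw = matrix(
    c(0.9, 0.1,
      0.2, 0.8),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("doc1", "doc2"), c("Topic001", "Topic002"))
  ),
  tww = matrix(
    c(0.7, 0.2, 0.1,
      0.1, 0.3, 0.6),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("Topic001", "Topic002"), c("term1", "term2", "term3"))
  )
)

# Highest-weighted term for each topic (rows of tww are topics).
top_term <- colnames(mock_fit$tww)[apply(mock_fit$tww, 1, which.max)]
names(top_term) <- rownames(mock_fit$tww)
top_term

# Highest-weighted topic for each document (rows of dtw are documents).
top_topic <- colnames(mock_fit$dtw)[apply(mock_fit$dtw, 1, which.max)]
top_topic
```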

Details

fit_topic_model() standardizes model fitting while preserving the original backend object in model_object. That design avoids brittle inheritance across R6, S4, and list-based classes while still providing a stable package interface for downstream helpers such as get_dtw(), get_tww(), predict_topic_model(), get_top_terms(), and plot_dtw().

The standardized DTW/TWW outputs always use topic identifiers of the form Topic001, Topic002, and so on, regardless of backend-specific naming.
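
Identifiers of this form can be generated in base R with sprintf(). This is only one way to produce them, shown for users who need to build matching labels themselves; the source does not show how the package creates them internally.

```r
# Zero-padded, three-digit topic labels for k topics.
k <- 12
topic_ids <- sprintf("Topic%03d", seq_len(k))
head(topic_ids, 3)
topic_ids[k]
```

Zero padding keeps alphabetical sorting consistent with numeric topic order for up to 999 topics.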

Stored docvars and doc_data are used only for downstream alignment and enrichment. They are never passed to the backend estimator itself.

ETM requires control$model$embeddings, supplied either as a single integer embedding dimension or as a pretrained embedding matrix. When learned embeddings are requested, vocab defaults to the input terms unless explicitly supplied.

When pretrained embeddings are supplied, the input vocabulary is aligned to the embedding rownames; unmatched terms and any documents that become empty after alignment are dropped with a warning while preserving surviving doc_id, docvars, and doc_data alignment.

Using engine = "topicmodels.etm" also requires both the topicmodels.etm package and a working torch backend. Installing the R torch package is not sufficient by itself on a clean machine; run torch::install_torch() and confirm that torch::torch_is_installed() returns TRUE.
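
The alignment of a vocabulary to pretrained embedding rownames described above can be pictured with a small base R sketch. It uses toy dense matrices and invented object names, and it only mirrors the documented behaviour (drop unmatched terms, then drop empty documents with a warning); the package's actual implementation may differ.

```r
# Toy document-term matrix: 3 documents, 3 terms.
toy_dtm <- matrix(
  c(1, 0, 2,
    0, 0, 1,
    3, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("apple", "pear", "zebra"))
)

# Toy pretrained embeddings: rownames are the embedding vocabulary.
toy_emb <- matrix(
  seq_len(8) / 10, nrow = 4,
  dimnames = list(c("apple", "pear", "plum", "kiwi"), NULL)
)

# Keep only terms that have an embedding row ("zebra" has none).
shared <- intersect(colnames(toy_dtm), rownames(toy_emb))
dtm_aligned <- toy_dtm[, shared, drop = FALSE]
emb_aligned <- toy_emb[shared, , drop = FALSE]

# Drop documents left with no terms after alignment, with a warning.
keep <- rowSums(dtm_aligned) > 0
if (any(!keep)) {
  warning("Dropping empty documents: ",
          paste(rownames(dtm_aligned)[!keep], collapse = ", "))
}
dtm_aligned <- dtm_aligned[keep, , drop = FALSE]
rownames(dtm_aligned)
```

Here doc2 contains only the unmatched term, so it is the document that is dropped.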

STM support covers prevalence covariates but not content covariates. Pass prevalence formulas and metadata through control$fit$prevalence and control$fit$data. If x is a dfm with document variables, those docvars are used as STM metadata when control$fit$data is omitted. STM content covariates are not supported because they imply covariate-specific topic-word distributions, while NLPstudio currently standardizes one TWW matrix per fit. For STM interpretation after fitting, use get_stm_topic_labels(), summarize_stm_topics(), and estimate_stm_topic_effects().
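
A prevalence specification along these lines might be assembled as follows. This is a sketch under stated assumptions: the metadata, the group variable, and the object names are invented, and the rows of the metadata are assumed to be in the same order as the documents in x.

```r
# Toy metadata, one row per document (values invented for illustration).
meta <- data.frame(
  group = factor(c("a", "a", "b", "b"))
)

# Prevalence formula and metadata go under control$fit, as described above.
stm_control <- list(
  fit = list(
    prevalence = ~ group,
    data       = meta,
    seed       = 1,
    max.em.its = 5,
    verbose    = FALSE
  )
)

# Every variable in the formula should be a column of the metadata.
all(all.vars(stm_control$fit$prevalence) %in% names(stm_control$fit$data))
```

The list would then be supplied as control = stm_control together with engine = "stm" and model = "stm".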

The API currently covers these model families and fitting algorithms:

  • LDA via text2vec, topicmodels, and seededlda

  • CTM via topicmodels

  • WarpLDA as the text2vec estimation algorithm for LDA

  • Sequential LDA via seededlda

  • Seeded LDA via seededlda

  • Embedded Topic Models via topicmodels.etm

  • Structural Topic Models with prevalence covariates via stm

References

Lewis, C. M., & Grossetti, F. (2022). A statistical approach for optimal topic model identification. Journal of Machine Learning Research, 23(58), 1-20.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.

Blei, D. M., & Lafferty, J. D. (2006). Correlated topic models. Advances in Neural Information Processing Systems, 18, 147.

Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of Science. The Annals of Applied Statistics, 1(1), 17-35.

Chen, J., Li, K., Zhu, J., & Chen, W. (2016). WarpLDA: A Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation. Proceedings of the VLDB Endowment, 9(10), 744-755.

Dieng, A. B., Ruiz, F. J. R., & Blei, D. M. (2020). Topic Modeling in Embedding Spaces. Transactions of the Association for Computational Linguistics, 8, 439-453.

Du, L., Buntine, W. L., Jin, H., & Chen, C. (2012). Sequential latent Dirichlet allocation. Knowledge and Information Systems, 31(3), 475-503.

Lu, B., Ott, M., Cardie, C., & Tsou, B. K. (2011). Multi-aspect sentiment analysis with topic models. In 2011 IEEE 11th International Conference on Data Mining Workshops, 81-88.

Jagarlamudi, J., Daume III, H., & Udupa, R. (2012). Incorporating lexical priors into topic models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 204-213.

Watanabe, K., & Zhou, Y. (2022). Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches. Social Science Computer Review, 40(2), 346-366. DOI: 10.1177/0894439320907027.

Watanabe, K., & Baturo, A. (2024). Seeded Sequential LDA: A Semi-Supervised Algorithm for Topic-Specific Analysis of Sentences. Social Science Computer Review, 42(1), 224-248. DOI: 10.1177/08944393231178605.

Examples

dtm <- methods::as(
  Matrix::Matrix(
    matrix(
      c(1, 0, 1,
        1, 1, 0,
        0, 1, 1,
        1, 1, 1),
      nrow = 4,
      byrow = TRUE
    ),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:4)
colnames(dtm) <- paste0("term", 1:3)

fit <- fit_topic_model(
  dtm,
  engine = "text2vec",
  model = "lda",
  k = 2,
  control = list(
    model = list(doc_topic_prior = 0.1, topic_word_prior = 0.01),
    fit = list(n_iter = 25, progressbar = FALSE)
  )
)

class(fit)
#> [1] "nlp_topic_fit" "list"         
names(fit)
#>  [1] "engine"          "model"           "method"          "model_object"   
#>  [5] "dtw"             "tww"             "doc_ids"         "vocab"          
#>  [9] "docvars"         "doc_data"        "hyperparameters" "backend_control"
#> [13] "call"           

if (requireNamespace("topicmodels", quietly = TRUE)) {
  fit_topic_model(
    dtm,
    engine = "topicmodels",
    model = "lda",
    k = 2,
    control = list(
      fit = list(seed = 1, em = list(iter.max = 5), var = list(iter.max = 5))
    )
  )

  fit_topic_model(
    dtm,
    engine = "topicmodels",
    model = "ctm",
    k = 2,
    control = list(
      fit = list(seed = 1, em = list(iter.max = 5), var = list(iter.max = 5))
    )
  )
}
#> <nlp_topic_fit>
#>   engine: topicmodels
#>   model: ctm (VEM)
#>   documents: 4
#>   topics: 2
#>   terms: 3
#>   cached DTW: TRUE
#>   cached TWW: TRUE
#>   stored docvars: TRUE
#>   stored doc_data: FALSE

if (requireNamespace("seededlda", quietly = TRUE)) {
  fit_topic_model(
    dtm,
    engine = "seededlda",
    model = "lda",
    k = 2,
    control = list(fit = list(max_iter = 100, verbose = FALSE))
  )

  suppressWarnings(
    fit_topic_model(
      dtm,
      engine = "seededlda",
      model = "seqlda",
      k = 2,
      control = list(fit = list(max_iter = 100, verbose = FALSE))
    )
  )

  dict <- quanteda::dictionary(list(
    topic_a = c("term1", "term2"),
    topic_b = c("term3")
  ))

  fit_topic_model(
    dtm,
    engine = "seededlda",
    model = "seededlda",
    dictionary = dict,
    control = list(fit = list(max_iter = 100, verbose = FALSE))
  )
}
#> <nlp_topic_fit>
#>   engine: seededlda
#>   model: seededlda
#>   documents: 4
#>   topics: 2
#>   terms: 3
#>   cached DTW: TRUE
#>   cached TWW: TRUE
#>   stored docvars: TRUE
#>   stored doc_data: FALSE

# ETM examples run only when topicmodels.etm and a working torch backend are available.
if (requireNamespace("topicmodels.etm", quietly = TRUE) && requireNamespace("torch", quietly = TRUE) && torch::torch_is_installed()) {
fit_topic_model(
  dtm,
  engine = "topicmodels.etm",
  model = "etm",
  k = 2,
  control = list(
    model = list(embeddings = 5),
    fit = list(epoch = 5, batch_size = 2, normalize = TRUE),
    optimizer = list(lr = 0.005, weight_decay = 1.2e-06)
  )
)

embeddings <- matrix(
  seq_len(ncol(dtm) * 4),
  nrow = ncol(dtm),
  ncol = 4,
  dimnames = list(colnames(dtm), NULL)
)

fit_topic_model(
  dtm,
  engine = "topicmodels.etm",
  model = "etm",
  k = 2,
  control = list(
    model = list(embeddings = embeddings),
    fit = list(epoch = 5, batch_size = 2)
  )
)
}