Assess Topic Stability Across Repeated Fits — assess_topic_stability

Repeatedly fit the same topic-model specification across multiple seeds and compare the resulting topics after label matching. The function is a transparent wrapper around fit_topic_model(): unless resampling is supplied, each run uses the same data, backend, model, topic count, method, controls, and additional arguments, changing only the random seed.

Usage

assess_topic_stability(
  x,
  engine = NULL,
  model = NULL,
  k = NULL,
  seeds = NULL,
  method = NULL,
  control = list(),
  resampling = NULL,
  ncores = 1L,
  return_fits = FALSE,
  initial_fit = NULL,
  ...
)

Arguments

x: Either a document-feature input accepted by fit_topic_model(), or a list of pre-fitted nlp_topic_fit objects.
engine: Backend package forwarded to fit_topic_model() when x is model input.
model: Model family forwarded to fit_topic_model() when x is model input.
k: Number of topics forwarded to fit_topic_model() when x is model input.
seeds: Integer vector of seeds. In repeated-fit mode this is required and must contain at least two unique integer seeds. In list-of-fits mode it is optional and, when supplied, must match the number of fits.
method: Fitting method forwarded to fit_topic_model().
control: Backend controls forwarded to fit_topic_model(). For engine = "topicmodels" or engine = "stm", each seed is also written to control$fit$seed before fitting so backend-native seeding is explicit.
resampling: Optional list with an element fraction, a number in (0, 1]. When supplied, each seed also draws that fraction of documents without replacement before fitting. Defaults to NULL, which means no resampling.
ncores: Integer. Number of PSOCK workers for repeated fitting. Defaults to 1L.
return_fits: Logical. Should fitted models be attached as attr(result, "fits")? Defaults to FALSE.
initial_fit: Optional nlp_topic_fit used as the run for the first seed in repeated-fit mode, which skips one refit. Intended for grid search (see the stability_reuse_fit argument of select_k_topics()): the already-fitted evaluation model becomes the reference run, so results differ from refitting the first seed from scratch. The supplied fit must match engine, model, and k. Ignored in list-of-fits mode. Defaults to NULL.
...: Additional arguments forwarded to fit_topic_model() in repeated-fit mode.
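
The resampling argument above draws a fraction of documents without replacement for each seed. A minimal base R sketch of that idea follows. subsample_documents() is a hypothetical helper, not a package function, and the rounding rule (floor, keeping at least one document) is an assumption rather than documented package behavior.

```r
# Hypothetical helper, not part of the package: draw a fraction of documents
# (rows) without replacement, reproducibly for a given seed.
subsample_documents <- function(x, fraction, seed) {
  stopifnot(fraction > 0, fraction <= 1)
  set.seed(seed)
  # Assumption: round down, but always keep at least one document.
  n_keep <- max(1, floor(fraction * nrow(x)))
  keep <- sort(sample.int(nrow(x), size = n_keep, replace = FALSE))
  x[keep, , drop = FALSE]
}

# Toy count matrix with 6 documents and 4 terms.
counts <- matrix(
  1:24, nrow = 6,
  dimnames = list(paste0("doc", 1:6), paste0("term", 1:4))
)

# The same seed yields the same subsample; a different seed may not.
sub_a <- subsample_documents(counts, fraction = 0.5, seed = 1)
sub_b <- subsample_documents(counts, fraction = 0.5, seed = 1)
identical(sub_a, sub_b)
```

With resampling, differences between runs reflect both the change of seed and the change of documents, so stability values are usually not directly comparable to those from runs without resampling.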

Value

An object of class c("nlp_topic_stability", "data.table") with one row per reference-topic/run comparison. The first run serves as the reference, and topics from every other run are matched to it. Columns include run metadata, matched topic IDs, cosine similarity, per-topic stability, per-run stability, aggregate stability, and model metadata.

Details

Topic labels are arbitrary across model runs. assess_topic_stability() therefore extracts standardized topic-word weights with get_tww(), aligns vocabularies, matches topics from each run to the first run, and reports matched-topic cosine similarities.
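
The alignment and matching steps described above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: align_vocab(), cosine_matrix(), and match_topics() are hypothetical helpers, the topic-word matrices are made up, and the greedy one-to-one matching is a stand-in for whatever matching rule the package actually applies.

```r
# Hypothetical helpers; not the package's implementation.

# Expand a topic-by-term matrix to a shared vocabulary; absent terms get weight 0.
align_vocab <- function(m, vocab) {
  out <- matrix(0, nrow = nrow(m), ncol = length(vocab),
                dimnames = list(rownames(m), vocab))
  out[, colnames(m)] <- m
  out
}

# Cosine similarity between every row of a and every row of b.
cosine_matrix <- function(a, b) {
  a_unit <- a / sqrt(rowSums(a^2))
  b_unit <- b / sqrt(rowSums(b^2))
  a_unit %*% t(b_unit)
}

# Greedy one-to-one matching (an assumption): repeatedly take the most
# similar pair among the topics that are still unmatched.
match_topics <- function(sim) {
  k <- nrow(sim)
  matched <- integer(k)
  similarity <- numeric(k)
  work <- sim
  for (i in seq_len(k)) {
    best <- which(work == max(work, na.rm = TRUE), arr.ind = TRUE)[1, ]
    matched[best[["row"]]] <- best[["col"]]
    similarity[best[["row"]]] <- sim[best[["row"]], best[["col"]]]
    work[best[["row"]], ] <- NA  # reference topic is now taken
    work[, best[["col"]]] <- NA  # run topic is now taken
  }
  data.frame(reference_topic = seq_len(k),
             matched_topic = matched,
             cosine_similarity = similarity)
}

# Made-up topic-word weights: the second run has swapped labels,
# a different term order, and one extra term.
reference <- matrix(c(0.7, 0.2, 0.1, 0.0,
                      0.0, 0.1, 0.3, 0.6),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("topic1", "topic2"), paste0("term", 1:4)))
run2 <- matrix(c(0.5, 0.3, 0.1, 0.0, 0.1,
                 0.0, 0.1, 0.2, 0.7, 0.0),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("topic1", "topic2"),
                               c("term4", "term3", "term2", "term1", "term5")))

vocab <- union(colnames(reference), colnames(run2))
sim <- cosine_matrix(align_vocab(reference, vocab), align_vocab(run2, vocab))
matches <- match_topics(sim)
matches
```

In this toy case reference topic 1 is paired with topic 2 of the second run and reference topic 2 with topic 1, which is the label swap the matching step is meant to undo. Greedy matching is simple but can be suboptimal for larger k; an optimal assignment method is a common alternative.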

Examples

# Toy document-term matrix: 6 documents (rows) by 4 terms (columns).
dtm <- methods::as(
  Matrix::Matrix(
    matrix(c(2, 1, 0, 0,  1, 1, 1, 0,  0, 1, 2, 1,
             0, 0, 1, 2,  1, 0, 1, 1,  1, 2, 0, 1),
           nrow = 6, byrow = TRUE),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:6)
colnames(dtm) <- paste0("term", 1:4)

# Repeated-fit mode: refit the same specification once per seed.
assess_topic_stability(
  dtm,
  engine = "topicmodels",
  model = "lda",
  k = 2,
  method = "Gibbs",
  seeds = 1:2,
  control = list(fit = list(iter = 50, burnin = 0, thin = 1))
)
#> <nlp_topic_stability>
#>   K: 2
#>   runs compared: 1
#>   aggregate stability: 0.5196

# List-of-fits mode: fit the models yourself, then compare them.
fits <- lapply(1:2, function(s) {
  fit_topic_model(
    dtm,
    engine = "topicmodels",
    model = "lda",
    k = 2,
    method = "Gibbs",
    control = list(fit = list(seed = s, iter = 50, burnin = 0, thin = 1))
  )
})
assess_topic_stability(fits, seeds = 1:2)
#> <nlp_topic_stability>
#>   K: 2
#>   runs compared: 1
#>   aggregate stability: 0.5196