
Repeatedly fit the same topic-model specification across multiple seeds and compare the resulting topics after label matching. The function is a transparent wrapper around fit_topic_model(): unless resampling is supplied, each run uses the same data, backend, model, topic count, method, controls, and additional arguments, changing only the random seed.

Usage

assess_topic_stability(
  x,
  engine = NULL,
  model = NULL,
  k = NULL,
  seeds = NULL,
  method = NULL,
  control = list(),
  resampling = NULL,
  ncores = 1L,
  return_fits = FALSE,
  ...
)

Arguments

x

Either a document-feature input accepted by fit_topic_model(), or a list of pre-fitted nlp_topic_fit objects.

engine

Backend package forwarded to fit_topic_model() when x is model input.

model

Model family forwarded to fit_topic_model() when x is model input.

k

Number of topics forwarded to fit_topic_model() when x is model input.

seeds

Integer vector of seeds. In repeated-fit mode this is required and must contain at least two unique integer seeds. In list-of-fits mode it is optional and, when supplied, must match the number of fits.

method

Fitting method forwarded to fit_topic_model().

control

Backend controls forwarded to fit_topic_model(). For engine = "topicmodels" or engine = "stm", each seed is also written to control$fit$seed before fitting so backend-native seeding is explicit.

resampling

Optional list with an element fraction, a number in (0, 1]. When supplied, each run also draws that fraction of documents without replacement before fitting, using that run's seed. Defaults to NULL, which means no resampling.

ncores

Integer. Number of PSOCK workers for repeated fitting. Defaults to 1L.

return_fits

Logical. Should fitted models be attached as attr(result, "fits")? Defaults to FALSE.

...

Additional arguments forwarded to fit_topic_model() in repeated-fit mode.
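The per-run setup described under control and resampling can be pictured with a small base-R sketch. The helpers with_seed_control() and subsample_docs() are hypothetical names used only for illustration, and the rounding rule for the number of documents kept is an assumption; neither is confirmed to match the package internals.

```r
# Hypothetical helper: write a seed into control$fit$seed while keeping
# any other backend controls the caller supplied.
with_seed_control <- function(control, seed) {
  utils::modifyList(control, list(fit = list(seed = seed)))
}

# Hypothetical helper: draw a fraction of documents (rows) without
# replacement. Rounding down with a floor of one row is an assumption.
subsample_docs <- function(x, fraction, seed) {
  stopifnot(fraction > 0, fraction <= 1)
  set.seed(seed)
  n_keep <- max(1L, floor(fraction * nrow(x)))
  keep <- sort(sample.int(nrow(x), n_keep))
  x[keep, , drop = FALSE]
}

# Toy document-feature matrix: 6 documents, 4 terms.
m <- matrix(1:24, nrow = 6,
            dimnames = list(paste0("doc", 1:6), paste0("term", 1:4)))

ctl <- with_seed_control(list(fit = list(iter = 50)), seed = 7L)
sub <- subsample_docs(m, fraction = 0.5, seed = 7L)
```

Here ctl$fit keeps iter and gains seed, and sub holds three of the six documents. Reusing the same seed for both the subsample and the fit makes each run reproducible on its own.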

Value

An object of class c("nlp_topic_stability", "data.table") with one row per reference-topic/run comparison. Topics from all non-reference runs are matched to the first run. Columns include run metadata, matched topic IDs, cosine similarity, per-topic stability, per-run stability, aggregate stability, and model metadata.

Details

Topic labels are arbitrary across model runs. assess_topic_stability() therefore extracts standardized topic-word weights with get_tww(), aligns vocabularies, matches topics from each run to the first run, and reports matched-topic cosine similarities.
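The alignment and matching steps above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the topic-word matrices are made-up stand-ins for get_tww() output, the helper names are hypothetical, greedy one-to-one matching is assumed because the matching rule is not specified here, and the mean of matched similarities is only a stand-in for the aggregate score.

```r
# Made-up topic-word weights (topics in rows, terms in columns).
ref <- rbind(
  topic1 = c(term1 = 0.6, term2 = 0.3, term3 = 0.1),
  topic2 = c(term1 = 0.1, term2 = 0.2, term3 = 0.7)
)
run <- rbind(
  topic1 = c(term2 = 0.2, term3 = 0.6, term4 = 0.2),
  topic2 = c(term1 = 0.7, term2 = 0.3, term4 = 0.0)
)

# Align vocabularies: expand each matrix to the union of terms,
# filling terms a run never saw with zero weight.
align_vocab <- function(m, vocab) {
  out <- matrix(0, nrow = nrow(m), ncol = length(vocab),
                dimnames = list(rownames(m), vocab))
  out[, colnames(m)] <- m
  out
}
vocab <- union(colnames(ref), colnames(run))
ref_a <- align_vocab(ref, vocab)
run_a <- align_vocab(run, vocab)

# Cosine similarity between every reference topic and every run topic.
row_norm <- function(m) m / sqrt(rowSums(m^2))
sim <- row_norm(ref_a) %*% t(row_norm(run_a))

# Greedy one-to-one matching (an assumption): repeatedly take the most
# similar unmatched pair, then retire that row and column.
match_greedy <- function(sim) {
  matched <- integer(nrow(sim))
  s <- sim
  for (i in seq_len(min(dim(sim)))) {
    idx <- which(s == max(s, na.rm = TRUE), arr.ind = TRUE)[1, ]
    matched[idx[["row"]]] <- idx[["col"]]
    s[idx[["row"]], ] <- NA
    s[, idx[["col"]]] <- NA
  }
  matched
}

matched <- match_greedy(sim)
matched_sim <- sim[cbind(seq_along(matched), matched)]
aggregate <- mean(matched_sim)  # stand-in summary, not the package's definition
```

In this toy case reference topic 1 is matched to run topic 2 and reference topic 2 to run topic 1, which is the label swap that makes matching necessary before similarities are compared.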

Examples

# Small document-feature matrix: 6 documents by 4 terms.
dtm <- methods::as(
  Matrix::Matrix(
    matrix(c(2, 1, 0, 0,  1, 1, 1, 0,  0, 1, 2, 1,
             0, 0, 1, 2,  1, 0, 1, 1,  1, 2, 0, 1),
           nrow = 6, byrow = TRUE),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:6)
colnames(dtm) <- paste0("term", 1:4)

# Repeated-fit mode: fit the same specification once per seed.
assess_topic_stability(
  dtm,
  engine = "topicmodels",
  model = "lda",
  k = 2,
  method = "Gibbs",
  seeds = 1:2,
  control = list(fit = list(iter = 50, burnin = 0, thin = 1))
)
#> <nlp_topic_stability>
#>   K: 2
#>   runs compared: 1
#>   aggregate stability: 0.5196

# List-of-fits mode: fit the models yourself, then compare them.
fits <- lapply(1:2, function(s) {
  fit_topic_model(
    dtm,
    engine = "topicmodels",
    model = "lda",
    k = 2,
    method = "Gibbs",
    control = list(fit = list(seed = s, iter = 50, burnin = 0, thin = 1))
  )
})
assess_topic_stability(fits, seeds = 1:2)
#> <nlp_topic_stability>
#>   K: 2
#>   runs compared: 1
#>   aggregate stability: 0.5196