Repeatedly fit the same topic-model specification across multiple seeds and
compare the resulting topics after label matching. The function is a
transparent wrapper around fit_topic_model(): unless resampling is
supplied, each run uses the same data, backend, model, topic count, method,
controls, and additional arguments, changing only the random seed.
Usage
assess_topic_stability(
  x,
  engine = NULL,
  model = NULL,
  k = NULL,
  seeds = NULL,
  method = NULL,
  control = list(),
  resampling = NULL,
  ncores = 1L,
  return_fits = FALSE,
  ...
)
Arguments
- x
Either a document-feature input accepted by fit_topic_model(), or a list of pre-fitted nlp_topic_fit objects.
- engine
Backend package forwarded to fit_topic_model() when x is model input.
- model
Model family forwarded to fit_topic_model() when x is model input.
- k
Number of topics forwarded to fit_topic_model() when x is model input.
- seeds
Integer vector of seeds. In repeated-fit mode this is required and must contain at least two unique integer seeds. In list-of-fits mode it is optional and, when supplied, must match the number of fits.
- method
Fitting method forwarded to fit_topic_model().
- control
Backend controls forwarded to fit_topic_model(). For engine = "topicmodels" or engine = "stm", each seed is also written to control$fit$seed before fitting so backend-native seeding is explicit.
- resampling
Optional list with fraction, a number in (0, 1]. When supplied, each seed also draws that fraction of documents without replacement before fitting. Defaults to NULL, which means no resampling.
- ncores
Integer. Number of PSOCK workers for repeated fitting. Defaults to 1L.
- return_fits
Logical. Should fitted models be attached as attr(result, "fits")? Defaults to FALSE.
- ...
Additional arguments forwarded to fit_topic_model() in repeated-fit mode.
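The resampling argument can be pictured with a small base R sketch. This is an illustration only, not the package's internal code: the helper name resample_documents(), the use of set.seed(), and rounding the document count down with floor() are all assumptions.

```r
# Hypothetical helper (not part of the package): keep a fraction of the
# documents (rows), drawn without replacement, for a given seed.
resample_documents <- function(x, fraction, seed) {
  stopifnot(is.numeric(fraction), fraction > 0, fraction <= 1)
  set.seed(seed)  # assumption: the draw is tied to the run's seed
  # Assumption: round the number of kept documents down, keeping at least one.
  n_keep <- max(1L, as.integer(floor(fraction * nrow(x))))
  keep <- sort(sample.int(nrow(x), size = n_keep, replace = FALSE))
  x[keep, , drop = FALSE]
}

# Toy document-feature matrix with six documents and four terms.
toy_dtm <- matrix(
  c(2, 1, 0, 0, 1, 1, 1, 0, 0, 1, 2, 1,
    0, 0, 1, 2, 1, 0, 1, 1, 1, 2, 0, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("doc", 1:6), paste0("term", 1:4))
)

sub_a <- resample_documents(toy_dtm, fraction = 0.75, seed = 1)
sub_b <- resample_documents(toy_dtm, fraction = 0.75, seed = 1)
```

Under these assumptions each seed yields its own subset of documents, and reusing a seed reproduces the same subset.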
Value
An object of class c("nlp_topic_stability", "data.table") with one
row per reference-topic/run comparison. Topics from all non-reference runs
are matched to the first run. Columns include run metadata, matched topic
IDs, cosine similarity, per-topic stability, per-run stability, aggregate
stability, and model metadata.
Details
Topic labels are arbitrary across model runs. assess_topic_stability()
therefore extracts standardized topic-word weights with get_tww(), aligns
vocabularies, matches topics from each run to the first run, and reports
matched-topic cosine similarities.
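The matching step can be sketched in base R. The sketch assumes two topic-word matrices that already share a vocabulary and uses a simple greedy one-to-one match. The matrices are invented for illustration, the helper names cosine_matrix() and match_topics_greedy() are hypothetical, and the package may use a different matching rule.

```r
# Hypothetical topic-word weights (rows = topics, columns = terms).
ref <- matrix(c(0.7, 0.2, 0.1, 0.0,
                0.0, 0.1, 0.3, 0.6),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("t1", "t2"), paste0("term", 1:4)))
# A second run with similar topics but swapped labels.
run <- matrix(c(0.1, 0.1, 0.2, 0.6,
                0.6, 0.3, 0.1, 0.0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("t1", "t2"), paste0("term", 1:4)))

# Cosine similarity between every reference topic and every run topic.
cosine_matrix <- function(a, b) {
  a_unit <- a / sqrt(rowSums(a^2))
  b_unit <- b / sqrt(rowSums(b^2))
  a_unit %*% t(b_unit)
}

# Greedy one-to-one matching (an assumption): repeatedly take the most
# similar remaining pair, then remove its row and column from play.
match_topics_greedy <- function(sim) {
  matched <- integer(nrow(sim))
  similarity <- numeric(nrow(sim))
  for (step in seq_len(nrow(sim))) {
    best <- which(sim == max(sim, na.rm = TRUE), arr.ind = TRUE)[1, ]
    i <- best[["row"]]
    j <- best[["col"]]
    matched[i] <- j
    similarity[i] <- sim[i, j]
    sim[i, ] <- NA
    sim[, j] <- NA
  }
  data.frame(reference_topic = seq_along(matched),
             matched_topic = matched,
             similarity = similarity)
}

sim <- cosine_matrix(ref, run)
matches <- match_topics_greedy(sim)
```

Here reference topic 1 pairs with run topic 2 and vice versa, which is the label swap that matching is meant to undo before similarities are reported.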
Examples
# Small sparse document-feature matrix: six documents, four terms.
dtm <- methods::as(
  Matrix::Matrix(
    matrix(c(2, 1, 0, 0, 1, 1, 1, 0, 0, 1, 2, 1,
             0, 0, 1, 2, 1, 0, 1, 1, 1, 2, 0, 1),
           nrow = 6, byrow = TRUE),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:6)
colnames(dtm) <- paste0("term", 1:4)
# Repeated-fit mode: the same specification is fitted once per seed.
assess_topic_stability(
  dtm,
  engine = "topicmodels",
  model = "lda",
  k = 2,
  method = "Gibbs",
  seeds = 1:2,
  control = list(fit = list(iter = 50, burnin = 0, thin = 1))
)
#> <nlp_topic_stability>
#> K: 2
#> runs compared: 1
#> aggregate stability: 0.5196
# List-of-fits mode: fit the models first, then compare them.
fits <- lapply(1:2, function(s) {
  fit_topic_model(
    dtm,
    engine = "topicmodels",
    model = "lda",
    k = 2,
    method = "Gibbs",
    control = list(fit = list(seed = s, iter = 50, burnin = 0, thin = 1))
  )
})
assess_topic_stability(fits, seeds = 1:2)
#> <nlp_topic_stability>
#> K: 2
#> runs compared: 1
#> aggregate stability: 0.5196
