Fit a topic model for each value in a grid of topic counts and evaluate each
fit with evaluate_topic_model(). The result provides the information needed
to compare candidate values of \(K\) on multiple quality metrics
simultaneously.
Usage
select_k_topics(
  x,
  engine,
  model,
  k_grid = 5:15,
  metrics = c("coherence_npmi", "coherence_umass", "diversity", "exclusivity",
    "held_out_nll", "held_out_perplexity", "train_nll", "train_perplexity"),
  level = c("aggregate", "topic", "all"),
  control = list(),
  holdout = 0.2,
  ncores = 1L,
  seed = NULL,
  stability_seeds = NULL,
  stability_resampling = NULL,
  stability_ncores = ncores,
  return_fits = FALSE,
  top_n = 10L,
  epsilon = 1e-12,
  method = NULL,
  ...
)
Arguments
- x
Document-feature input for fitting. Accepted classes are dgCMatrix, dfm, and DocumentTermMatrix. A corpus is not accepted; convert it to a document-feature matrix first, e.g. with quanteda::dfm().
- engine
Backend package. Forwarded to fit_topic_model().
- model
Model family. Forwarded to fit_topic_model().
- k_grid
Integer vector of topic counts \(K\) to evaluate. Defaults to 5:15.
- metrics
Character vector of metrics to compute for each candidate \(K\). Defaults to all eight metrics supported by evaluate_topic_model().
- level
Reporting level forwarded to evaluate_topic_model(). One of "aggregate" (default), "topic", or "all".
- control
Named list of backend controls forwarded to fit_topic_model() for every candidate \(K\). Defaults to list().
- holdout
Fraction of documents held out for the held_out_nll and held_out_perplexity metrics. Must be in [0, 1). Defaults to 0.2. When holdout > 0, the remaining fraction is used as the training set for coherence and training likelihood metrics. When holdout = 0, coherence and training likelihood metrics are computed on the full fitting input, and held-out metrics are marked unsupported because no held-out data is available. If none of "held_out_nll", "held_out_perplexity", "train_nll", "train_perplexity", "coherence_npmi", or "coherence_umass" is in metrics, the holdout split is skipped and the full x is used for fitting.
- ncores
Number of parallel workers. Defaults to 1L (sequential). Each candidate \(K\) is fit independently, so up to length(k_grid) workers can be kept busy; workers beyond that number have nothing to do. Uses "PSOCK" sockets; "FORK" is not used, to preserve quanteda/C++ stability.
- seed
Integer vector of length length(k_grid) used to seed each candidate \(K\)'s fit reproducibly. If a single integer is supplied, it is expanded to a vector of length length(k_grid) starting from that value. NULL means no seeding. Defaults to NULL.
- stability_seeds
Optional integer vector of seeds used to assess topic stability for each candidate \(K\). When NULL (default), no stability runs are performed and the output is unchanged. When supplied, each \(K\) is refit across these seeds via assess_topic_stability().
- stability_resampling
Optional resampling settings forwarded to assess_topic_stability(). Defaults to NULL.
- stability_ncores
Integer. Number of workers used inside each stability assessment when the outer K grid is sequential. Defaults to ncores. Nested parallelism is avoided when ncores > 1.
- return_fits
Logical. Should the fitted models be returned as an attribute of the result? Defaults to FALSE. Fits can be large; set to TRUE only when you need to inspect or reuse them.
- top_n
Integer. Forwarded to evaluate_topic_model(). Defaults to 10L.
- epsilon
Numeric. Forwarded to evaluate_topic_model(). Defaults to 1e-12.
- method
Fitting method forwarded to fit_topic_model(). Defaults to NULL.
- ...
Additional arguments forwarded to fit_topic_model().
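The scalar-seed expansion described under seed can be pictured with a small sketch. The helper expand_seed() below is hypothetical and not part of the package. It assumes the expanded seeds are consecutive integers, which the text above ("starting from that value") suggests but does not spell out.

```r
# Hypothetical helper, for illustration only: expand a scalar seed into one
# seed per candidate K. Assumption: seeds increase by 1 from the start value.
expand_seed <- function(seed, k_grid) {
  if (is.null(seed)) {
    return(NULL)                               # NULL means no seeding
  }
  if (length(seed) == 1L) {
    return(seq.int(from = as.integer(seed), length.out = length(k_grid)))
  }
  stopifnot(length(seed) == length(k_grid))    # otherwise one seed per K
  as.integer(seed)
}

k_grid <- 5:15
seeds <- expand_seed(42L, k_grid)              # one seed for each of the 11 K values
```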
Value
An object of class c("nlp_k_selection", "data.table") with
columns:
- k
Topic count \(K\).
- metric
Metric name.
- level
"aggregate" or "topic".
- topic_id
Topic### for topic-level rows or NA for aggregate rows. The column is retained even when only aggregate metrics are requested so selection tables keep a stable long-format schema.
- value
Numeric metric value.
- supported
Logical; TRUE when the metric was computed.
If return_fits = TRUE the fitted models are stored in
attr(result, "fits"), a named list with names "k<value>".
If stability_seeds is supplied, aggregate stability rows are added with
metric = "stability" and full per-topic stability outputs are stored in
attr(result, "stability").
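Because the result is in long format, comparing candidates amounts to filtering the aggregate rows and picking one row per metric. The sketch below works on a mock data.frame that only mimics the documented columns; the values are invented, and the rule that likelihood-based metrics are better when lower is an assumption about how to read them, not something the function returns.

```r
# Mock table with the documented columns; all values are made up.
sel_mock <- data.frame(
  k         = rep(c(5L, 10L, 15L), times = 2),
  metric    = rep(c("coherence_npmi", "held_out_perplexity"), each = 3),
  level     = "aggregate",
  topic_id  = NA_character_,
  value     = c(0.08, 0.12, 0.10, 950, 870, 905),
  supported = TRUE,
  stringsAsFactors = FALSE
)

# Assumption: NLL and perplexity are better when lower, other metrics when higher.
lower_is_better <- function(metric) grepl("nll|perplexity", metric)

# Keep aggregate rows that were actually computed, then pick the best K per metric.
agg <- sel_mock[sel_mock$level == "aggregate" & sel_mock$supported, ]
best <- do.call(rbind, lapply(split(agg, agg$metric), function(d) {
  pick <- if (lower_is_better(d$metric[1])) which.min(d$value) else which.max(d$value)
  d[pick, c("metric", "k", "value")]
}))
print(best)
```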
Details
Holdout split. When holdout > 0 and either predictive or coherence
metrics are requested, x is split at the document level into a training
shard (1 - holdout fraction) and a held-out shard (holdout fraction).
The split is random but reproducible when seed is supplied. The training
shard is passed to fit_topic_model() and to evaluate_topic_model() for
coherence and training likelihood metrics; the held-out shard is passed to
evaluate_topic_model() for held-out metrics. With holdout = 0, the full
x is used for fitting, coherence, and training likelihood metrics, while
held-out metrics are reported as unsupported.
A warning is issued when the number of documents is fewer than 50, because the holdout shard may be too small for stable predictive metrics.
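A document-level split of this kind can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's internal code: the rounding rule (floor) and the use of sample.int() are guesses, and a dense count matrix stands in for the sparse input.

```r
set.seed(42)                       # makes the random split reproducible

n_docs  <- 100L
holdout <- 0.2

# Toy document-term counts: 100 documents, 8 terms.
x <- matrix(
  rpois(n_docs * 8L, lambda = 1),
  nrow = n_docs,
  dimnames = list(paste0("doc", seq_len(n_docs)), paste0("term", 1:8))
)

# Assumption: the held-out shard size is rounded down.
n_held    <- floor(holdout * n_docs)
held_idx  <- sample.int(n_docs, n_held)
train_idx <- setdiff(seq_len(n_docs), held_idx)

x_train <- x[train_idx, , drop = FALSE]   # used for fitting and training-side metrics
x_held  <- x[held_idx, , drop = FALSE]    # used only for held-out metrics
```

Using setdiff() rather than negative indexing keeps the training shard intact when the held-out shard happens to be empty.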
Parallelisation. Uses "PSOCK" sockets. Each worker receives its own
\(K\) value and seed and runs the full fit + evaluate cycle independently.
The ncores = 1 path bypasses cluster creation entirely and runs
sequentially.
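The dispatch pattern described here can be sketched with the parallel package. The names run_grid() and toy_worker() are hypothetical, and the toy worker stands in for the real fit + evaluate cycle; only makePSOCKcluster(), parLapply(), and stopCluster() are actual parallel functions.

```r
library(parallel)

# Hypothetical dispatcher: one job per (K, seed) pair.
run_grid <- function(k_grid, seeds, ncores = 1L, worker) {
  jobs <- Map(function(k, s) list(k = k, seed = s), k_grid, seeds)
  if (ncores <= 1L) {
    return(lapply(jobs, worker))                   # sequential path, no cluster created
  }
  cl <- parallel::makePSOCKcluster(ncores)         # socket workers, no forking
  on.exit(parallel::stopCluster(cl), add = TRUE)   # always shut the workers down
  parallel::parLapply(cl, jobs, worker)
}

# Toy stand-in for "fit a model with K topics, then evaluate it".
toy_worker <- function(job) {
  set.seed(job$seed)
  list(k = job$k, value = mean(runif(job$k)))
}

seq_res <- run_grid(2:4, seeds = 1:3, ncores = 1L, worker = toy_worker)
par_res <- run_grid(2:4, seeds = 1:3, ncores = 2L, worker = toy_worker)
```

Because each job carries its own seed, the sequential and parallel paths in this sketch produce the same values for a given grid.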
Examples
dtm <- methods::as(
  Matrix::Matrix(
    matrix(c(2, 1, 0, 0, 1, 1, 1, 0, 0, 1, 2, 1,
             0, 0, 1, 2, 1, 0, 1, 1, 1, 2, 0, 1),
           nrow = 6, byrow = TRUE),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:6)
colnames(dtm) <- paste0("term", 1:4)
sel <- select_k_topics(
  dtm, engine = "text2vec", model = "lda",
  k_grid = 2:3,
  metrics = c("diversity", "exclusivity"),
  holdout = 0,
  seed = 42L,
  control = list(fit = list(n_iter = 25, progressbar = FALSE))
)
print(sel)
#> <nlp_k_selection>
#> K grid: 2, 3
#> metrics: diversity, exclusivity
#>
#> Best K per metric (aggregate level):
#> diversity K = 2 (0.5)
#> exclusivity K = 2 (0.5)
