Fit a topic model for each value in a grid of topic counts and evaluate each fit with evaluate_topic_model(). The result provides the information needed to compare candidate values of \(K\) on multiple quality metrics simultaneously.

Usage

select_k_topics(
  x,
  engine,
  model,
  k_grid = 5:15,
  metrics = c("coherence_npmi", "coherence_umass", "diversity", "exclusivity",
    "held_out_nll", "held_out_perplexity", "train_nll", "train_perplexity"),
  level = c("aggregate", "topic", "all"),
  control = list(),
  holdout = 0.2,
  ncores = 1L,
  seed = NULL,
  stability_seeds = NULL,
  stability_resampling = NULL,
  stability_ncores = ncores,
  return_fits = FALSE,
  top_n = 10L,
  epsilon = 1e-12,
  method = NULL,
  ...
)

Arguments

x

Document-feature input for fitting. Accepted classes are dgCMatrix, dfm, and DocumentTermMatrix. A corpus is not accepted; convert it to a document-feature matrix first, for example with quanteda::dfm().

engine

Backend package. Forwarded to fit_topic_model().

model

Model family. Forwarded to fit_topic_model().

k_grid

Integer vector of topic counts \(K\) to evaluate. Defaults to 5:15.

metrics

Character vector of metrics to compute for each candidate \(K\). Defaults to all eight metrics supported by evaluate_topic_model().

level

Reporting level forwarded to evaluate_topic_model(). One of "aggregate" (default), "topic", or "all".

control

Named list of backend controls forwarded to fit_topic_model() for every candidate \(K\). Defaults to list().

holdout

Fraction of documents held out for the held_out_nll and held_out_perplexity metrics. Must be in [0, 1). Defaults to 0.2. When holdout > 0, the remaining documents form the training set used for fitting and for the coherence and training likelihood metrics. When holdout = 0, the coherence and training likelihood metrics are computed on the full fitting input, and the held-out metrics are marked unsupported because no held-out data is available. If metrics contains none of "held_out_nll", "held_out_perplexity", "train_nll", "train_perplexity", "coherence_npmi", or "coherence_umass", the holdout split is skipped and the full x is used for fitting.

ncores

Number of parallel workers. Defaults to 1L (sequential). Each candidate \(K\) is fit independently, so at most length(k_grid) workers can be kept busy. Workers run in a "PSOCK" cluster; "FORK" is not used, to preserve quanteda/C++ stability.

seed

Integer vector with one seed per element of k_grid, used to make each candidate \(K\)'s fit reproducible. If a single integer is supplied, it is expanded to a vector of length(k_grid) seeds starting from that value. NULL (the default) means no seeding.

stability_seeds

Optional integer vector of seeds used to assess topic stability for each candidate \(K\). When NULL (default), no stability runs are performed and output is unchanged. When supplied, each \(K\) is refit across these seeds via assess_topic_stability().

stability_resampling

Optional resampling settings forwarded to assess_topic_stability(). Defaults to NULL.

stability_ncores

Integer. Number of workers used inside each stability assessment when the outer \(K\) grid runs sequentially (ncores = 1). Defaults to ncores. When ncores > 1, nested parallelism is avoided.

return_fits

Logical. Should the fitted models be returned as an attribute of the result? Defaults to FALSE. Fits can be large; set TRUE only when you need to inspect or reuse them.

top_n

Integer. Forwarded to evaluate_topic_model(). Defaults to 10L.

epsilon

Numeric. Forwarded to evaluate_topic_model(). Defaults to 1e-12.

method

Fitting method forwarded to fit_topic_model(). Defaults to NULL.

...

Additional arguments forwarded to fit_topic_model().

Value

An object of class c("nlp_k_selection", "data.table") with columns:

k

Topic count \(K\).

metric

Metric name.

level

"aggregate" or "topic".

topic_id

Topic### for topic-level rows or NA for aggregate rows. The column is retained even when only aggregate metrics are requested so selection tables keep a stable long-format schema.

value

Numeric metric value.

supported

Logical; TRUE when the metric was computed.

If return_fits = TRUE the fitted models are stored in attr(result, "fits"), a named list with names "k<value>". If stability_seeds is supplied, aggregate stability rows are added with metric = "stability" and full per-topic stability outputs are stored in attr(result, "stability").
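The long-format schema above lends itself to a side-by-side comparison of candidate \(K\) values. The following is a minimal sketch on a mock table: sel_mock and its values are invented for illustration and are not output of select_k_topics(). It uses only base R, which also works on a data.table because that class inherits from data.frame.

```r
# Mock table with the documented columns; the values are made up.
sel_mock <- data.frame(
  k         = rep(c(2L, 3L), each = 2),
  metric    = rep(c("diversity", "held_out_perplexity"), times = 2),
  level     = "aggregate",
  topic_id  = NA_character_,
  value     = c(0.50, 120, 0.45, 95),
  supported = TRUE,
  stringsAsFactors = FALSE
)

# Keep aggregate rows whose metric was actually computed.
agg <- subset(sel_mock, level == "aggregate" & supported)

# One row per K, one column per metric. Each (k, metric) cell holds a
# single aggregate value, so taking the first element is sufficient.
wide <- with(agg, tapply(value, list(k = k, metric = metric),
                         function(v) v[1]))
print(wide)
```

When reading such a table, keep in mind that the metrics do not share a direction: higher diversity is better, whereas lower perplexity is better.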

Details

Holdout split. When holdout > 0 and either predictive or coherence metrics are requested, x is split at the document level into a training shard (1 - holdout fraction) and a held-out shard (holdout fraction). The split is random but reproducible when seed is supplied. The training shard is passed to fit_topic_model() and to evaluate_topic_model() for coherence and training likelihood metrics; the held-out shard is passed to evaluate_topic_model() for held-out metrics. With holdout = 0, the full x is used for fitting, coherence, and training likelihood metrics, while held-out metrics are reported as unsupported.

A warning is issued when x contains fewer than 50 documents, because the held-out shard may then be too small for stable predictive metrics.
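A document-level split of this kind can be sketched as follows. This only illustrates the idea and is not the package's internal code; the rounding rule for the number of held-out documents and the use of sample.int() are assumptions, and x here is a small dense toy matrix rather than a real document-feature matrix.

```r
set.seed(42L)  # makes the random split reproducible

# Toy document-feature matrix: 10 documents, 4 terms.
n_docs <- 10L
x <- matrix(
  rpois(n_docs * 4L, lambda = 1),
  nrow = n_docs,
  dimnames = list(paste0("doc", seq_len(n_docs)), paste0("term", 1:4))
)

holdout <- 0.2
# Assumption: the held-out count is the rounded fraction of documents.
n_held <- as.integer(round(holdout * n_docs))

# Draw held-out documents at random; the rest form the training shard.
held_idx  <- sample.int(n_docs, n_held)
train_idx <- setdiff(seq_len(n_docs), held_idx)

x_train <- x[train_idx, , drop = FALSE]  # used for fitting and training metrics
x_held  <- x[held_idx,  , drop = FALSE]  # used for held-out metrics only
```

Selecting the training rows with setdiff() rather than a negative index keeps the sketch correct when n_held is 0, where x[-integer(0), ] would return no rows at all.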

Parallelization. Workers run in a "PSOCK" cluster. Each worker receives its own \(K\) value and seed and runs the full fit + evaluate cycle independently. With ncores = 1, no cluster is created and the candidates are processed sequentially.
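The dispatch pattern described above can be sketched with the base parallel package. The function fit_and_score() is a hypothetical stand-in for the fit + evaluate cycle, and expanding a scalar seed to consecutive integers is an assumption about how "starting from that value" is implemented.

```r
library(parallel)

k_grid <- 2:4
seed   <- 42L

# Assumption: a scalar seed expands to consecutive integers, one per K.
seeds <- seed + seq_along(k_grid) - 1L

# Hypothetical stand-in for one fit + evaluate cycle.
fit_and_score <- function(k, seed) {
  set.seed(seed)
  data.frame(k = k, seed = seed, value = runif(1))
}

ncores <- 2L
if (ncores > 1L) {
  # PSOCK workers are fresh R sessions, so nothing is shared via forking.
  cl <- makeCluster(ncores, type = "PSOCK")
  res <- tryCatch(
    clusterMap(cl, fit_and_score, k_grid, seeds),
    finally = stopCluster(cl)  # always release the workers
  )
} else {
  # Sequential path: no cluster is created.
  res <- Map(fit_and_score, k_grid, seeds)
}

out <- do.call(rbind, res)
print(out)
```

Because every task carries its own seed and calls set.seed() itself, the result for a given \(K\) in this sketch does not depend on which worker happens to run it.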

Examples

dtm <- methods::as(
  Matrix::Matrix(
    matrix(c(2, 1, 0, 0,  1, 1, 1, 0,  0, 1, 2, 1,
             0, 0, 1, 2,  1, 0, 1, 1,  1, 2, 0, 1),
           nrow = 6, byrow = TRUE),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:6)
colnames(dtm) <- paste0("term", 1:4)

sel <- select_k_topics(
  dtm, engine = "text2vec", model = "lda",
  k_grid  = 2:3,
  metrics = c("diversity", "exclusivity"),
  holdout = 0,
  seed    = 42L,
  control = list(fit = list(n_iter = 25, progressbar = FALSE))
)
print(sel)
#> <nlp_k_selection>
#>   K grid:  2, 3
#>   metrics: diversity, exclusivity
#> 
#>   Best K per metric (aggregate level):
#>     diversity            K = 2  (0.5)
#>     exclusivity          K = 2  (0.5)