Select the Number of Topics by Grid Search

Fit a topic model for each value in a grid of topic counts and evaluate each fit with evaluate_topic_model(). The result provides the information needed to compare candidate values of \(K\) on multiple quality metrics simultaneously.

Usage

select_k_topics(
  x,
  engine,
  model,
  k_grid = 5:15,
  metrics = c("coherence_npmi", "coherence_umass", "diversity", "exclusivity",
    "held_out_nll", "held_out_perplexity", "train_nll", "train_perplexity"),
  level = c("aggregate", "topic", "all"),
  control = list(),
  holdout = 0.2,
  ncores = 1L,
  seed = NULL,
  stability_seeds = NULL,
  stability_resampling = NULL,
  stability_ncores = ncores,
  stability_reuse_fit = FALSE,
  return_fits = FALSE,
  top_n = 10L,
  epsilon = 1e-12,
  method = NULL,
  ...
)

Arguments

x: Document-feature input for fitting. Accepted classes are dgCMatrix, dfm, and DocumentTermMatrix. A corpus is not accepted; convert it to a document-feature matrix first, e.g. with quanteda::dfm().
engine: Backend package. Forwarded to fit_topic_model().
model: Model family. Forwarded to fit_topic_model().
k_grid: Integer vector of topic counts \(K\) to evaluate. Defaults to 5:15.
metrics: Character vector of metrics to compute for each candidate \(K\). Defaults to all eight metrics supported by evaluate_topic_model().
level: Reporting level forwarded to evaluate_topic_model(). One of "aggregate" (default), "topic", or "all".
control: Named list of backend controls forwarded to fit_topic_model() for every candidate \(K\). Defaults to list().
holdout: Fraction of documents held out for the held_out_nll and held_out_perplexity metrics. Must be in [0, 1). Defaults to 0.2. When holdout > 0, the remaining documents form the training data used for fitting and for the coherence and training likelihood metrics. When holdout = 0, coherence and training likelihood metrics are computed on the full fitting input, and held-out metrics are marked unsupported because no held-out data is available. If none of "held_out_nll", "held_out_perplexity", "train_nll", "train_perplexity", "coherence_npmi", or "coherence_umass" is in metrics, the holdout split is skipped and the full x is used for fitting.
ncores: Number of parallel workers. Defaults to 1L (sequential). Each candidate \(K\) is fit independently, so the grid can be spread across up to length(k_grid) workers; workers beyond that number have nothing to do. Uses "PSOCK" sockets; "FORK" is not used, to preserve quanteda/C++ stability.
seed: Integer vector with one element per entry of k_grid, used to seed each candidate \(K\)'s fit reproducibly. If a single integer is supplied, it is expanded to a vector of length length(k_grid) starting from that value. NULL means no seeding. Defaults to NULL.
stability_seeds: Optional integer vector of seeds used to assess topic stability for each candidate \(K\). When NULL (default), no stability runs are performed and output is unchanged. When supplied, each \(K\) is refit across these seeds via assess_topic_stability().
stability_resampling: Optional resampling settings forwarded to assess_topic_stability(). Defaults to NULL.
stability_ncores: Integer. Number of workers used inside each stability assessment when the outer K grid is sequential. Defaults to ncores. Nested parallelism is avoided when ncores > 1.
stability_reuse_fit: Logical. When TRUE, each candidate \(K\)'s already-fitted evaluation model is reused as the first stability run (via the initial_fit argument of assess_topic_stability()), saving one refit per \(K\). Because the reference run is then the evaluation fit instead of a fresh fit under stability_seeds[1], stability numbers differ from the default. Defaults to FALSE. Requires stability_seeds.
return_fits: Logical. Should the fitted models be returned as an attribute of the result? Defaults to FALSE. Fits can be large; set TRUE only when you need to inspect or reuse them.
top_n: Integer. Forwarded to evaluate_topic_model(). Defaults to 10L.
epsilon: Numeric. Forwarded to evaluate_topic_model(). Defaults to 1e-12.
method: Fitting method forwarded to fit_topic_model(). Defaults to NULL.
...: Additional arguments forwarded to fit_topic_model().

Value

An object of class c("nlp_k_selection", "data.table") with columns:

k: Topic count \(K\).
metric: Metric name.
level: "aggregate" or "topic".
topic_id: Topic### for topic-level rows or NA for aggregate rows. The column is retained even when only aggregate metrics are requested so selection tables keep a stable long-format schema.
value: Numeric metric value.
supported: Logical; TRUE when the metric was computed.

If return_fits = TRUE the fitted models are stored in attr(result, "fits"), a named list with names "k<value>". If stability_seeds is supplied, aggregate stability rows are added with metric = "stability" and full per-topic stability outputs are stored in attr(result, "stability").
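
Because the result is a long-format table, choosing a candidate \(K\) per metric is a matter of filtering and grouping. The following is a minimal sketch in base R, not the package's own selection logic. The values are made up for illustration, and the rule "lower is better for likelihood-based metrics, higher is better otherwise" is an assumption about how the metrics should be read.

```r
# Made-up values in the documented long format (aggregate rows only).
sel <- data.frame(
  k         = rep(c(5L, 10L, 15L), times = 2),
  metric    = rep(c("coherence_npmi", "held_out_perplexity"), each = 3),
  level     = "aggregate",
  topic_id  = NA_character_,
  value     = c(0.08, 0.12, 0.10, 950, 900, 870),
  supported = TRUE
)

# Assumption: likelihood-based metrics are minimised, all others maximised.
lower_is_better <- c("held_out_nll", "held_out_perplexity",
                     "train_nll", "train_perplexity")

# Keep only aggregate rows whose metric was actually computed.
agg <- sel[sel$level == "aggregate" & sel$supported, ]

# For each metric, pick the row with the best value.
best <- do.call(rbind, lapply(split(agg, agg$metric), function(d) {
  i <- if (d$metric[1] %in% lower_is_better) {
    which.min(d$value)
  } else {
    which.max(d$value)
  }
  d[i, c("metric", "k", "value")]
}))
rownames(best) <- NULL
best
```

Metrics often disagree on the best \(K\), as they do in this toy table, so the per-metric winners are a starting point for comparison, not a final answer.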

Details

Holdout split. When holdout > 0 and either predictive or coherence metrics are requested, x is split at the document level into a training shard (1 - holdout fraction) and a held-out shard (holdout fraction). The split is random but reproducible when seed is supplied. The training shard is passed to fit_topic_model() and to evaluate_topic_model() for coherence and training likelihood metrics; the held-out shard is passed to evaluate_topic_model() for held-out metrics. With holdout = 0, the full x is used for fitting, coherence, and training likelihood metrics, while held-out metrics are reported as unsupported.
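
The document-level split described above can be sketched in base R as follows. This is an illustration only: split_documents() is a hypothetical helper, not a package function, and the rounding rule for the held-out count is an assumption.

```r
# Toy document-feature matrix: 10 documents, 4 terms (made-up counts).
set.seed(1)
x <- matrix(rpois(40, lambda = 1), nrow = 10,
            dimnames = list(paste0("doc", 1:10), paste0("term", 1:4)))

# Hypothetical helper: split rows (documents) into training and held-out shards.
split_documents <- function(x, holdout = 0.2, seed = NULL) {
  stopifnot(holdout >= 0, holdout < 1)
  if (!is.null(seed)) set.seed(seed)     # reproducible only when seeded
  n <- nrow(x)
  n_held <- round(n * holdout)           # assumed rounding rule
  held_idx <- sample.int(n, n_held)      # random documents to hold out
  train_idx <- setdiff(seq_len(n), held_idx)
  list(train   = x[train_idx, , drop = FALSE],
       heldout = x[held_idx, , drop = FALSE])
}

parts <- split_documents(x, holdout = 0.2, seed = 42L)
dim(parts$train)    # 8 documents, 4 terms
dim(parts$heldout)  # 2 documents, 4 terms
```

Splitting by row keeps every document wholly in one shard, so held-out metrics are computed on documents the model never saw during fitting.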

A warning is issued when the number of documents is fewer than 50, because the holdout shard may be too small for stable predictive metrics.

Parallelisation. Uses "PSOCK" sockets. Each worker receives its own \(K\) value and seed and runs the full fit + evaluate cycle independently. The ncores = 1 path bypasses cluster creation entirely and runs sequentially.
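
The dispatch pattern described above might look roughly like the sketch below, which uses only the base parallel package. fit_and_score() and run_grid() are hypothetical stand-ins, not package functions, and the expansion of a single seed into consecutive integers is an assumption about what "starting from that value" means.

```r
library(parallel)

# Hypothetical stand-in for the fit + evaluate cycle of one candidate K.
fit_and_score <- function(k, seed) {
  set.seed(seed)
  data.frame(k = k, seed = seed, value = mean(runif(k)))
}

run_grid <- function(k_grid, seed, ncores = 1L) {
  # Assumed expansion of a single seed: seed, seed + 1, seed + 2, ...
  seeds <- if (length(seed) == 1L) seed + seq_along(k_grid) - 1L else seed
  stopifnot(length(seeds) == length(k_grid))

  if (ncores == 1L) {
    # Sequential path: no cluster is created.
    res <- Map(fit_and_score, k_grid, seeds)
  } else {
    # PSOCK workers; more workers than grid values would sit idle.
    cl <- makePSOCKcluster(min(ncores, length(k_grid)))
    on.exit(stopCluster(cl), add = TRUE)
    res <- clusterMap(cl, fit_and_score, k_grid, seeds)
  }
  do.call(rbind, res)
}

scores <- run_grid(2:4, seed = 42L, ncores = 1L)
scores
```

Calling run_grid(2:4, seed = 42L, ncores = 2L) takes the PSOCK branch instead. Because each task sets its own seed before drawing random numbers, both paths should return the same values in this sketch.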

Examples

dtm <- methods::as(
  Matrix::Matrix(
    matrix(c(2, 1, 0, 0,  1, 1, 1, 0,  0, 1, 2, 1,
             0, 0, 1, 2,  1, 0, 1, 1,  1, 2, 0, 1),
           nrow = 6, byrow = TRUE),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:6)
colnames(dtm) <- paste0("term", 1:4)

sel <- select_k_topics(
  dtm, engine = "text2vec", model = "lda",
  k_grid  = 2:3,
  metrics = c("diversity", "exclusivity"),
  holdout = 0,
  seed    = 42L,
  control = list(fit = list(n_iter = 25, progressbar = FALSE))
)
print(sel)
#> <nlp_k_selection>
#>   K grid:  2, 3
#>   metrics: diversity, exclusivity
#> 
#>   Best K per metric (aggregate level):
#>     diversity            K = 2  (0.5)
#>     exclusivity          K = 2  (0.5)
