Return STM-native topic labels as a standardized long table. The helper wraps stm::labelTopics() and, optionally, stm::sageLabels() while keeping NLPstudio's canonical Topic### identifiers.

Usage

get_stm_topic_labels(
  x,
  n = 7L,
  topics = NULL,
  label_types = c("prob", "frex", "lift", "score"),
  frexweight = 0.5,
  include_sage = FALSE
)

Arguments

x

An STM-backed nlp_topic_fit returned by fit_topic_model(), or a raw STM object from the stm package fitted without content covariates.

n

Integer. Number of terms per label type. Defaults to 7L.

topics

Optional topic filter, supplied as numeric topic indices or Topic### identifiers. Defaults to NULL, which returns all topics.

label_types

Character vector of STM label families to return. Valid values are "prob", "frex", "lift", and "score"; all four are returned by default.

frexweight

Numeric value in [0, 1] forwarded to stm::labelTopics() for FREX labels. Defaults to 0.5.

include_sage

Logical. Should stm::sageLabels() marginal labels also be included? Defaults to FALSE.

Value

A data.table with columns topic_id, topic_int, source, label_type, rank, and term.
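
Because the result is in long format (one row per topic, label type, and rank), it may be convenient to collapse it into one label string per topic and label type. The following is a minimal sketch using data.table only. The `labels` table is a hand-built stand-in that mimics the documented column layout, not real model output, and the names `collapsed` and `wide` are illustrative.

```r
library(data.table)

# Hand-built stand-in for the documented return shape (assumed values,
# not produced by get_stm_topic_labels()).
labels <- data.table(
  topic_id   = rep(c("Topic001", "Topic002"), each = 4L),
  topic_int  = rep(1:2, each = 4L),
  source     = "labelTopics",
  label_type = rep(rep(c("frex", "prob"), each = 2L), times = 2L),
  rank       = rep(1:2, times = 4L),
  term       = c("growth", "profit", "growth", "profit",
                 "risk", "loss", "loss", "risk")
)

# Sort by rank so terms are pasted in label order, then collapse each group
# into a single comma-separated label.
setorder(labels, topic_id, label_type, rank)
collapsed <- labels[, .(label = paste(term, collapse = ", ")),
                    by = .(topic_id, label_type)]

# Reshape to one row per topic and one column per label type.
wide <- dcast(collapsed, topic_id ~ label_type, value.var = "label")
print(wide)
```

The same steps should apply to an actual result from get_stm_topic_labels(), provided the columns match those listed above.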

Details

This function is STM-specific. It is meant to complement the engine-agnostic get_top_terms() accessor when users want labels based on STM's own probability, FREX, lift, score, and optional SAGE calculations.

STM content-covariate models are not supported because they imply covariate-specific topic-word distributions, while NLPstudio currently standardizes one TWW matrix per fit.

Examples

dtm <- methods::as(
  Matrix::Matrix(
    matrix(c(2, 1, 0, 0,  1, 2, 0, 0,  0, 0, 2, 1,
             0, 0, 1, 2,  2, 1, 0, 0,  0, 0, 1, 2),
           nrow = 6, byrow = TRUE),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:6)
colnames(dtm) <- c("growth", "profit", "risk", "loss")
fit <- fit_topic_model(
  dtm,
  engine = "stm",
  model = "stm",
  k = 2,
  control = list(fit = list(seed = 1, max.em.its = 5, verbose = FALSE))
)
#> Warning: K=2 is equivalent to a unidimensional scaling model which you may prefer.
get_stm_topic_labels(fit, n = 3)
#>     topic_id topic_int      source label_type  rank   term
#>       <char>     <int>      <char>     <char> <int> <char>
#>  1: Topic001         1 labelTopics       frex     1 growth
#>  2: Topic001         1 labelTopics       frex     2 profit
#>  3: Topic001         1 labelTopics       frex     3   loss
#>  4: Topic001         1 labelTopics       lift     1 profit
#>  5: Topic001         1 labelTopics       lift     2 growth
#>  6: Topic001         1 labelTopics       lift     3   loss
#>  7: Topic001         1 labelTopics       prob     1 growth
#>  8: Topic001         1 labelTopics       prob     2 profit
#>  9: Topic001         1 labelTopics       prob     3   loss
#> 10: Topic001         1 labelTopics      score     1 profit
#> 11: Topic001         1 labelTopics      score     2 growth
#> 12: Topic001         1 labelTopics      score     3   risk
#> 13: Topic002         2 labelTopics       frex     1   risk
#> 14: Topic002         2 labelTopics       frex     2   loss
#> 15: Topic002         2 labelTopics       frex     3 growth
#> 16: Topic002         2 labelTopics       lift     1   risk
#> 17: Topic002         2 labelTopics       lift     2   loss
#> 18: Topic002         2 labelTopics       lift     3 growth
#> 19: Topic002         2 labelTopics       prob     1   loss
#> 20: Topic002         2 labelTopics       prob     2   risk
#> 21: Topic002         2 labelTopics       prob     3 growth
#> 22: Topic002         2 labelTopics      score     1   risk
#> 23: Topic002         2 labelTopics      score     2   loss
#> 24: Topic002         2 labelTopics      score     3 profit
#>     topic_id topic_int      source label_type  rank   term
#>       <char>     <int>      <char>     <char> <int> <char>