Summarize STM Topics — summarize_stm

Build a one-row-per-topic interpretation table for an STM fit, combining NLPstudio's engine-agnostic topic summary with STM-native label columns.

Usage

summarize_stm_topics(
  fit,
  training = NULL,
  doc_data = NULL,
  top_n = 10L,
  representative_n = 3L,
  include_text = FALSE,
  docvars = FALSE,
  label_n = 7L,
  label_types = c("prob", "frex", "lift", "score"),
  frexweight = 0.5,
  include_sage = FALSE,
  doc_id_col = "doc_id",
  text_col = "text"
)

Arguments

fit: An STM nlp_topic_fit returned by fit_topic_model(), or a raw STM object from the stm package fitted without content covariates.
training: Optional training document-feature matrix forwarded to summarize_topics().
doc_data: Optional document metadata or text source forwarded to summarize_topics().
top_n: Integer. Number of highest-probability terms reported by the generic topic summary. Defaults to 10L.
representative_n: Integer. Number of representative documents retained per topic. Defaults to 3L.
include_text: Logical. Should representative text be included when available? Defaults to FALSE.
docvars: Logical. Should stored document variables be made available in the representative-document output? Defaults to FALSE.
label_n: Integer. Number of STM-native label terms per label type. Defaults to 7L.
label_types: Character vector of STM label families. Valid values are "prob", "frex", "lift", and "score". Defaults to all four.
frexweight: Numeric value in [0, 1] forwarded to stm::labelTopics(). Defaults to 0.5.
include_sage: Logical. Should SAGE marginal label columns be included? Defaults to FALSE.
doc_id_col: Document-ID column name when doc_data is tabular. Defaults to "doc_id".
text_col: Text column name when doc_data is tabular. Defaults to "text".

Value

A data.table with one row per STM topic.

Details

summarize_stm_topics() keeps summarize_topics() as the generic summary engine and adds collapsed STM-native label columns (one comma-separated string of terms per topic) such as stm_prob_terms, stm_frex_terms, stm_lift_terms, and stm_score_terms. When include_sage = TRUE, the corresponding stm_sage_*_terms columns are added as well.
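
As a rough illustration of what "collapsed" means here, the sketch below turns a topics-by-terms character matrix into one comma-separated string per topic, using base R only. For models without content covariates, stm::labelTopics() returns matrices of this shape in its prob, frex, lift, and score elements. The helper collapse_label_matrix() and the toy matrices are hypothetical stand-ins for illustration, not NLPstudio's internal implementation.

```r
# Hypothetical helper: collapse a K x n character matrix of label terms
# into one comma-separated string per topic (one string per row).
collapse_label_matrix <- function(label_matrix, sep = ", ") {
  apply(label_matrix, 1L, paste, collapse = sep)
}

# Toy stand-ins for the per-type matrices a label extractor might return
# (assumed shape: one row per topic, one column per ranked term).
toy_labels <- list(
  prob = matrix(c("growth", "profit", "loss",
                  "loss", "risk", "growth"), nrow = 2, byrow = TRUE),
  frex = matrix(c("growth", "profit", "loss",
                  "risk", "loss", "growth"), nrow = 2, byrow = TRUE)
)

# One row per topic, one collapsed column per label type.
label_table <- data.frame(
  topic_int = seq_len(nrow(toy_labels$prob)),
  stm_prob_terms = collapse_label_matrix(toy_labels$prob),
  stm_frex_terms = collapse_label_matrix(toy_labels$frex),
  stringsAsFactors = FALSE
)
print(label_table)
```

Keeping each label family in its own column preserves the one-row-per-topic shape, so the label columns can sit next to the generic summary columns without reshaping.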

Examples

dtm <- methods::as(
  Matrix::Matrix(
    matrix(c(2, 1, 0, 0,  1, 2, 0, 0,  0, 0, 2, 1,
             0, 0, 1, 2,  2, 1, 0, 0,  0, 0, 1, 2),
           nrow = 6, byrow = TRUE),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:6)
colnames(dtm) <- c("growth", "profit", "risk", "loss")
fit <- fit_topic_model(
  dtm,
  engine = "stm",
  model = "stm",
  k = 2,
  control = list(fit = list(seed = 1, max.em.its = 5, verbose = FALSE))
)
#> Warning: K=2 is equivalent to a unidimensional scaling model which you may prefer.
summarize_stm_topics(fit, training = dtm, top_n = 3, label_n = 3)
#>    topic_id topic_int            top_terms          top_term_probabilities
#>      <char>     <int>               <char>                          <char>
#> 1: Topic001         1 growth, profit, loss 0.555555, 0.444444, 8.93273e-08
#> 2: Topic002         2   loss, risk, growth 0.555555, 0.444444, 8.93273e-08
#>    prevalence coherence_npmi coherence_umass diversity exclusivity
#>         <num>          <num>           <num>     <num>       <num>
#> 1:        0.5     -0.2998856       -19.15309 0.6666667   0.6666667
#> 2:        0.5     -0.2998856       -19.15309 0.6666667   0.6666667
#>    representative_doc_ids representative_documents       stm_frex_terms
#>                    <char>                   <list>               <char>
#> 1:       doc2, doc1, doc5        <data.table[3x1]> growth, profit, loss
#> 2:       doc3, doc4, doc6        <data.table[3x1]>   risk, loss, growth
#>          stm_lift_terms       stm_prob_terms      stm_score_terms
#>                  <char>               <char>               <char>
#> 1: profit, growth, loss growth, profit, loss profit, growth, risk
#> 2:   risk, loss, growth   loss, risk, growth   risk, loss, profit