Build a one-row-per-topic interpretation table for an STM fit, combining NLPstudio's engine-agnostic topic summary with STM-native label columns.
Usage
summarize_stm_topics(
  fit,
  training = NULL,
  doc_data = NULL,
  top_n = 10L,
  representative_n = 3L,
  include_text = FALSE,
  docvars = FALSE,
  label_n = 7L,
  label_types = c("prob", "frex", "lift", "score"),
  frexweight = 0.5,
  include_sage = FALSE,
  doc_id_col = "doc_id",
  text_col = "text"
)
Arguments
- fit
An STM nlp_topic_fit returned by fit_topic_model() or a raw stm STM object without content covariates.
- training
Optional training document-feature matrix forwarded to summarize_topics().
- doc_data
Optional document metadata or text source forwarded to summarize_topics().
- top_n
Integer. Number of probability top terms used by the generic topic summary. Defaults to 10L.
- representative_n
Integer. Number of representative documents retained per topic. Defaults to 3L.
- include_text
Logical. Should representative text be included when available? Defaults to FALSE.
- docvars
Logical. Should stored document variables be available for representative selection output? Defaults to FALSE.
- label_n
Integer. Number of STM-native label terms per label type. Defaults to 7L.
- label_types
Character vector of STM label families. Valid values are "prob", "frex", "lift", and "score".
- frexweight
Numeric value in [0, 1] forwarded to stm::labelTopics(). Defaults to 0.5.
- include_sage
Logical. Should SAGE marginal label columns be included? Defaults to FALSE.
- doc_id_col
Document-ID column name when doc_data is tabular. Defaults to "doc_id".
- text_col
Text column name when doc_data is tabular. Defaults to "text".
Value
A data.table with one row per STM topic.
Details
summarize_stm_topics() keeps summarize_topics() as the generic summary
engine and adds one collapsed STM-native label column per requested label
type: stm_prob_terms, stm_frex_terms, stm_lift_terms, and
stm_score_terms. Each of these columns holds the topic's top label_n terms
as a single comma-separated string. When include_sage = TRUE, corresponding
stm_sage_*_terms columns are added.
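Because the label columns are collapsed strings, downstream code that needs individual terms has to split them again. The following is a minimal base R sketch, not part of NLPstudio: topic_table is a hypothetical stand-in for the function's return value, with values copied from the example output on this page, and the ", " separator is an assumption based on that output.

```r
# Hypothetical stand-in for two columns of a summarize_stm_topics() result.
# The values are copied from the example output on this page.
topic_table <- data.frame(
  topic_id = c("Topic001", "Topic002"),
  stm_frex_terms = c("growth, profit, loss", "risk, loss, growth"),
  stringsAsFactors = FALSE
)

# Assumption: terms are joined with a comma followed by optional whitespace.
frex_terms <- strsplit(topic_table$stm_frex_terms, ",\\s*")
names(frex_terms) <- topic_table$topic_id

# Character vector of FREX terms for one topic, in rank order.
frex_terms[["Topic002"]]

# Long format: one row per topic-term pair, keeping the rank within each topic.
frex_long <- data.frame(
  topic_id = rep(topic_table$topic_id, lengths(frex_terms)),
  rank = sequence(lengths(frex_terms)),
  term = unlist(frex_terms, use.names = FALSE),
  stringsAsFactors = FALSE
)
frex_long
```

The same pattern should apply to the other stm_*_terms columns. It would mis-split any vocabulary term that itself contains a comma.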
Examples
# Toy document-feature matrix: 6 documents, 4 terms
dtm <- methods::as(
  Matrix::Matrix(
    matrix(c(2, 1, 0, 0, 1, 2, 0, 0, 0, 0, 2, 1,
             0, 0, 1, 2, 2, 1, 0, 0, 0, 0, 1, 2),
           nrow = 6, byrow = TRUE),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:6)
colnames(dtm) <- c("growth", "profit", "risk", "loss")
# Fit a two-topic STM with a fixed seed and a short EM run
fit <- fit_topic_model(
  dtm,
  engine = "stm",
  model = "stm",
  k = 2,
  control = list(fit = list(seed = 1, max.em.its = 5, verbose = FALSE))
)
#> Warning: K=2 is equivalent to a unidimensional scaling model which you may prefer.
summarize_stm_topics(fit, training = dtm, top_n = 3, label_n = 3)
#> topic_id topic_int top_terms top_term_probabilities
#> <char> <int> <char> <char>
#> 1: Topic001 1 growth, profit, loss 0.555555, 0.444444, 8.93273e-08
#> 2: Topic002 2 loss, risk, growth 0.555555, 0.444444, 8.93273e-08
#> prevalence coherence_npmi coherence_umass diversity exclusivity
#> <num> <num> <num> <num> <num>
#> 1: 0.5 -0.2998856 -19.15309 0.6666667 0.6666667
#> 2: 0.5 -0.2998856 -19.15309 0.6666667 0.6666667
#> representative_doc_ids representative_documents stm_frex_terms
#> <char> <list> <char>
#> 1: doc2, doc1, doc5 <data.table[3x1]> growth, profit, loss
#> 2: doc3, doc4, doc6 <data.table[3x1]> risk, loss, growth
#> stm_lift_terms stm_prob_terms stm_score_terms
#> <char> <char> <char>
#> 1: profit, growth, loss growth, profit, loss profit, growth, risk
#> 2: risk, loss, growth loss, risk, growth risk, loss, profit
