Summarize Topics for Interpretation — summarize

Build a compact interpretation table, one row per topic, from an nlp_topic_fit object. The table combines each topic's top terms, prevalence, any available evaluation metrics, and representative documents.

Usage

summarize_topics(
  fit,
  training = NULL,
  doc_data = NULL,
  top_n = 10L,
  representative_n = 3L,
  include_text = FALSE,
  docvars = FALSE,
  doc_id_col = "doc_id",
  text_col = "text"
)

Arguments

fit: An nlp_topic_fit object returned by fit_topic_model().
training: Optional training document-feature matrix. When supplied, coherence metrics are included.
doc_data: Optional document metadata or text source forwarded to get_dtw() and get_representative_candidates().
top_n: Integer. Number of top terms per topic. Defaults to 10L.
representative_n: Integer. Number of representative documents to retain per topic. Defaults to 3L.
include_text: Logical. Should representative text be included when available? Defaults to FALSE.
docvars: Logical. Should stored document variables be included in the representative-document output? Defaults to FALSE.
doc_id_col: Document-ID column name when doc_data is tabular. Defaults to "doc_id".
text_col: Text column name when doc_data is tabular. Defaults to "text".

Value

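A data.table with one row per topic. In the example below, the columns are the topic identifiers (topic_id, topic_int), top_terms and top_term_probabilities, prevalence, the evaluation metrics (coherence_npmi, coherence_umass, diversity, exclusivity), and the representative documents (representative_doc_ids, representative_documents). Coherence metrics are included only when training is supplied.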

Examples

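# Toy document-feature matrix: 6 documents (rows) by 4 terms (columns),
# stored as a sparse dgCMatrix.
dtm <- methods::as(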
  Matrix::Matrix(
    matrix(c(2, 1, 0, 0,  1, 1, 1, 0,  0, 1, 2, 1,
             0, 0, 1, 2,  1, 0, 1, 1,  1, 2, 0, 1),
           nrow = 6, byrow = TRUE),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:6)
colnames(dtm) <- paste0("term", 1:4)
# Fit a two-topic LDA model by Gibbs sampling, with a fixed seed.
fit <- fit_topic_model(
  dtm,
  engine = "topicmodels",
  model = "lda",
  k = 2,
  method = "Gibbs",
  control = list(fit = list(seed = 1, iter = 50, burnin = 0, thin = 1))
)
# Supplying training = dtm adds the coherence columns to the summary.
summarize_topics(fit, training = dtm, top_n = 3)
#>    topic_id topic_int           top_terms         top_term_probabilities
#>      <char>     <int>              <char>                         <char>
#> 1: Topic001         1 term1, term4, term2 0.490385, 0.490385, 0.00961538
#> 2: Topic002         2 term2, term3, term1 0.490385, 0.490385, 0.00961538
#>    prevalence coherence_npmi coherence_umass diversity exclusivity
#>         <num>          <num>           <num>     <num>       <num>
#> 1:  0.5000582     -0.1179313      -0.5579921 0.6666667   0.6602564
#> 2:  0.4999418     -0.1179313      -0.5579921 0.6666667   0.6602564
#>    representative_doc_ids representative_documents
#>                    <char>                   <list>
#> 1:       doc1, doc4, doc5        <data.table[3x1]>
#> 2:             doc3, doc2        <data.table[2x1]>
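
The printed output above stores top_terms and top_term_probabilities as comma-separated strings, which is convenient for reading but awkward for plotting or filtering. The following is a minimal sketch of one way to reshape them into one row per topic-term pair. It assumes the ", " separator seen in the printed output. The `topics` table is a hand-built stand-in that copies values from that output rather than a real summarize_topics() result, and `split_field()` and `long_terms` are hypothetical names introduced only for illustration.

```r
library(data.table)

# Hypothetical stand-in for a summarize_topics() result: only the columns
# needed here, with values copied from the printed output above.
topics <- data.table(
  topic_id = c("Topic001", "Topic002"),
  top_terms = c("term1, term4, term2", "term2, term3, term1"),
  top_term_probabilities = c(
    "0.490385, 0.490385, 0.00961538",
    "0.490385, 0.490385, 0.00961538"
  )
)

# Illustrative helper: split one comma-separated string into its pieces.
# Assumes ", " is the separator, as in the printed output.
split_field <- function(x) strsplit(x, ", ", fixed = TRUE)[[1]]

# One row per (topic, term) pair. Grouping by topic_id relies on the table
# having exactly one row per topic, so each group holds a single string.
long_terms <- topics[, .(
  term = split_field(top_terms),
  probability = as.numeric(split_field(top_term_probabilities))
), by = topic_id]

# Rank terms within each topic in the order they were listed.
long_terms[, rank := seq_len(.N), by = topic_id]

print(long_terms)
```

The long layout makes it straightforward to keep, say, only the top-ranked term per topic with `long_terms[rank == 1L]`. The same splitting approach should carry over to representative_doc_ids if it uses the same separator.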