Extract Representative Topic Candidates — get_representative

Identify representative document candidates from a topic model by assigning each document to its dominant topic and then banding the documents within each topic by their dominant DTW values.

Usage

get_representative_candidates(
  x,
  doc_data = NULL,
  topics = NULL,
  docvars = FALSE,
  include_text = FALSE,
  top_n = NULL,
  quantile_probs = c(0.25, 0.5, 0.75),
  labels = c("VLOW", "LOW", "HIGH", "VHIGH"),
  doc_id_col = "doc_id",
  text_col = "text"
)

Arguments

x: A supported topic-model object. This includes nlp_topic_fit, raw topicmodels fits, raw seededlda fits, and already standardized DTW tables.
doc_data: Optional document-data override. When supplied, this is used instead of any doc_data stored in x. Accepted inputs are a corpus, data.frame, or data.table keyed by doc_id.
topics: Optional topic filter. May be supplied as numeric indices or Topic### identifiers. Filtering occurs after dominant-topic assignment.
docvars: Should stored or pre-existing document metadata be joined onto the returned DTW table? Defaults to FALSE.
include_text: Should a text column be attached when a text-bearing doc_data source is available? Defaults to FALSE. When TRUE but no text-bearing doc_data is available, the function emits a warning.
top_n: Optional integer. When supplied, keep only the top_n highest-ranked documents within each dominant topic. Defaults to NULL, which returns every document. Useful for extracting a handful of exemplar documents per topic on large corpora.
quantile_probs: Numeric vector of cumulative probabilities used to form candidate bands within each dominant topic. Defaults to quartiles: c(0.25, 0.50, 0.75).
labels: Labels used for the candidate bands. Must have length length(quantile_probs) + 1L. Defaults to c("VLOW", "LOW", "HIGH", "VHIGH").
doc_id_col: Document-ID column name when doc_data is a data.frame or data.table. Defaults to "doc_id".
text_col: Text column name when doc_data is a data.frame or data.table. Defaults to "text".
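
The way topics and top_n narrow the result can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the table cand and the helpers normalize_topics() and filter_candidates() are hypothetical, and the zero-padded Topic### format is inferred from identifiers such as Topic001.

```r
# Hypothetical, already-ranked candidate table; column names follow the
# core columns listed under Value.
cand <- data.frame(
  doc_id        = paste0("doc", 1:6),
  topic_max_id  = c("Topic001", "Topic001", "Topic001",
                    "Topic002", "Topic002", "Topic003"),
  topic_max_int = c(1L, 1L, 1L, 2L, 2L, 3L),
  topic_rank    = c(1L, 2L, 3L, 1L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: accept numeric indices or "Topic###" identifiers.
normalize_topics <- function(topics) {
  if (is.numeric(topics)) {
    sprintf("Topic%03d", as.integer(topics))
  } else {
    as.character(topics)
  }
}

# Hypothetical helper: both filters act on documents that already carry
# a dominant topic and a within-topic rank.
filter_candidates <- function(cand, topics = NULL, top_n = NULL) {
  if (!is.null(topics)) {
    keep <- cand$topic_max_id %in% normalize_topics(topics)
    cand <- cand[keep, , drop = FALSE]
  }
  if (!is.null(top_n)) {
    cand <- cand[cand$topic_rank <= top_n, , drop = FALSE]
  }
  cand
}

by_index <- filter_candidates(cand, topics = c(1, 2), top_n = 2)
by_id    <- filter_candidates(cand, topics = c("Topic001", "Topic002"), top_n = 2)
by_index$doc_id
```

In this sketch both calls keep doc1, doc2, doc4, and doc5: the third-ranked document in Topic001 is dropped by top_n, and Topic003 is dropped by the topic filter.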

Value

A data.table with one row per retained document (every document unless topics or top_n narrows the result) and these core columns:

doc_id
topic_max_id
topic_max_int
topic_max_value
candidate_band
topic_rank

Stored docvars are included when docvars = TRUE. Metadata columns from doc_data and optional text are included when available. When docvars = FALSE, columns that match stored docvar names are omitted even if they are also present in doc_data. Columns are ordered as doc_id, document metadata, representative-candidate output columns, and finally text when text is requested and available.
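
To make the core columns concrete, the following base R sketch derives them (except candidate_band) from a small document-topic weight matrix. The matrix dtw is hypothetical, and the tie-breaking choices (first topic wins, ties in rank broken by row order) are assumptions rather than documented behavior.

```r
# Hypothetical document-topic weights: rows are documents, columns are topics.
dtw <- matrix(
  c(0.7, 0.3,
    0.2, 0.8,
    0.9, 0.1,
    0.4, 0.6),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("Topic001", "Topic002"))
)

# Dominant topic per document: column index of the largest weight.
# Assumption: ties go to the first topic.
topic_max_int <- max.col(dtw, ties.method = "first")

# Weight of the dominant topic, picked out by (row, column) pairs.
topic_max_value <- dtw[cbind(seq_len(nrow(dtw)), topic_max_int)]

# Rank within each dominant topic, highest weight first.
topic_rank <- as.integer(ave(
  -topic_max_value, topic_max_int,
  FUN = function(v) rank(v, ties.method = "first")
))

core <- data.frame(
  doc_id          = rownames(dtw),
  topic_max_id    = colnames(dtw)[topic_max_int],
  topic_max_int   = topic_max_int,
  topic_max_value = topic_max_value,
  topic_rank      = topic_rank,
  stringsAsFactors = FALSE
)

# Sort by topic, then by rank within topic.
core <- core[order(core$topic_max_int, core$topic_rank), ]
core
```

Here doc3 and doc1 fall under Topic001 (ranks 1 and 2), and doc2 and doc4 fall under Topic002 (ranks 1 and 2).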

Details

Candidate bands are computed within each dominant topic, not globally across the corpus. When within-topic quantile cut points collapse because of small groups or tied values, the function falls back to deterministic rank-based banding.
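
One way to picture within-topic banding with a rank-based fallback is the base R sketch below. The helper band_within_topic() is hypothetical, and both the collapse test (duplicated cut points) and the fallback rule (percentile rank compared against quantile_probs) are assumptions; the package's actual rule is not spelled out here and may differ.

```r
# Hypothetical helper: band one topic's dominant values.
band_within_topic <- function(values,
                              probs  = c(0.25, 0.5, 0.75),
                              labels = c("VLOW", "LOW", "HIGH", "VHIGH")) {
  # One more label than cut points, as required for the labels argument.
  stopifnot(length(labels) == length(probs) + 1L)

  cuts <- stats::quantile(values, probs = probs, names = FALSE)

  if (anyDuplicated(cuts) == 0L) {
    # Distinct cut points: ordinary quantile banding.
    idx <- cut(values, breaks = c(-Inf, cuts, Inf), labels = FALSE)
  } else {
    # Collapsed cut points (ties or tiny groups): assumed fallback that
    # bands by percentile rank, with ties broken by input order.
    pct <- (rank(values, ties.method = "first") - 1) / length(values)
    idx <- findInterval(pct, probs) + 1L
  }
  labels[idx]
}

# Dominant values and topics as printed in the Examples output
# (doc1, doc3, doc2, doc4).
values <- c(0.5, 0.5, 1.0, 0.6666667)
topic  <- c(1L, 1L, 2L, 2L)

# Bands are computed separately inside each dominant topic.
band <- character(length(values))
for (g in unique(topic)) {
  in_topic <- topic == g
  band[in_topic] <- band_within_topic(values[in_topic])
}
band
```

With these inputs the tied pair in topic 1 takes the fallback path, while topic 2 is banded by its own quantiles. The sketch happens to return the same bands as the printed example, which does not confirm that the package uses this rule.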

Examples

dtm <- methods::as(
  Matrix::Matrix(
    matrix(
      c(1, 0, 1,
        1, 1, 0,
        0, 1, 1,
        1, 1, 1),
      nrow = 4,
      byrow = TRUE
    ),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:4)
colnames(dtm) <- paste0("term", 1:3)

metadata <- data.table::data.table(
  doc_id = rownames(dtm),
  year = 2020:2023,
  text = c("alpha beta", "beta gamma", "gamma delta", "alpha delta")
)

fit <- fit_topic_model(
  dtm,
  engine = "text2vec",
  model = "lda",
  k = 2,
  doc_data = metadata,
  control = list(fit = list(n_iter = 25, progressbar = FALSE))
)

get_representative_candidates(fit, include_text = TRUE)
#>    doc_id  year topic_max_id topic_max_int topic_max_value candidate_band
#>    <char> <int>       <char>         <int>           <num>         <char>
#> 1:   doc1  2020     Topic001             1       0.5000000           VLOW
#> 2:   doc3  2022     Topic001             1       0.5000000           HIGH
#> 3:   doc2  2021     Topic002             2       1.0000000          VHIGH
#> 4:   doc4  2023     Topic002             2       0.6666667           VLOW
#>    topic_rank        text
#>         <int>      <char>
#> 1:          1  alpha beta
#> 2:          2 gamma delta
#> 3:          1  beta gamma
#> 4:          2 alpha delta