
Identify representative document candidates from a topic model by assigning each document to its dominant topic and then banding the documents within each topic by their dominant-topic DTW values.

Usage

get_representative_candidates(
  x,
  doc_data = NULL,
  topics = NULL,
  docvars = FALSE,
  include_text = FALSE,
  quantile_probs = c(0.25, 0.5, 0.75),
  labels = c("VLOW", "LOW", "HIGH", "VHIGH"),
  doc_id_col = "doc_id",
  text_col = "text"
)

Arguments

x

A supported topic-model object. This includes nlp_topic_fit, raw topicmodels fits, raw seededlda fits, and already standardized DTW tables.

doc_data

Optional document-data override. When supplied, this is used instead of any doc_data stored in x. Accepted inputs are a corpus, data.frame, or data.table keyed by doc_id.

topics

Optional topic filter. May be supplied as numeric indices or Topic### identifiers. Filtering occurs after dominant-topic assignment.

docvars

Should stored or pre-existing document metadata be joined onto the returned DTW table? Defaults to FALSE.

include_text

Should a text column be attached when a text-bearing doc_data source is available? Defaults to FALSE. When TRUE but no text-bearing doc_data is available, the function emits a warning.

quantile_probs

Numeric vector of cumulative probabilities used to form candidate bands within each dominant topic. Defaults to the quartiles c(0.25, 0.5, 0.75).

labels

Labels used for the candidate bands, ordered from the lowest band to the highest. Must contain exactly length(quantile_probs) + 1L elements. Defaults to c("VLOW", "LOW", "HIGH", "VHIGH").

doc_id_col

Document-ID column name when doc_data is a data.frame or data.table. Defaults to "doc_id".

text_col

Text column name when doc_data is a data.frame or data.table. Defaults to "text".

Value

A data.table with one row per document and these core columns:

  • doc_id

  • topic_max_id

  • topic_max_int

  • topic_max_value

  • candidate_band

  • topic_rank

Stored docvars are included when docvars = TRUE. Metadata columns from doc_data and optional text are included when available. When docvars = FALSE, columns that match stored docvar names are omitted even if they are also present in doc_data. Columns are ordered as doc_id, document metadata, representative-candidate output columns, and finally text when text is requested and available.
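As a rough illustration of how the core columns relate to one another, the following base-R sketch derives them from a small document-topic matrix. It is not the package's implementation: the matrix `dtw`, the tie-breaking rule (first topic wins), and the descending rank order are assumptions chosen to be consistent with the example output shown under Examples.

```r
# Hypothetical document-topic weights (rows = documents, columns = topics).
dtw <- matrix(
  c(0.5, 0.5,
    0.0, 1.0,
    1.0, 0.0,
    1/3, 2/3),
  nrow = 4,
  byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("Topic001", "Topic002"))
)

# Dominant topic per document; ties go to the first topic (an assumption).
topic_max_int <- max.col(dtw, ties.method = "first")
topic_max_id <- colnames(dtw)[topic_max_int]
topic_max_value <- dtw[cbind(seq_len(nrow(dtw)), topic_max_int)]

# Rank documents within each dominant topic, highest weight first (assumed).
topic_rank <- ave(
  -topic_max_value,
  topic_max_id,
  FUN = function(v) rank(v, ties.method = "first")
)

candidates <- data.frame(
  doc_id = rownames(dtw),
  topic_max_id = topic_max_id,
  topic_max_int = topic_max_int,
  topic_max_value = topic_max_value,
  topic_rank = as.integer(topic_rank),
  stringsAsFactors = FALSE
)
print(candidates)
```

The sketch leaves out candidate_band, which depends on the within-topic quantile logic described under Details.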

Details

Candidate bands are computed within each dominant topic, not globally across the corpus. When within-topic quantile cut points collapse because of small groups or tied values, the function falls back to deterministic rank-based banding.
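The within-topic banding and its fallback can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's code: the helper `band_values()` is hypothetical, and the rank-based rule used when cut points collapse is one plausible deterministic choice; the actual fallback may assign bands differently.

```r
# Hypothetical helper: band one topic's dominant values into labelled bands.
band_values <- function(values,
                        probs = c(0.25, 0.5, 0.75),
                        labels = c("VLOW", "LOW", "HIGH", "VHIGH")) {
  stopifnot(length(labels) == length(probs) + 1L)
  cuts <- stats::quantile(values, probs = probs, names = FALSE)
  if (anyDuplicated(cuts) == 0L) {
    # Distinct cut points: band i covers (cuts[i - 1], cuts[i]].
    idx <- findInterval(values, cuts, left.open = TRUE) + 1L
  } else {
    # Collapsed cut points (ties or tiny groups): assumed fallback that
    # spreads ranks evenly across the bands, breaking ties by position.
    r <- rank(values, ties.method = "first")
    idx <- as.integer(ceiling(r * length(labels) / length(values)))
  }
  labels[idx]
}

# Dominant-topic values and their topics for seven hypothetical documents.
values <- c(1.0, 0.5, 1.0, 2/3, 0.9, 0.9, 0.9)
topic <- c("Topic001", "Topic001", "Topic002", "Topic002",
           "Topic003", "Topic003", "Topic003")

# Band within each topic, then restore the original document order.
bands <- unsplit(lapply(split(values, topic), band_values), topic)
print(data.frame(topic, values, bands))
```

Here Topic001 and Topic002 have distinct quantile cut points, so their documents are banded by quantile, while the three tied values in Topic003 collapse the cut points and trigger the rank-based branch.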

Examples

dtm <- methods::as(
  Matrix::Matrix(
    matrix(
      c(1, 0, 1,
        1, 1, 0,
        0, 1, 1,
        1, 1, 1),
      nrow = 4,
      byrow = TRUE
    ),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:4)
colnames(dtm) <- paste0("term", 1:3)

metadata <- data.table::data.table(
  doc_id = rownames(dtm),
  year = 2020:2023,
  text = c("alpha beta", "beta gamma", "gamma delta", "alpha delta")
)

fit <- fit_topic_model(
  dtm,
  engine = "text2vec",
  model = "lda",
  k = 2,
  doc_data = metadata,
  control = list(fit = list(n_iter = 25, progressbar = FALSE))
)

get_representative_candidates(fit, include_text = TRUE)
#>    doc_id  year  Topic001  Topic002 topic_max_id topic_max_int topic_max_value
#>    <char> <int>     <num>     <num>       <char>         <int>           <num>
#> 1:   doc3  2022 1.0000000 0.0000000     Topic001             1       1.0000000
#> 2:   doc1  2020 0.5000000 0.5000000     Topic001             1       0.5000000
#> 3:   doc2  2021 0.0000000 1.0000000     Topic002             2       1.0000000
#> 4:   doc4  2023 0.3333333 0.6666667     Topic002             2       0.6666667
#>    candidate_band topic_rank        text
#>            <char>      <int>      <char>
#> 1:          VHIGH          1 gamma delta
#> 2:           VLOW          2  alpha beta
#> 3:          VHIGH          1  beta gamma
#> 4:           VLOW          2 alpha delta