
Extract Representative Topic Candidates
Source: R/topic_model_api.R
get_representative_candidates.Rd
Identify representative document candidates from a topic model by assigning each document to its dominant topic and banding documents within topic by their dominant DTW values.
Arguments
- x
A supported topic-model object. This includes nlp_topic_fit, raw topicmodels fits, raw seededlda fits, and already standardized DTW tables.
- doc_data
Optional document-data override. When supplied, this is used instead of any doc_data stored in x. Accepted inputs are a corpus, data.frame, or data.table keyed by doc_id.
- topics
Optional topic filter. May be supplied as numeric indices or Topic### identifiers. Filtering occurs after dominant-topic assignment.
- docvars
Should stored or pre-existing document metadata be joined onto the returned DTW table? Defaults to FALSE.
- include_text
Should a text column be attached when a text-bearing doc_data source is available? Defaults to FALSE. When TRUE but no text-bearing doc_data is available, the function emits a warning.
- quantile_probs
Numeric vector of cumulative probabilities used to form candidate bands within each dominant topic. Defaults to quartiles: c(0.25, 0.50, 0.75).
- labels
Labels used for the candidate bands. Must have length length(quantile_probs) + 1L. Defaults to c("VLOW", "LOW", "HIGH", "VHIGH").
- doc_id_col
Document-ID column name when doc_data is a data.frame or data.table. Defaults to "doc_id".
- text_col
Text column name when doc_data is a data.frame or data.table. Defaults to "text".
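The length rule for labels follows from how cut points partition a range: n interior cut points delimit n + 1 bands. The base R sketch below illustrates this with tercile bands. It is not the package's implementation, and the values vector and the "LOW"/"MID"/"HIGH" labels are made up for illustration.

```r
# Hypothetical dominant-topic weights for six documents in one topic.
values <- c(0.10, 0.20, 0.30, 0.40, 0.50, 0.60)

# Two interior cut points (terciles) delimit three bands.
quantile_probs <- c(1 / 3, 2 / 3)
labels <- c("LOW", "MID", "HIGH")
stopifnot(length(labels) == length(quantile_probs) + 1L)

# Interior cut points, padded with -Inf/Inf so every value falls in a band.
cuts <- stats::quantile(values, probs = quantile_probs, names = FALSE)
breaks <- c(-Inf, cuts, Inf)

bands <- as.character(cut(values, breaks = breaks, labels = labels))
print(bands)
# "LOW" "LOW" "MID" "MID" "HIGH" "HIGH"
```

Passing the same quantile_probs and labels to get_representative_candidates() should band documents in a comparable way within each dominant topic, although the exact boundary handling inside the function is not documented here.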
Value
A data.table with one row per document and these core columns:
- doc_id
- topic_max_id
- topic_max_int
- topic_max_value
- candidate_band
- topic_rank
Stored docvars are included when docvars = TRUE. Metadata columns from
doc_data and optional text are included when available.
When docvars = FALSE, columns that match stored docvar names are omitted
even if they are also present in doc_data.
Columns are ordered as doc_id, document metadata, representative-candidate
output columns, and finally text when text is requested and available.
Details
Candidate bands are computed within each dominant topic, not globally across the corpus. When within-topic quantile cut points collapse because of small groups or tied values, the function falls back to deterministic rank-based banding.
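The following base R sketch illustrates the two ideas in this section: banding within each dominant topic, and falling back to rank-based banding when the quantile cut points collapse. It is a minimal illustration under stated assumptions, not the function's actual code. The dtw matrix, the helper band_within_topic(), the tie-breaking rule, and the exact form of the rank fallback are all hypothetical.

```r
# Hypothetical document-topic weights: rows are documents, columns are topics.
dtw <- matrix(
  c(0.90, 0.10,
    0.70, 0.30,
    0.60, 0.40,
    0.55, 0.45,
    0.20, 0.80,
    0.20, 0.80),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("doc", 1:6), c("Topic001", "Topic002"))
)

quantile_probs <- c(0.25, 0.50, 0.75)
labels <- c("VLOW", "LOW", "HIGH", "VHIGH")

# Dominant topic per document; ties go to the first topic (an assumption).
topic_max_int <- max.col(dtw, ties.method = "first")
topic_max_id <- colnames(dtw)[topic_max_int]
topic_max_value <- dtw[cbind(seq_len(nrow(dtw)), topic_max_int)]

# Band one topic's values: quantile cuts when distinct, else rank-based bins.
band_within_topic <- function(values) {
  cuts <- stats::quantile(values, probs = quantile_probs, names = FALSE)
  if (anyDuplicated(cuts) == 0L) {
    # Count the cut points strictly below each value to get its band index.
    idx <- findInterval(values, cuts, left.open = TRUE) + 1L
  } else {
    # Assumed fallback: spread deterministic ranks evenly over the bands.
    ranks <- rank(values, ties.method = "first")
    idx <- ceiling(ranks * length(labels) / length(values))
  }
  labels[idx]
}

# Apply the banding separately to each dominant topic, then restore order.
candidate_band <- unsplit(
  lapply(split(topic_max_value, topic_max_int), band_within_topic),
  topic_max_int
)

print(data.frame(doc_id = rownames(dtw), topic_max_id, topic_max_value,
                 candidate_band))
```

In this sketch the four Topic001 documents receive four distinct quartile bands, while the two Topic002 documents share the value 0.80, so their cut points coincide and the rank-based branch assigns the bands instead.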
Examples
dtm <- methods::as(
  Matrix::Matrix(
    matrix(
      c(1, 0, 1,
        1, 1, 0,
        0, 1, 1,
        1, 1, 1),
      nrow = 4,
      byrow = TRUE
    ),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:4)
colnames(dtm) <- paste0("term", 1:3)
metadata <- data.table::data.table(
  doc_id = rownames(dtm),
  year = 2020:2023,
  text = c("alpha beta", "beta gamma", "gamma delta", "alpha delta")
)
fit <- fit_topic_model(
  dtm,
  engine = "text2vec",
  model = "lda",
  k = 2,
  doc_data = metadata,
  control = list(fit = list(n_iter = 25, progressbar = FALSE))
)
get_representative_candidates(fit, include_text = TRUE)
#> doc_id year Topic001 Topic002 topic_max_id topic_max_int topic_max_value
#> <char> <int> <num> <num> <char> <int> <num>
#> 1: doc3 2022 1.0000000 0.0000000 Topic001 1 1.0000000
#> 2: doc1 2020 0.5000000 0.5000000 Topic001 1 0.5000000
#> 3: doc2 2021 0.0000000 1.0000000 Topic002 2 1.0000000
#> 4: doc4 2023 0.3333333 0.6666667 Topic002 2 0.6666667
#> candidate_band topic_rank text
#> <char> <int> <char>
#> 1: VHIGH 1 gamma delta
#> 2: VLOW 2 alpha beta
#> 3: VHIGH 1 beta gamma
#> 4: VLOW 2 alpha delta