Extract Standardized Document Topic Weights

Extracts document-topic weights (DTW) from a supported topic-model object and returns them as a standardized data.table.

Usage

get_dtw(
  x,
  doc_data = NULL,
  docvars = FALSE,
  include_text = FALSE,
  doc_id_col = "doc_id",
  text_col = "text"
)

Arguments

x: A supported topic-model object. Supported inputs are nlp_topic_fit objects, raw topicmodels fits, raw seededlda fits, and already standardized DTW tables.
doc_data: Optional document-data override. When supplied, this is used instead of any doc_data stored in x. Accepted inputs are a corpus, data.frame, or data.table keyed by doc_id.
docvars: Logical. If TRUE, document metadata stored in x (or already present in a standardized DTW table) is joined onto the returned DTW table. Defaults to FALSE.
include_text: Logical. If TRUE, a text column is attached when a text-bearing doc_data source is available; if none is available, the function emits a warning. Defaults to FALSE.
doc_id_col: Name of the document-ID column when doc_data is a data.frame or data.table. Defaults to "doc_id".
text_col: Name of the text column when doc_data is a data.frame or data.table. Defaults to "text".

Value

A data.table with:

doc_id
topic columns named Topic001, Topic002, ...
topic_max_id
topic_max_int
topic_max_value
stored docvars when docvars = TRUE
metadata columns from doc_data when available
text when include_text = TRUE and text is available

Columns are ordered as doc_id, document metadata, DTW output columns, and finally text when text is requested and available. For already standardized DTW-table inputs, non-topic metadata columns are treated as pre-existing document metadata and retained only when docvars = TRUE.
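The three topic_max_* columns summarize the dominant topic of each document. The following is a minimal sketch of how such columns could be derived from the topic columns with data.table. It assumes that the summary is a row-wise maximum and that ties resolve to the first topic, which is consistent with the example output on this page; the package's actual implementation may differ. The dtw table and the helper names topic_cols, w, and idx are hypothetical and built by hand, not produced by get_dtw().

```r
library(data.table)

# Hypothetical DTW values, typed in by hand for illustration only
dtw <- data.table(
  doc_id   = paste0("doc", 1:4),
  Topic001 = c(0.5, 0.5, 0, 1 / 3),
  Topic002 = c(0.5, 0.5, 1, 2 / 3)
)

# Pick out the topic columns by their TopicNNN naming pattern
topic_cols <- grep("^Topic[0-9]+$", names(dtw), value = TRUE)
w <- as.matrix(dtw[, ..topic_cols])

# Column index of the largest weight in each row.
# Assumption: ties go to the first (lowest-numbered) topic.
idx <- max.col(w, ties.method = "first")

# Add the summary columns by reference
dtw[, topic_max_id := topic_cols[idx]]
dtw[, topic_max_int := idx]
dtw[, topic_max_value := w[cbind(seq_len(nrow(w)), idx)]]

print(dtw)
```

With ties.method = "first", a document whose weights are split evenly across topics is assigned to the lowest-numbered topic, so topic_max_id alone does not reveal whether the maximum was unique.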

Examples

dtm <- methods::as(
  Matrix::Matrix(
    matrix(
      c(1, 0, 1,
        1, 1, 0,
        0, 1, 1,
        1, 1, 1),
      nrow = 4,
      byrow = TRUE
    ),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:4)
colnames(dtm) <- paste0("term", 1:3)

fit <- fit_topic_model(
  dtm,
  engine = "text2vec",
  model = "lda",
  k = 2,
  control = list(fit = list(n_iter = 25, progressbar = FALSE))
)

get_dtw(fit)
#>    doc_id  Topic001  Topic002 topic_max_id topic_max_int topic_max_value
#>    <char>     <num>     <num>       <char>         <int>           <num>
#> 1:   doc1 0.5000000 0.5000000     Topic001             1       0.5000000
#> 2:   doc2 0.5000000 0.5000000     Topic001             1       0.5000000
#> 3:   doc3 0.0000000 1.0000000     Topic002             2       1.0000000
#> 4:   doc4 0.3333333 0.6666667     Topic002             2       0.6666667
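
# The fit above was built from a bare matrix with no document metadata,
# so docvars = TRUE adds no columns here
get_dtw(fit, docvars = TRUE)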
#>    doc_id  Topic001  Topic002 topic_max_id topic_max_int topic_max_value
#>    <char>     <num>     <num>       <char>         <int>           <num>
#> 1:   doc1 0.5000000 0.5000000     Topic001             1       0.5000000
#> 2:   doc2 0.5000000 0.5000000     Topic001             1       0.5000000
#> 3:   doc3 0.0000000 1.0000000     Topic002             2       1.0000000
#> 4:   doc4 0.3333333 0.6666667     Topic002             2       0.6666667