
Extract document-topic weights (DTW) from a supported topic-model object and return them as a standardized data.table.

Usage

get_dtw(
  x,
  doc_data = NULL,
  docvars = FALSE,
  include_text = FALSE,
  doc_id_col = "doc_id",
  text_col = "text"
)

Arguments

x

A supported topic-model object. This includes nlp_topic_fit, raw topicmodels fits, raw seededlda fits, and already standardized DTW tables.

doc_data

Optional document-data override. When supplied, this is used instead of any doc_data stored in x. Accepted inputs are a corpus, data.frame, or data.table keyed by doc_id.

docvars

Should stored or pre-existing document metadata be joined onto the returned DTW table? Defaults to FALSE.

include_text

Should a text column be attached when a text-bearing doc_data source is available? Defaults to FALSE. When TRUE but no text-bearing doc_data is available, the function emits a warning.

doc_id_col

Document-ID column name when doc_data is a data.frame or data.table. Defaults to "doc_id".

text_col

Text column name when doc_data is a data.frame or data.table. Defaults to "text".

Value

A data.table with:

  • doc_id

  • topic columns named Topic001, Topic002, ...

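  • topic_max_id: name of the topic column with the largest weight for the document (for example, Topic001)

  • topic_max_int: integer index of that topic

  • topic_max_value: weight of that topic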

  • stored docvars when docvars = TRUE

  • metadata columns from doc_data when available

  • text when include_text = TRUE and text is available

Columns are ordered as doc_id, document metadata, DTW output columns, and finally text when text is requested and available. For already standardized DTW-table inputs, non-topic metadata columns are treated as pre-existing document metadata and retained only when docvars = TRUE.
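
The relationship between the topic columns and the three topic_max_* columns can be sketched with plain data.table code. This is only an illustration, not necessarily how get_dtw() computes them: the table is built by hand from the values in the example output on this page, and the tie-breaking rule (the first topic wins) is an assumption that happens to match that output.

```r
library(data.table)

# Hand-built stand-in for a standardized DTW table
# (weights copied from the example output on this page).
dtw <- data.table(
  doc_id   = paste0("doc", 1:4),
  Topic001 = c(1, 0.5, 0.5, 2 / 3),
  Topic002 = c(0, 0.5, 0.5, 1 / 3)
)

# Topic columns follow the Topic001, Topic002, ... naming scheme.
topic_cols <- grep("^Topic[0-9]+$", names(dtw), value = TRUE)
w <- as.matrix(dtw[, ..topic_cols])

# Position of the largest weight in each row. Ties go to the first
# topic, which is an assumption rather than documented behavior.
idx <- as.integer(max.col(w, ties.method = "first"))

dtw[, topic_max_id := topic_cols[idx]]
dtw[, topic_max_int := idx]
dtw[, topic_max_value := w[cbind(seq_len(nrow(w)), idx)]]

dtw
```

Because the summary columns are derived entirely from the topic columns, they can be recomputed after filtering or renormalizing a DTW table.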

Examples

dtm <- methods::as(
  Matrix::Matrix(
    matrix(
      c(1, 0, 1,
        1, 1, 0,
        0, 1, 1,
        1, 1, 1),
      nrow = 4,
      byrow = TRUE
    ),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:4)
colnames(dtm) <- paste0("term", 1:3)

fit <- fit_topic_model(
  dtm,
  engine = "text2vec",
  model = "lda",
  k = 2,
  control = list(fit = list(n_iter = 25, progressbar = FALSE))
)

get_dtw(fit)
#>    doc_id  Topic001  Topic002 topic_max_id topic_max_int topic_max_value
#>    <char>     <num>     <num>       <char>         <int>           <num>
#> 1:   doc1 1.0000000 0.0000000     Topic001             1       1.0000000
#> 2:   doc2 0.5000000 0.5000000     Topic001             1       0.5000000
#> 3:   doc3 0.5000000 0.5000000     Topic001             1       0.5000000
#> 4:   doc4 0.6666667 0.3333333     Topic001             1       0.6666667
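# This fit was built from a bare matrix with no document metadata,
# so the result matches the call above.
get_dtw(fit, docvars = TRUE)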
#>    doc_id  Topic001  Topic002 topic_max_id topic_max_int topic_max_value
#>    <char>     <num>     <num>       <char>         <int>           <num>
#> 1:   doc1 1.0000000 0.0000000     Topic001             1       1.0000000
#> 2:   doc2 0.5000000 0.5000000     Topic001             1       0.5000000
#> 3:   doc3 0.5000000 0.5000000     Topic001             1       0.5000000
#> 4:   doc4 0.6666667 0.3333333     Topic001             1       0.6666667
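
The doc_data, doc_id_col, and text_col arguments are not exercised above. A minimal sketch of how they might be combined follows, assuming fit from the previous example is still in scope. The data frame meta and its columns id, body, and year are hypothetical names chosen for illustration, and no output is shown because it depends on the fitted model.

```r
# Hypothetical document table keyed by a non-default ID column.
meta <- data.frame(
  id   = paste0("doc", 1:4),
  body = c("first text", "second text", "third text", "fourth text"),
  year = c(2019L, 2020L, 2021L, 2022L),
  stringsAsFactors = FALSE
)

# Use meta instead of any doc_data stored in fit, and tell get_dtw()
# which columns hold the document IDs and the text.
get_dtw(
  fit,
  doc_data = meta,
  include_text = TRUE,
  doc_id_col = "id",
  text_col = "body"
)
```

Per the Value section, metadata columns from doc_data should appear between doc_id and the topic columns, with text placed last.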