Predict Document Topic Weights for New Data — predict_topic

Predict standardized document-topic weights (DTW) for new documents using a fitted object returned by fit_topic_model().

Usage

predict_topic_model(
  x,
  newdata,
  control = list(),
  docvars = FALSE,
  doc_data = NULL,
  include_text = FALSE,
  doc_id_col = "doc_id",
  text_col = "text"
)

Arguments

x

An object of class nlp_topic_fit.

newdata

New document-feature input. Supported classes are dgCMatrix (Matrix), dfm (quanteda), and DocumentTermMatrix (tm).

control

A named list of backend-specific prediction arguments. Defaults to list(). Where the arguments go depends on the backend used for the fit:

text2vec: forwarded to model_object$transform()
topicmodels: forwarded to topicmodels::posterior()
seededlda: forwarded to the relevant textmodel_*() update call
topicmodels.etm: forwarded to stats::predict() with type = "topics"
stm: forwarded to stm::fitNewDocuments() for STM fits without prevalence covariates

docvars

Should available docvars from newdata be joined onto the returned DTW table? Defaults to FALSE.

doc_data

Optional document-data sidecar for metadata or text enrichment. Accepted inputs are a corpus, data.frame, or data.table keyed by doc_id.

include_text

Should a text column be attached when a text-bearing doc_data source is available? Defaults to FALSE.

doc_id_col

Document-ID column name when doc_data is a data.frame or data.table. Defaults to "doc_id".

text_col

Text column name when doc_data is a data.frame or data.table. Defaults to "text".

Value

A standardized DTW data.table with:

doc_id
topic columns named Topic001, Topic002, ...
topic_max_id
topic_max_int
topic_max_value
available docvars when docvars = TRUE
optional metadata/text joined from doc_data

Columns are ordered as doc_id, document metadata, DTW output columns, and finally text when text is requested and available.
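The topic_max_* summary columns can be reproduced from the topic columns. The sketch below is an illustration, not the package's implementation. It assumes that topic_max_id is the name of the highest-weight topic column, topic_max_int is that column's position among the topic columns, and topic_max_value is its weight, with ties resolved in favor of the first topic (consistent with the 0.5/0.5 row in the Examples output). The dtw table here is built by hand purely for demonstration.

```r
library(data.table)

# Hand-built stand-in for a DTW table (illustrative values only).
dtw <- data.table(
  doc_id   = c("new1", "new2"),
  Topic001 = c(0.5, 0.1),
  Topic002 = c(0.5, 0.9)
)

# Topic columns follow the Topic001, Topic002, ... naming pattern.
topic_cols <- grep("^Topic[0-9]+$", names(dtw), value = TRUE)
weights <- as.matrix(dtw[, ..topic_cols])

# Index of the largest weight per document; ties go to the first topic
# (an assumption about the package's tie-breaking).
idx <- max.col(weights, ties.method = "first")

dtw[, topic_max_id := topic_cols[idx]]
dtw[, topic_max_int := idx]
dtw[, topic_max_value := weights[cbind(seq_len(nrow(weights)), idx)]]
```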

Details

Prediction input is first aligned to the fitted vocabulary stored in x$vocab. Terms in newdata that are absent from the fitted vocabulary are dropped with a warning, fitted terms that are missing from newdata are added as all-zero columns, and columns are reordered to match the fitted vocabulary order. Any documents left empty after alignment are dropped with a warning.
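The alignment steps can be sketched as follows for a dgCMatrix input. This is a minimal illustration, not the package's internal code: align_to_vocab() is a hypothetical helper, and vocab is assumed to be a character vector of fitted terms like the one held in x$vocab.

```r
library(Matrix)

# Hypothetical helper (not part of the package API).
align_to_vocab <- function(newdata, vocab) {
  # 1. Drop terms the fitted model has never seen.
  unseen <- setdiff(colnames(newdata), vocab)
  if (length(unseen) > 0) {
    warning(length(unseen), " term(s) absent from the fitted vocabulary were dropped.")
  }
  shared <- intersect(vocab, colnames(newdata))
  kept <- newdata[, shared, drop = FALSE]

  # 2. Add all-zero columns for fitted terms missing from newdata.
  missing_terms <- setdiff(vocab, shared)
  zeros <- Matrix::sparseMatrix(
    i = integer(0), j = integer(0), x = numeric(0),
    dims = c(nrow(newdata), length(missing_terms)),
    dimnames = list(rownames(newdata), missing_terms)
  )

  # 3. Reorder columns to the fitted vocabulary order.
  aligned <- cbind(kept, zeros)[, vocab, drop = FALSE]

  # 4. Drop documents that have no remaining terms.
  empty <- Matrix::rowSums(aligned) == 0
  if (any(empty)) {
    warning(sum(empty), " document(s) became empty after alignment and were dropped.")
    aligned <- aligned[!empty, , drop = FALSE]
  }
  aligned
}

# Usage: "termX" is unseen, and document "c" has no terms at all.
vocab <- paste0("term", 1:4)
new_dtm <- Matrix::sparseMatrix(
  i = c(1, 1, 2), j = c(1, 2, 3), x = c(1, 2, 1),
  dims = c(3, 3),
  dimnames = list(c("a", "b", "c"), c("term2", "termX", "term1"))
)
aligned <- align_to_vocab(new_dtm, vocab)
```

In this sketch both warnings fire: one for the unseen term and one for the empty document. The result keeps documents "a" and "b" with columns in fitted vocabulary order.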

STM prediction is supported only for fits without prevalence covariates. For STM fits with prevalence covariates, predict_topic_model() stops with an error rather than guessing how covariates for the new documents should be handled.

Examples

dtm <- methods::as(
  Matrix::Matrix(
    matrix(
      c(2, 1, 0, 0,
        1, 1, 1, 0,
        0, 1, 2, 1,
        0, 0, 1, 2),
      nrow = 4,
      byrow = TRUE
    ),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:4)
colnames(dtm) <- paste0("term", 1:4)

fit <- fit_topic_model(
  dtm,
  engine = "text2vec",
  model = "lda",
  k = 2,
  control = list(fit = list(n_iter = 25, progressbar = FALSE))
)

new_dtm <- methods::as(
  Matrix::Matrix(
    matrix(
      c(1, 0, 0, 1,
        0, 1, 1, 0),
      nrow = 2,
      byrow = TRUE
    ),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(new_dtm) <- c("new1", "new2")
colnames(new_dtm) <- paste0("term", 1:4)

predict_topic_model(fit, new_dtm)
#> INFO  [15:06:54.358] early stopping at 10 iteration
#>    doc_id Topic001 Topic002 topic_max_id topic_max_int topic_max_value
#>    <char>    <num>    <num>       <char>         <int>           <num>
#> 1:   new1      0.5      0.5     Topic001             1             0.5
#> 2:   new2      0.1      0.9     Topic002             2             0.9