Predict standardized document-topic weights (DTW) for new documents using a
fitted object returned by fit_topic_model().
Usage
predict_topic_model(
  x,
  newdata,
  control = list(),
  docvars = FALSE,
  doc_data = NULL,
  include_text = FALSE,
  doc_id_col = "doc_id",
  text_col = "text"
)
Arguments
- x
An object of class nlp_topic_fit.
- newdata
New document-feature input. Supported classes are dgCMatrix-class, dfm, and
DocumentTermMatrix.
- control
A named list of backend-specific prediction arguments. Defaults to list().
  - text2vec: forwarded to model_object$transform()
  - topicmodels: forwarded to topicmodels::posterior()
  - seededlda: forwarded to the relevant textmodel_*() update call
  - topicmodels.etm: forwarded to stats::predict() with type = "topics"
  - stm: forwarded to stm::fitNewDocuments() for STM fits without prevalence covariates
- docvars
Should available docvars from newdata be joined onto the returned DTW table?
Defaults to FALSE.
- doc_data
Optional document-data sidecar for metadata or text enrichment. Accepted inputs
are a corpus, data.frame, or data.table keyed by doc_id.
- include_text
Should a text column be attached when a text-bearing doc_data source is
available? Defaults to FALSE.
- doc_id_col
Document-ID column name when doc_data is a data.frame or data.table. Defaults
to "doc_id".
- text_col
Text column name when doc_data is a data.frame or data.table. Defaults to
"text".
Value
A standardized DTW data.table with:
- doc_id
- topic columns named Topic001, Topic002, ...
- topic_max_id
- topic_max_int
- topic_max_value
- available docvars when docvars = TRUE
- optional metadata/text joined from doc_data
Columns are ordered as doc_id, document metadata, DTW output columns, and
finally text when text is requested and available.
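The layout described above can be illustrated with a minimal base-R sketch. It is not the package's implementation: the weights, the `source` metadata column, and the tie-breaking rule (first topic wins, inferred from the example output further down) are all assumptions made for illustration, and a plain data.frame stands in for the data.table that predict_topic_model() returns.

```r
# Illustrative topic weights for two documents (made-up numbers).
theta <- matrix(
  c(0.7, 0.3,
    0.2, 0.8),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("new1", "new2"), sprintf("Topic%03d", 1:2))
)

# Column index of the largest weight per row; ties go to the first topic
# (an assumption, not confirmed package behaviour).
top <- max.col(theta, ties.method = "first")

dtw <- data.frame(
  doc_id = rownames(theta),
  theta,
  topic_max_id = colnames(theta)[top],
  topic_max_int = top,
  topic_max_value = theta[cbind(seq_len(nrow(theta)), top)],
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Hypothetical doc_data sidecar keyed by doc_id.
doc_data <- data.frame(
  doc_id = c("new1", "new2"),
  source = c("web", "print"),
  text = c("first new document", "second new document"),
  stringsAsFactors = FALSE
)

# Join on doc_id, then order columns: doc_id, metadata, DTW output, text.
joined <- merge(dtw, doc_data, by = "doc_id", all.x = TRUE)
meta_cols <- setdiff(names(doc_data), c("doc_id", "text"))
dtw_cols <- setdiff(names(dtw), "doc_id")
joined <- joined[, c("doc_id", meta_cols, dtw_cols, "text")]
print(joined)
```

Here topic_max_int is the integer position of the dominant topic, topic_max_id is the matching column name, and topic_max_value is the weight itself.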
Details
Prediction input is first aligned to the fitted vocabulary stored in x$vocab.
Terms absent from the fitted vocabulary are dropped with a warning, missing
fitted terms are added as zero columns, columns are reordered to fitted
vocabulary order, and any documents that become empty after alignment are
dropped with a warning.
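The alignment steps can be sketched roughly as follows for a dgCMatrix input. `align_to_vocab()` is a hypothetical helper written for illustration, not a function exported by the package, and the warning messages are invented; only the order of operations follows the description above.

```r
library(Matrix)

# Hypothetical helper: align a sparse document-term matrix to a fitted
# vocabulary (a character vector such as x$vocab).
align_to_vocab <- function(newdata, vocab) {
  # 1. Drop terms that the fitted model has never seen.
  unknown <- setdiff(colnames(newdata), vocab)
  if (length(unknown) > 0) {
    warning(length(unknown), " term(s) not in the fitted vocabulary were dropped")
  }
  kept <- intersect(vocab, colnames(newdata))
  aligned <- newdata[, kept, drop = FALSE]

  # 2. Add all-zero columns for fitted terms missing from newdata.
  missing_terms <- setdiff(vocab, kept)
  if (length(missing_terms) > 0) {
    zeros <- Matrix::sparseMatrix(
      i = integer(0), j = integer(0), x = numeric(0),
      dims = c(nrow(newdata), length(missing_terms)),
      dimnames = list(rownames(newdata), missing_terms)
    )
    aligned <- cbind(aligned, zeros)
  }

  # 3. Reorder columns to the fitted vocabulary order.
  aligned <- aligned[, vocab, drop = FALSE]

  # 4. Drop documents left with no terms after alignment.
  empty <- Matrix::rowSums(aligned) == 0
  if (any(empty)) {
    warning(sum(empty), " document(s) became empty after alignment and were dropped")
    aligned <- aligned[!empty, , drop = FALSE]
  }
  aligned
}

# Usage: document "b" contains only an unknown term, so it is dropped.
vocab <- paste0("term", 1:4)
m <- Matrix::sparseMatrix(
  i = c(1, 1, 2), j = c(1, 2, 2), x = c(1, 2, 3),
  dims = c(2, 2),
  dimnames = list(c("a", "b"), c("term2", "other"))
)
aligned <- suppressWarnings(align_to_vocab(m, vocab))
```

Because empty documents are removed, the returned table can have fewer rows than newdata, which is worth checking before joining predictions back to other data by position rather than by doc_id.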
STM prediction is supported only for fits without prevalence
covariates. For STM fits with prevalence covariates, predict_topic_model()
raises an explicit error rather than guessing how to handle covariates for
new documents.
Examples
dtm <- methods::as(
  Matrix::Matrix(
    matrix(
      c(2, 1, 0, 0,
        1, 1, 1, 0,
        0, 1, 2, 1,
        0, 0, 1, 2),
      nrow = 4,
      byrow = TRUE
    ),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(dtm) <- paste0("doc", 1:4)
colnames(dtm) <- paste0("term", 1:4)
fit <- fit_topic_model(
  dtm,
  engine = "text2vec",
  model = "lda",
  k = 2,
  control = list(fit = list(n_iter = 25, progressbar = FALSE))
)
new_dtm <- methods::as(
  Matrix::Matrix(
    matrix(
      c(1, 0, 0, 1,
        0, 1, 1, 0),
      nrow = 2,
      byrow = TRUE
    ),
    sparse = TRUE
  ),
  "dgCMatrix"
)
rownames(new_dtm) <- c("new1", "new2")
colnames(new_dtm) <- paste0("term", 1:4)
predict_topic_model(fit, new_dtm)
#> INFO [11:32:16.349] early stopping at 10 iteration
#>    doc_id Topic001 Topic002 topic_max_id topic_max_int topic_max_value
#>    <char>    <num>    <num>       <char>         <int>           <num>
#> 1:   new1      0.5      0.5     Topic001             1             0.5
#> 2:   new2      0.5      0.5     Topic001             1             0.5
