define_corpus() builds a quanteda::corpus() from structured text data
contained in a data.table::data.table, typically created by
from_json_to_df(). The method builds an identifier for each document, uses it
as the document ID, and warns when identifiers are not unique.
Usage
define_corpus(x, ...)
# Default S3 method
define_corpus(x, ...)
# S3 method for class 'data.table'
define_corpus(x, ...)
Arguments
- x
A data.table::data.table with at least two columns:
text (character vector of document texts) and filename (character vector of source file names). The item column is also used when building document IDs (see Details). Usually this is the output of from_json_to_df().
- ...
Currently not used.
Value
A quanteda::corpus() object with a set of document-level variables (i.e., docvars).
Details
The function constructs a doc_id_corpus variable by combining the
filename (stripped of extensions .htm or .txt) with the item
column. This identifier is used as the document ID when building the
quanteda corpus. If duplicate IDs are detected, a warning is issued.
After the corpus is built, the temporary columns (filename2 and
doc_id_corpus) are removed from the input table, and only the corpus
object is returned. As the example output shows, the returned corpus
keeps filename2 as a docvar.
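The ID construction described above can be sketched in base R. This is a minimal illustration, not the package's implementation: build_doc_ids() is a hypothetical helper, and the underscore separator is inferred from the document names in the example output (filing_a_item1).

```r
# Hypothetical helper illustrating the documented ID logic;
# not the actual code inside define_corpus().
build_doc_ids <- function(filename, item) {
  # Drop a trailing .htm or .txt extension from each file name.
  stem <- sub("\\.(htm|txt)$", "", filename)

  # Combine the stripped file name with the item label.
  # The "_" separator is an assumption based on the example output.
  ids <- paste(stem, item, sep = "_")

  # Warn, rather than stop, when two rows produce the same ID.
  if (anyDuplicated(ids) > 0) {
    dups <- unique(ids[duplicated(ids)])
    warning("Duplicate document IDs detected: ",
            paste(dups, collapse = ", "))
  }
  ids
}

ids <- build_doc_ids(
  filename = c("filing_a.txt", "filing_b.htm"),
  item     = c("item1", "item1")
)
print(ids)
```

Two rows that share both filename and item would produce the same ID and trigger the warning, which is the case the duplicate check is meant to surface.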
Although one could call quanteda::corpus() directly on the output of
from_json_to_df(), it is recommended to use define_corpus(). This
ensures consistent handling of document IDs, automatic duplicate checks,
and integration with the rest of the NLPstudio pipeline.
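For comparison, the direct route might look like the sketch below. It assumes the quanteda package is installed and uses a plain data.frame whose doc_id_corpus column is built by hand for illustration. docid_field and text_field are arguments of quanteda's data.frame method for corpus(); everything else here is illustrative, and none of the checks or messages that define_corpus() provides are performed.

```r
library(quanteda)

# Illustrative input; the doc_id_corpus column is built by hand here,
# which is the step define_corpus() is documented to handle itself.
df <- data.frame(
  filename = c("filing_a.txt", "filing_b.txt"),
  item     = c("item1", "item1"),
  text     = c(
    "The first filing contains a short disclosure.",
    "The second filing contains another disclosure."
  ),
  stringsAsFactors = FALSE
)
df$doc_id_corpus <- paste(
  sub("\\.(htm|txt)$", "", df$filename),
  df$item,
  sep = "_"
)

# Build the corpus directly, naming the ID and text columns explicitly.
corp_direct <- corpus(
  df,
  docid_field = "doc_id_corpus",
  text_field  = "text"
)

print(docnames(corp_direct))
```

With this approach the caller is responsible for building the IDs and checking them for duplicates before the corpus is created.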
Examples
dt <- data.table::data.table(
  filename = c("filing_a.txt", "filing_b.txt"),
  item = c("item1", "item1"),
  text = c(
    "The first filing contains a short disclosure.",
    "The second filing contains another disclosure."
  )
)
corp <- define_corpus(dt)
#>
#> ── Building corpus from data.table ──
#>
#> ✔ Corpus built with 2 documents
summary(corp)
#> Corpus consisting of 2 documents, showing 2 documents:
#>
#>            Text Types Tokens Sentences     filename  item filename2
#>  filing_a_item1     8      8         1 filing_a.txt item1  filing_a
#>  filing_b_item1     7      7         1 filing_b.txt item1  filing_b
#>
quanteda::docvars(corp)
#>       filename  item filename2
#> 1 filing_a.txt item1  filing_a
#> 2 filing_b.txt item1  filing_b
