define_corpus() builds a quanteda::corpus() from structured text data held in a data.table::data.table, typically created by from_json_to_df(). The method gives each document an identifier, checks that the identifiers are unique (warning if they are not), and attaches it as a document variable.

Usage

define_corpus(x, ...)

# Default S3 method
define_corpus(x, ...)

# S3 method for class 'data.table'
define_corpus(x, ...)

Arguments

x

A data.table::data.table with at least two columns: text (character vector of document texts) and filename (character vector of source file names). The document ID is additionally built from an item column (see Details). Usually this is the output of from_json_to_df().

...

Currently not used.

Value

A quanteda::corpus() object with a set of document-level variables (i.e., docvars).

Details

The function constructs a doc_id_corpus variable by combining the filename (stripped of extensions .htm or .txt) with the item column. This identifier is used as the document ID when building the quanteda corpus. If duplicate IDs are detected, a warning is issued.
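
The ID construction described above can be approximated in base R. This is only a sketch, not the package's internal code: the helper name build_doc_id() is hypothetical, and the underscore separator is inferred from the document names shown in the Examples output (for instance filing_a_item1).

```r
# Hypothetical helper: the name and exact rules are assumptions,
# not taken from the package source.
build_doc_id <- function(filename, item) {
  # Drop a trailing .htm or .txt extension from the file name
  stem <- sub("\\.(htm|txt)$", "", filename)
  # Join stem and item with an underscore, e.g. "filing_a_item1"
  paste(stem, item, sep = "_")
}

ids <- build_doc_id(
  filename = c("filing_a.txt", "filing_b.htm", "filing_a.txt"),
  item     = c("item1", "item1", "item1")
)

# anyDuplicated() returns 0 when every value is unique,
# otherwise the index of the first duplicated element
if (anyDuplicated(ids) > 0) {
  warning(
    "Duplicate document IDs: ",
    paste(unique(ids[duplicated(ids)]), collapse = ", ")
  )
}
```

Here the first and third rows collapse to the same ID, so the sketch raises a warning, which mirrors the duplicate check described above.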

After the corpus is built, the temporary columns filename2 and doc_id_corpus are removed from the input table, and only the corpus object is returned. As the example output shows, filename2 is kept as a document variable in the corpus itself.

Calling quanteda::corpus() directly on the output of from_json_to_df() is possible, but define_corpus() is recommended because it handles document IDs consistently, checks for duplicates automatically, and integrates with the rest of the NLPstudio pipeline.
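
For comparison, the direct route might look roughly like the sketch below. It assumes quanteda is installed and uses a plain data.frame with made-up rows. The doc_id column and the way it is built are illustrative assumptions, not what define_corpus() necessarily does internally.

```r
library(quanteda)

# Illustrative input with the same columns as the output of from_json_to_df()
df <- data.frame(
  filename = c("filing_a.txt", "filing_b.txt"),
  item = c("item1", "item1"),
  text = c(
    "The first filing contains a short disclosure.",
    "The second filing contains another disclosure."
  ),
  stringsAsFactors = FALSE
)

# With the direct route, the document ID has to be built by hand
# (assumed rule: file name without extension, underscore, item)
df$doc_id <- paste(sub("\\.(htm|txt)$", "", df$filename), df$item, sep = "_")

# corpus() takes the ID and text columns by name;
# the remaining columns become document variables
corp_direct <- quanteda::corpus(df, docid_field = "doc_id", text_field = "text")

quanteda::docnames(corp_direct)
quanteda::docvars(corp_direct)
```

The ID construction and any duplicate handling are left to the caller in this version, which is the bookkeeping that define_corpus() is meant to take over.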

Examples

dt <- data.table::data.table(
  filename = c("filing_a.txt", "filing_b.txt"),
  item = c("item1", "item1"),
  text = c(
    "The first filing contains a short disclosure.",
    "The second filing contains another disclosure."
  )
)

corp <- define_corpus(dt)
#> 
#> ── Building corpus from data.table ──
#> 
#>  Corpus built with 2 documents

summary(corp)
#> Corpus consisting of 2 documents, showing 2 documents:
#> 
#>            Text Types Tokens Sentences     filename  item filename2
#>  filing_a_item1     8      8         1 filing_a.txt item1  filing_a
#>  filing_b_item1     7      7         1 filing_b.txt item1  filing_b
#> 
quanteda::docvars(corp)
#>       filename  item filename2
#> 1 filing_a.txt item1  filing_a
#> 2 filing_b.txt item1  filing_b