Fast Corpus Summarization — summarize

Summarizes a quanteda::corpus() with a single call to quanteda.textstats::textstat_summary(), whose tokenization and counting run in quanteda's multithreaded C++ core. On top of that core, this wrapper validates its input, returns a data.table keyed by doc_id, and scopes quanteda's internal multithreading through the threads argument.

Usage

summarize_corpus(
  x,
  threads = NULL,
  ncores = NULL,
  nchunks = NULL,
  socket = NULL,
  ...
)

Arguments

x: A quanteda::corpus() object.
threads: Integer or NULL. Number of threads quanteda's internal (TBB) pool may use for this call. The previous setting is restored on exit. NULL (default) leaves quanteda::quanteda_options("threads") untouched; quanteda itself defaults to all available cores, so the default is already parallel. See the Threading section of tokenize_corpus().
ncores, nchunks, socket: Deprecated since NLPstudio 1.2.0. The chunked process-level (PSOCK/FORK) backend has been removed because it duplicated quanteda's internal multithreading while adding cluster startup, serialization, and peak-memory overhead. Supplying any of these arguments raises a warning of class NLPstudio_deprecated. nchunks and socket are otherwise ignored; ncores is mapped to threads when threads is not given. These arguments will be removed in a future release.
...: Additional arguments passed to quanteda.textstats::textstat_summary().
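
The save-and-restore behaviour described for threads can be sketched with quanteda's own options API. This is only an illustration under stated assumptions, not NLPstudio's actual implementation: with_quanteda_threads() is a hypothetical helper, and the sketch assumes a quanteda build in which quanteda_options(threads = ) takes effect.

```r
library(quanteda)

# Hypothetical helper (not part of NLPstudio): evaluate `expr` with a
# temporary quanteda thread count, restoring the previous value on exit.
with_quanteda_threads <- function(threads, expr) {
  if (!is.null(threads)) {
    old <- quanteda_options("threads")
    # Registered before the change, so the old value is restored even if
    # `expr` raises an error.
    on.exit(quanteda_options(threads = old), add = TRUE)
    quanteda_options(threads = as.integer(threads))
  }
  # `expr` is a lazy promise, so it is evaluated only here, after the
  # thread count has been changed.
  force(expr)
}

before <- quanteda_options("threads")
inside <- with_quanteda_threads(1L, quanteda_options("threads"))
after  <- quanteda_options("threads")
```

Passing NULL skips the option change entirely, which mirrors the documented default of leaving quanteda's global setting untouched.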

Value

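A data.table keyed by doc_id, with one row per document and the counts computed by quanteda.textstats::textstat_summary(): chars, sents, tokens, types, puncts, numbers, symbols, urls, tags, and emojis.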
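
Because the result is keyed by doc_id, individual documents can be looked up by name. The sketch below approximates the return shape using only quanteda.textstats and data.table, so it runs without NLPstudio. The conversion and renaming steps are assumptions about how such a wrapper could build its result, not a description of the package's internals.

```r
library(quanteda)
library(data.table)

corp <- corpus(c(
  doc1 = "A short document with simple language.",
  doc2 = "A second text with more tokens and more variation."
))

# Assumed approximation of the wrapper's output: convert the summary to a
# data.table, name the first (document identifier) column doc_id, and key
# the table on it.
stats <- as.data.table(quanteda.textstats::textstat_summary(corp))
setnames(stats, 1L, "doc_id")
setkey(stats, doc_id)

# Keyed lookup: select a row by document name, then pick columns.
stats["doc2", .(doc_id, tokens, types)]
```

Keying also sorts the rows by doc_id, so downstream joins against other per-document tables keyed the same way do not need an explicit on = clause.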

Details

The chunked process-level backend was removed in NLPstudio 1.2.0; see the Threading section of tokenize_corpus() and vignette("performance-and-threading", package = "NLPstudio").
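
For code migrating off the removed backend, the deprecation behaviour described under Arguments (a classed warning, with ncores mapped to threads only when threads is not given) might look roughly like the following. resolve_threads() and the message wording are hypothetical and written in base R only; they are not NLPstudio's actual code.

```r
# Hypothetical helper (not NLPstudio's actual code): resolve the effective
# thread count from the current and the deprecated arguments.
resolve_threads <- function(threads = NULL, ncores = NULL,
                            nchunks = NULL, socket = NULL) {
  supplied <- c(ncores  = !is.null(ncores),
                nchunks = !is.null(nchunks),
                socket  = !is.null(socket))
  if (any(supplied)) {
    # Build a classed warning so callers can handle it selectively.
    cond <- structure(
      class = c("NLPstudio_deprecated", "warning", "condition"),
      list(
        message = paste0("Deprecated argument(s) supplied: ",
                         paste(names(supplied)[supplied], collapse = ", ")),
        call = sys.call()
      )
    )
    warning(cond)
  }
  # ncores is honoured only when threads was not given explicitly.
  if (is.null(threads) && !is.null(ncores)) threads <- ncores
  threads
}

# Callers can intercept the classed warning without silencing others.
caught <- NULL
result <- withCallingHandlers(
  resolve_threads(ncores = 2L, nchunks = 4L),
  NLPstudio_deprecated = function(w) {
    caught <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)
```

Handling the warning by class rather than by message text keeps calling code stable if the wording changes before the arguments are removed.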

Examples

corp <- quanteda::corpus(c(
  doc1 = "A short document with simple language.",
  doc2 = "A second text with more tokens and more variation."
))

summarize_corpus(corp)
#> 
#> ── Summarizing corpus ──
#> 
#> ✔ Corpus summarization complete
#>    doc_id chars sents tokens types puncts numbers symbols  urls  tags emojis
#>    <char> <int> <int>  <int> <int>  <int>   <int>   <int> <int> <int>  <int>
#> 1:   doc1    38     1      7     7      1       0       0     0     0      0
#> 2:   doc2    50     1     10     9      1       0       0     0     0      0