Summarize a quanteda::corpus() in parallel using backends from the
parallel package. By default, summarization is parallelized with
PSOCK clusters (parallel::clusterApplyLB()) for stable cross-platform
performance and dynamic load balancing. Optionally, FORK-based
parallelism (parallel::mclapply()) may be requested on Linux/macOS,
but this can lead to instability with quanteda (see Note).
Usage
summarize_corpus(
x,
ncores = 1,
nchunks = ncores,
socket = c("PSOCK", "FORK"),
...
)
Arguments
- x
A quanteda::corpus() object.
- ncores
Integer. Number of CPU cores to use for parallel processing. Defaults to 1 (sequential).
- nchunks
Integer. Number of chunks to split the corpus into. Defaults to ncores. Setting nchunks > ncores can improve load balancing when documents vary in size. See Details.
- socket
Character. Parallel backend to use. One of "PSOCK" (default, recommended) or "FORK". On Windows, "FORK" is not supported and will trigger an error.
- ...
Additional arguments passed to quanteda.textstats::textstat_summary().
Value
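A data.table object with one row per document, holding the document ID and
summary counts (characters, sentences, tokens, types, punctuation, numbers,
symbols, URLs, tags, and emojis).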
Details
The parallel strategy is described in more detail in tokenize_corpus().
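As a rough illustration of the chunked PSOCK approach described above, the following sketch splits a set of texts into more chunks than workers and dispatches them with parallel::clusterApplyLB(). It is not the package's implementation: summarize_chunk() is a hypothetical stand-in that only counts characters (the real function passes chunks to quanteda.textstats::textstat_summary()), and the contiguous chunking scheme is an assumption.

```r
library(parallel)

# Hypothetical stand-in for the per-chunk summary (assumption: the real
# worker would call quanteda.textstats::textstat_summary() on a corpus chunk).
summarize_chunk <- function(texts) {
  data.frame(
    doc_id = names(texts),
    chars = unname(nchar(texts)),
    stringsAsFactors = FALSE
  )
}

texts <- c(
  doc1 = "A short document with simple language.",
  doc2 = "A second text with more tokens and more variation.",
  doc3 = "Third.",
  doc4 = "A fourth document, somewhat longer than the third one."
)

ncores <- 2L
nchunks <- 4L  # more chunks than workers, so a free worker can take the next chunk

# Assign documents to contiguous chunks (assumed scheme, for illustration only)
chunk_id <- cut(seq_along(texts), breaks = nchunks, labels = FALSE)
chunks <- split(texts, chunk_id)

# Start a PSOCK cluster and make sure it is stopped even if a worker fails
cl <- makeCluster(ncores, type = "PSOCK")
parts <- tryCatch(
  clusterApplyLB(cl, chunks, summarize_chunk),
  finally = stopCluster(cl)
)

# Combine the per-chunk results into one table, in the original document order
result <- do.call(rbind, parts)
rownames(result) <- NULL
```

With nchunks > ncores, a worker that finishes a small chunk early can pick up another one, which is where the load-balancing benefit comes from when document sizes vary.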
Note
By default, socket = "PSOCK". Using socket = "FORK" on Linux/macOS
may be faster but is discouraged when tokenizing large corpora with
quanteda, as it can lead to undefined behavior. If you insist on using
socket = "FORK", consider setting the environment variable
OMP_NUM_THREADS=1 and/or calling quanteda_options(threads = 1) to reduce conflicts.
On Windows, setting socket = "FORK" will result in an error.
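The precautions in this Note could be applied before a FORK run roughly as follows. This is a minimal sketch under stated assumptions: choose_socket() is a hypothetical helper that is not part of the package, and it falls back to PSOCK with a warning on Windows, whereas summarize_corpus() itself raises an error in that case.

```r
# Hypothetical helper (not part of the package): return a backend that is
# valid on the current platform, falling back to PSOCK on Windows.
choose_socket <- function(requested = c("PSOCK", "FORK"),
                          os_type = .Platform$OS.type) {
  requested <- match.arg(requested)
  if (requested == "FORK" && os_type == "windows") {
    warning("FORK is not available on Windows; falling back to PSOCK.")
    return("PSOCK")
  }
  requested
}

socket <- choose_socket("FORK")

if (socket == "FORK") {
  # Limit nested multithreading inside forked workers
  Sys.setenv(OMP_NUM_THREADS = "1")
  # Only touch quanteda's thread setting if quanteda is installed
  if (requireNamespace("quanteda", quietly = TRUE)) {
    quanteda::quanteda_options(threads = 1)
  }
}
```

The resulting value of socket can then be passed to the socket argument of summarize_corpus().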
Examples
corp <- quanteda::corpus(c(
doc1 = "A short document with simple language.",
doc2 = "A second text with more tokens and more variation."
))
summarize_corpus(corp)
#>
#> ── Summarizing corpus ──
#>
#> ℹ Summarizing sequentially
#> ✔ Corpus summarization complete
#> doc_id chars sents tokens types puncts numbers symbols urls tags emojis
#> <char> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1: doc1 38 1 7 7 1 0 0 0 0 0
#> 2: doc2 50 1 10 9 1 0 0 0 0 0
