Summarize a quanteda::corpus() in parallel using backends from the parallel package. By default, summarization is parallelized with PSOCK clusters (parallel::clusterApplyLB()) for stable cross-platform performance and dynamic load balancing. Optionally, FORK-based parallelism (parallel::mclapply()) may be requested on Linux/macOS, but this can lead to instability with quanteda (see Note).

Usage

summarize_corpus(
  x,
  ncores = 1,
  nchunks = ncores,
  socket = c("PSOCK", "FORK"),
  ...
)

Arguments

x

A quanteda::corpus() object.

ncores

Integer. Number of CPU cores to use for parallel processing. Defaults to 1 (sequential).

nchunks

Integer. Number of chunks to split the corpus into. Defaults to ncores. Setting nchunks > ncores can improve load balancing when documents vary in size. See Details.

socket

Character. Parallel backend to use. One of "PSOCK" (default, recommended) or "FORK". On Windows, "FORK" is not supported and will trigger an error.

...

Additional arguments passed to quanteda.textstats::textstat_summary().

Value

A data.table object with one row per document and summary statistics as columns (for example chars, sents, tokens, and types; see Examples).

Details

The parallel strategy is described in more detail in the documentation for tokenize_corpus().
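
As a rough illustration of why nchunks > ncores can help, the sketch below splits a set of documents into more chunks than workers and dispatches them with parallel::clusterApplyLB(), so that a worker that finishes early picks up the next chunk. It is not the package's implementation: the docs vector is a hypothetical stand-in for a corpus, and the per-chunk function only counts characters instead of calling quanteda.textstats::textstat_summary().

```r
library(parallel)

# Hypothetical stand-in for a corpus: a named character vector of documents.
docs <- c(
  doc1 = "A short document with simple language.",
  doc2 = "A second text with more tokens and more variation.",
  doc3 = "Third.",
  doc4 = "A fourth document, somewhat longer than the third one."
)

ncores  <- 2L
nchunks <- 4L  # more chunks than cores, so idle workers can take remaining chunks

# Assign each document to one of nchunks contiguous chunks of similar size.
chunk_id <- sort(rep_len(seq_len(nchunks), length(docs)))
chunks   <- split(docs, chunk_id)

# PSOCK clusters work on all platforms; always stop the cluster afterwards.
cl <- makeCluster(ncores, type = "PSOCK")
res <- tryCatch(
  clusterApplyLB(cl, chunks, function(chunk) {
    # Illustrative per-chunk summary: document id and character count only.
    data.frame(doc_id = names(chunk), chars = nchar(chunk), row.names = NULL)
  }),
  finally = stopCluster(cl)
)

# Results come back in chunk order, so binding them restores document order.
summary_df <- do.call(rbind, res)
```

With very small inputs like this one, the cost of starting workers outweighs any gain; the benefit of extra chunks shows up only when documents are numerous and vary in size.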

Note

By default, socket = "PSOCK". Using socket = "FORK" on Linux/macOS may be faster but is discouraged when tokenizing large corpora with quanteda, as it can lead to undefined behavior. If you insist on using socket = "FORK", consider setting the environment variable OMP_NUM_THREADS=1 and/or calling quanteda::quanteda_options(threads = 1) to reduce conflicts. On Windows, setting socket = "FORK" will result in an error.
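
The precautions in this note could be applied as in the following sketch before requesting FORK-based parallelism. The socket variable and the fallback to "PSOCK" on Windows are illustrative choices, not part of the package, and the final call is left commented out because it requires the package that provides summarize_corpus().

```r
# Limit OpenMP threading; forked workers inherit the parent's environment.
Sys.setenv(OMP_NUM_THREADS = "1")

# Limit quanteda's own multithreading, if quanteda is installed.
if (requireNamespace("quanteda", quietly = TRUE)) {
  quanteda::quanteda_options(threads = 1)
}

# FORK is not supported on Windows, so fall back to PSOCK there (illustrative).
socket <- if (.Platform$OS.type == "windows") "PSOCK" else "FORK"

# Then, with the package providing summarize_corpus() attached:
# summarize_corpus(corp, ncores = 2, socket = socket)
```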

Examples

corp <- quanteda::corpus(c(
  doc1 = "A short document with simple language.",
  doc2 = "A second text with more tokens and more variation."
))

summarize_corpus(corp)
#> 
#> ── Summarizing corpus ──
#> 
#>  Summarizing sequentially
#>  Corpus summarization complete
#>    doc_id chars sents tokens types puncts numbers symbols  urls  tags emojis
#>    <char> <int> <int>  <int> <int>  <int>   <int>   <int> <int> <int>  <int>
#> 1:   doc1    38     1      7     7      1       0       0     0     0      0
#> 2:   doc2    50     1     10     9      1       0       0     0     0      0