Skip to contents

Tokenize a quanteda::corpus() in parallel using backends from the parallel package. By default, tokenization is parallelized with PSOCK clusters (parallel::clusterApplyLB()) for stability across platforms. Optionally, FORK-based parallelism (parallel::mclapply()) may be requested on Linux/macOS, but this can lead to crashes or silent failures because quanteda uses C++ and OpenMP internally.

Usage

tokenize_corpus(
  x,
  ncores = 1,
  nchunks = ncores,
  socket = c("PSOCK", "FORK"),
  ...
)

Arguments

x

A quanteda::corpus() object.

ncores

Integer. Number of CPU cores to use for parallel processing. Defaults to 1 (sequential).

nchunks

Integer. Number of chunks to split the corpus into. Defaults to ncores. Setting nchunks > ncores can improve load balancing when documents vary in size. See Details.

socket

Character. Parallel backend to use. One of "PSOCK" (default, recommended) or "FORK". On Windows, "FORK" is not supported and will trigger an error.

...

Additional arguments passed to quanteda::tokens().

Value

A quanteda::tokens() object containing tokenized documents with the same number and order of documents as the input corpus.

Details

The corpus is first split into balanced chunks of documents depending on ncores. Each chunk is tokenized in parallel by quanteda::tokens(), and the resulting token objects are combined. Original document order is restored before returning the result. The corpus is split into nchunks balanced chunks by document IDs. With PSOCK backends, parallel::clusterApplyLB() is used to assign chunks dynamically across ncores workers. This improves utilization when document sizes are highly variable. With FORK, parallel::mclapply() is used, which distributes chunks upfront.

Choosing the relationship between ncores and nchunks has important performance implications:

  • When nchunks == ncores (the default), each worker processes exactly one chunk. This minimizes splitting overhead and is appropriate when documents are relatively homogeneous in length.

  • When nchunks > ncores, there are more chunks than workers. Workers receive chunks dynamically and pick up additional work as soon as they finish their current assignment. This improves load balancing when documents vary widely in size or complexity, but introduces some overhead from managing more tasks.

  • When nchunks < ncores, some workers will remain idle because there are not enough chunks to fully occupy all cores. This reduces overhead but wastes available parallel resources so it's not a recommended setting.

In practice, setting nchunks slightly larger than ncores (e.g., 2-4x) often gives the best balance between parallel efficiency and scheduling overhead for large, heterogeneous corpora.

On small corpora, parallelization may add overhead compared to sequential tokenization. For large corpora, using multiple cores with the PSOCK backend typically yields the best balance of performance and reliability. Although FORK can be faster by avoiding serialization, it is less stable when combined with quanteda's use of C++/OpenMP.

Note

By default, socket = "PSOCK". Using socket = "FORK" on Linux/macOS may be faster but is discouraged when tokenizing large corpora with quanteda, as it can lead to undefined behavior. If you insist on using socket = "FORK", consider setting environment variables such as OMP_NUM_THREADS=1 and/or quanteda_options(threads = 1)) to reduce conflicts. On Windows, setting socket = "FORK" will result in an error.

Examples

corp <- quanteda::corpus(
  c("Cats are running", "Dogs were barking")
)

toks <- tokenize_corpus(corp, remove_punct = TRUE)
#> 
#> ── Tokenizing corpus ──
#> 
#>  quanteda::tokens() has been called with user parameters
#> → remove_punct = TRUE
#>  Tokenizing sequentially
#>  Corpus successfully tokenized
toks
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "Cats"    "are"     "running"
#> 
#> text2 :
#> [1] "Dogs"    "were"    "barking"
#>