
Look up tokens in parallel using backends from the parallel package. By default, lookup is parallelized with PSOCK clusters (parallel::clusterApplyLB()), which behave consistently across platforms and balance load dynamically. Optionally, FORK-based parallelism (parallel::mclapply()) may be requested on Linux/macOS, but this can lead to instability with quanteda (see Note).

Usage

lookup_tokens(
  x,
  ncores = 1,
  nchunks = ncores,
  socket = c("PSOCK", "FORK"),
  ...
)

Arguments

x

A quanteda::tokens() object, such as the output of tokenize_corpus().

ncores

Integer. Number of CPU cores to use for parallel processing. Defaults to 1 (sequential).

nchunks

Integer. Number of chunks to split the input into. Defaults to ncores. Setting nchunks > ncores can improve load balancing when documents vary in size. See Details.

socket

Character. Parallel backend to use. One of "PSOCK" (default, recommended) or "FORK". On Windows, "FORK" is not supported and will trigger an error.

...

Additional arguments passed to quanteda::tokens_lookup().

Value

A quanteda::tokens() object with lookups applied and documents in the same order as the input.

Details

The parallel strategy is described in more detail in the documentation for tokenize_corpus().
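
The general idea of chunked, load-balanced dispatch can be sketched with the parallel package alone. This is only an illustration and not the code used inside lookup_tokens(): the helper split_indices(), the toy function process_chunk() and the sample vector docs are hypothetical names introduced for the example.

```r
library(parallel)

# Hypothetical helper (not part of the package): split the indices 1..n into
# nchunks contiguous groups of roughly equal size.
split_indices <- function(n, nchunks) {
  nchunks <- max(1L, min(as.integer(nchunks), n))
  unname(split(seq_len(n), ceiling(seq_len(n) * nchunks / n)))
}

# Toy stand-in for the per-chunk lookup step: upper-case each element.
process_chunk <- function(chunk) toupper(chunk)

docs <- c(d1 = "cats", d2 = "dogs", d3 = "markets", d4 = "firms", d5 = "policy")
ncores <- 2L
nchunks <- 4L  # more chunks than workers, so a free worker can take the next chunk

chunks <- lapply(split_indices(length(docs), nchunks), function(i) docs[i])

cl <- makeCluster(ncores, type = "PSOCK")
res <- tryCatch(
  clusterApplyLB(cl, chunks, process_chunk),
  finally = stopCluster(cl)
)

# clusterApplyLB() returns results in the order of the input list, so
# concatenating them restores the original document order.
result <- unlist(res)
```

With nchunks greater than ncores, a worker that finishes a small chunk early is handed the next one, which is the load-balancing effect described for the nchunks argument.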

Note

By default, socket = "PSOCK". Using socket = "FORK" on Linux/macOS may be faster but is discouraged when processing large corpora with quanteda, as it can lead to undefined behavior. If you insist on using socket = "FORK", consider setting the environment variable OMP_NUM_THREADS=1 and/or calling quanteda_options(threads = 1) to reduce conflicts. On Windows, setting socket = "FORK" will result in an error.
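
The thread-limiting advice in this note could be applied as follows before a FORK-based call. This is a minimal sketch: the final call is left commented out because it assumes that the package providing lookup_tokens() is attached and that toks and dict exist as in the Examples section.

```r
# Limit nested multithreading before any FORK-based call. Forked workers are
# copies of the parent R process, so they start from these settings.
Sys.setenv(OMP_NUM_THREADS = "1")

# Only touch quanteda's thread option if quanteda is installed.
if (requireNamespace("quanteda", quietly = TRUE)) {
  quanteda::quanteda_options(threads = 1)
}

# Assumption: lookup_tokens(), toks and dict are available (see Examples).
# res <- lookup_tokens(toks, dictionary = dict, ncores = 2, socket = "FORK")
```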

Examples

corp <- quanteda::corpus(c(
  doc1 = "Cats and dogs run quickly.",
  doc2 = "Markets and firms react to policy news."
))
toks <- tokenize_corpus(corp)
#> 
#> ── Tokenizing corpus ──
#> 
#>  quanteda::tokens() has been called with default parameters
#>  Tokenizing sequentially
#>  Corpus successfully tokenized

dict <- quanteda::dictionary(list(
  animals = c("cat*", "dog*"),
  economics = c("market*", "firm*", "policy")
))

lookup_tokens(toks, dictionary = dict)
#>  quanteda::tokens_lookup() has been called with default parameters
#> 
#> ── Lookup tokens ──
#> 
#>  Lookup sequentially
#>  Lookup complete
#> Tokens consisting of 2 documents.
#> doc1 :
#> [1] "animals" "animals"
#> 
#> doc2 :
#> [1] "economics" "economics" "economics"
#>