Look up tokens in parallel using backends from the parallel
package. By default, lookup is parallelized with PSOCK clusters
(parallel::clusterApplyLB()), which behave consistently across
platforms and balance load dynamically. FORK-based parallelism
(parallel::mclapply()) can be requested on Linux/macOS, but it
can be unstable with quanteda (see Note).
Usage
lookup_tokens(
  x,
  ncores = 1,
  nchunks = ncores,
  socket = c("PSOCK", "FORK"),
  ...
)
Arguments
- x
  A quanteda::corpus() object.
- ncores
  Integer. Number of CPU cores to use for parallel processing. Defaults to 1 (sequential).
- nchunks
  Integer. Number of chunks to split the corpus into. Defaults to ncores. Setting nchunks > ncores can improve load balancing when documents vary in size. See Details.
- socket
  Character. Parallel backend to use. One of "PSOCK" (default, recommended) or "FORK". On Windows, "FORK" is not supported and will trigger an error.
- ...
  Additional arguments passed to quanteda::tokens_lookup().
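As an illustration of what can be forwarded through ..., the sketch below calls quanteda::tokens_lookup() directly with its nomatch argument. The objects toks, dict and res are made up for this example, and the sketch assumes quanteda is installed; it shows the underlying quanteda call, not the internals of lookup_tokens().

```r
library(quanteda)

# Illustrative inputs (not taken from the package)
toks <- tokens(c(doc1 = "Cats and dogs run quickly."))
dict <- dictionary(list(animals = c("cat*", "dog*")))

# Default behaviour: only matched tokens are kept, replaced by their key
tokens_lookup(toks, dictionary = dict)

# nomatch labels unmatched tokens instead of dropping them
res <- tokens_lookup(toks, dictionary = dict, nomatch = "other")
as.list(res)$doc1
```

Since ... is passed on to quanteda::tokens_lookup(), the same argument should be usable as lookup_tokens(toks, dictionary = dict, nomatch = "other").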
Value
A quanteda::tokens() object with lookups applied and
documents in the same order as the input.
Details
The parallel strategy is described in more detail in the documentation for tokenize_corpus().
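To give a feel for why nchunks > ncores can help, here is a minimal sketch of a chunked, load-balanced PSOCK run using only the base parallel package. It is not the package's implementation: the docs vector, the chunking rule and the per-chunk function (a plain nchar() call standing in for the real lookup) are assumptions made for illustration.

```r
library(parallel)

# Hypothetical stand-in for a corpus: a named character vector
docs <- c(
  doc1 = "Cats and dogs run quickly.",
  doc2 = "Markets and firms react to policy news.",
  doc3 = "Dogs chase cats.",
  doc4 = "Firms follow markets."
)

ncores  <- 2L
nchunks <- 4L  # more chunks than workers, so idle workers can pick up work

# Assign documents to contiguous chunks, preserving the input order
chunk_id <- ceiling(seq_along(docs) * nchunks / length(docs))
chunks   <- split(docs, chunk_id)

# PSOCK workers are separate R sessions; always stop the cluster afterwards
cl  <- makeCluster(ncores, type = "PSOCK")
res <- tryCatch(
  clusterApplyLB(cl, chunks, function(chunk) nchar(chunk)),
  finally = stopCluster(cl)
)

# clusterApplyLB() returns results in chunk order, so recombining them
# restores the original document order
out <- unlist(unname(res))
names(out)
```

With load balancing, each worker receives a new chunk as soon as it finishes the previous one, which is why smaller chunks can even out uneven document sizes at the cost of more communication overhead.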
Note
By default, socket = "PSOCK". Using socket = "FORK" on Linux/macOS
may be faster but is discouraged when tokenizing large corpora with
quanteda, as it can lead to undefined behavior. If you insist on using
socket = "FORK", consider setting the environment variable
OMP_NUM_THREADS=1 and/or calling quanteda_options(threads = 1) to reduce conflicts.
On Windows, setting socket = "FORK" will result in an error.
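The precautions in this note could be applied roughly as follows before a FORK-based run. This is a sketch under assumptions: the socket variable and the Windows fallback are illustrative choices, and whether these settings fully avoid conflicts depends on the installed libraries.

```r
# Limit implicit multithreading before any processes are forked
Sys.setenv(OMP_NUM_THREADS = "1")

# Restrict quanteda's own threading, if quanteda is available
if (requireNamespace("quanteda", quietly = TRUE)) {
  quanteda::quanteda_options(threads = 1)
}

# FORK is not available on Windows, so fall back to PSOCK there
# (illustrative guard; lookup_tokens() itself raises an error in that case)
socket <- if (.Platform$OS.type == "windows") "PSOCK" else "FORK"
socket
```

The resulting value can then be supplied as the socket argument of lookup_tokens().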
Examples
corp <- quanteda::corpus(c(
  doc1 = "Cats and dogs run quickly.",
  doc2 = "Markets and firms react to policy news."
))
toks <- tokenize_corpus(corp)
#>
#> ── Tokenizing corpus ──
#>
#> ℹ quanteda::tokens() has been called with default parameters
#> ℹ Tokenizing sequentially
#> ✔ Corpus successfully tokenized
dict <- quanteda::dictionary(list(
  animals = c("cat*", "dog*"),
  economics = c("market*", "firm*", "policy")
))
lookup_tokens(toks, dictionary = dict)
#> ℹ quanteda::tokens_lookup() has been called with default parameters
#>
#> ── Lookup tokens ──
#>
#> ℹ Lookup sequentially
#> ✔ Lookup complete
#> Tokens consisting of 2 documents.
#> doc1 :
#> [1] "animals" "animals"
#>
#> doc2 :
#> [1] "economics" "economics" "economics"
#>
