Fast Tokens Lookup — lookup_tokens • NLPstudio

Applies a dictionary to a quanteda::tokens() object in a single call to quanteda::tokens_lookup(), whose matcher runs in quanteda's multithreaded C++ core. On top of that core, this wrapper validates its input, logs the call, and limits quanteda's internal multithreading to the threads argument for the duration of the call.

Usage

lookup_tokens(
  x,
  threads = NULL,
  ncores = NULL,
  nchunks = NULL,
  socket = NULL,
  ...
)

Arguments

x: A quanteda::tokens object.
threads: Integer or NULL. Number of threads quanteda's internal (TBB) pool may use for this call; the previous setting is restored on exit. NULL (default) leaves quanteda::quanteda_options("threads") untouched. Because quanteda itself defaults to all available cores, the default is already parallel. See the Threading section of tokenize_corpus().
ncores, nchunks, socket: Deprecated since NLPstudio 1.2.0. The chunked process-level (PSOCK/FORK) backend has been removed because it duplicated quanteda's internal multithreading while adding cluster startup, serialization, and peak-memory overhead. Supplying any of these arguments raises a warning of class NLPstudio_deprecated. nchunks and socket are otherwise ignored; ncores is mapped to threads when threads is not given. These arguments will be removed in a future release.
...: Additional arguments passed to quanteda::tokens_lookup() (typically dictionary).
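
The scoping described for threads (set for this call, restored on exit) can be sketched with quanteda's public options interface. This is a minimal illustration under stated assumptions, not NLPstudio's actual implementation: with_quanteda_threads is a hypothetical helper name, and the sketch assumes quanteda is installed and that setting threads to 1 is permitted on the current machine.

```r
library(quanteda)

# Hypothetical helper (not part of NLPstudio): evaluate `code` with quanteda's
# thread count temporarily set to `threads`, then restore the previous value.
with_quanteda_threads <- function(threads, code) {
  if (!is.null(threads)) {
    old <- quanteda_options("threads")
    # Restore the previous setting even if `code` throws an error.
    on.exit(quanteda_options(threads = old), add = TRUE)
    quanteda_options(threads = threads)
  }
  code  # lazily evaluated here, while the temporary setting is active
}

toks <- tokens(c(doc1 = "Cats and dogs run quickly."))
dict <- dictionary(list(animals = c("cat*", "dog*")))

before <- quanteda_options("threads")
res <- with_quanteda_threads(1L, tokens_lookup(toks, dictionary = dict))
after <- quanteda_options("threads")

# The global option is the same before and after the scoped call.
identical(before, after)
```

Passing threads = NULL in this sketch skips the option change entirely, which mirrors the documented default of leaving quanteda::quanteda_options("threads") untouched.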

Value

A quanteda::tokens() object with lookups applied and documents in the same order as the input.

Details

The chunked process-level backend was removed in NLPstudio 1.2.0: besides duplicating quanteda's own parallelism, it re-serialized the dictionary to the workers once per chunk, which dominated runtime for large curated dictionaries. See the Threading section of tokenize_corpus() and vignette("performance-and-threading", package = "NLPstudio").
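
Because the deprecation warning carries the class NLPstudio_deprecated, calling code that still passes the old arguments can handle that class specifically while leaving other warnings alone. The following is a sketch in base R only: legacy_call is a hypothetical stand-in that mimics the documented condition class, and its message text is illustrative rather than the wording NLPstudio emits.

```r
# Hypothetical stand-in for a call that still passes a deprecated argument.
# It only mimics the documented condition class, not the real function body.
legacy_call <- function() {
  warning(warningCondition(
    "`ncores` is deprecated; use `threads` instead.",  # illustrative text
    class = "NLPstudio_deprecated"
  ))
  "result"
}

seen <- character()
out <- withCallingHandlers(
  legacy_call(),
  NLPstudio_deprecated = function(w) {
    seen <<- c(seen, conditionMessage(w))  # record the message for review
    invokeRestart("muffleWarning")         # silence only this warning class
  }
)
```

A calling handler is used rather than tryCatch() so that execution resumes after the warning and the function's return value is still obtained.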

Examples

corp <- quanteda::corpus(c(
  doc1 = "Cats and dogs run quickly.",
  doc2 = "Markets and firms react to policy news."
))
toks <- tokenize_corpus(corp)
#> 
#> ── Tokenizing corpus ──
#> 
#> ℹ quanteda::tokens() has been called with default parameters
#> ✔ Corpus successfully tokenized

dict <- quanteda::dictionary(list(
  animals = c("cat*", "dog*"),
  economics = c("market*", "firm*", "policy")
))

lookup_tokens(toks, dictionary = dict)
#> ℹ quanteda::tokens_lookup() has been called with default parameters
#> 
#> ── Lookup tokens ──
#> 
#> ✔ Lookup complete
#> Tokens consisting of 2 documents.
#> doc1 :
#> [1] "animals" "animals"
#> 
#> doc2 :
#> [1] "economics" "economics" "economics"
#>