Fast Corpus Tokenization — tokenize

Tokenize a quanteda::corpus() in a single call to quanteda::tokens(), whose C++ tokenizer is already multithreaded (Intel TBB). On top of the quanteda core, this wrapper validates its input, logs the call, verifies that every document survives tokenization, and scopes the number of quanteda threads to the duration of the call via the threads argument.

Usage

tokenize_corpus(
  x,
  threads = NULL,
  ncores = NULL,
  nchunks = NULL,
  socket = NULL,
  ...
)

Arguments

x: A quanteda::corpus() object.
threads: Integer or NULL. Number of threads quanteda's internal (TBB) pool may use for this call. The previous setting is restored on exit. NULL (default) leaves quanteda::quanteda_options("threads") untouched — quanteda itself defaults to all available cores, so the default is already parallel. See the Threading section.
ncores, nchunks, socket: Deprecated since NLPstudio 1.2.0. The chunked process-level (PSOCK/FORK) backend has been removed because it duplicated quanteda's internal multithreading while adding cluster startup, serialization, and peak-memory overhead. Supplying any of these arguments raises a warning of class NLPstudio_deprecated. nchunks and socket are ignored; ncores is mapped to threads when threads is not given. These arguments will be removed in a future release.
...: Additional arguments passed to quanteda::tokens().
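
Because the deprecation is signalled with a dedicated condition class, callers can handle it selectively instead of suppressing all warnings. The sketch below is not NLPstudio code: legacy_call() is a hypothetical stand-in that raises a warning of class NLPstudio_deprecated through base R's warningCondition(), and its message text is invented for illustration. Only the class name comes from the description above.

```r
# Hypothetical stand-in for a call that supplies a deprecated argument.
# Only the condition class name is taken from the documentation.
legacy_call <- function() {
  warning(warningCondition(
    "`ncores` is deprecated; use `threads` instead.",  # illustrative message
    class = "NLPstudio_deprecated"
  ))
  "result"
}

# Record and muffle only the deprecation warning; other warnings still surface.
seen <- character()
out <- withCallingHandlers(
  legacy_call(),
  NLPstudio_deprecated = function(w) {
    seen <<- c(seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

The same handler pattern should apply to a real tokenize_corpus() call that still passes ncores, nchunks, or socket, assuming the warning is raised as described.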

Value

A quanteda::tokens() object with one tokenized document per input document, in the same order as the input corpus.
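
The guarantee above can be checked from the outside with quanteda's own accessors. This is a minimal sketch of one possible check, assuming that "same number and order" means identical document counts and document names. It calls quanteda::tokens() directly and does not reproduce how tokenize_corpus() performs its internal verification, which is not documented here.

```r
library(quanteda)

# Named input so that document order is observable through docnames().
corp <- corpus(c(a = "Cats are running", b = "Dogs were barking"))
toks <- tokens(corp)

# Same number of documents, and the same names in the same order.
same_count <- ndoc(toks) == ndoc(corp)
same_order <- identical(docnames(toks), docnames(corp))
stopifnot(same_count, same_order)
```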

Threading

Up to NLPstudio 1.1.1, this function split the corpus into chunks and distributed them over a PSOCK cluster. Since quanteda 4, the tokenizer is itself parallel (one TBB pool per R process), so the chunked design paid for cluster startup, per-chunk serialization, and 2-3x peak memory without any speedup. Each worker could also oversubscribe the CPU with its own thread pool. The chunk layer is gone: one call, one thread pool.

Use threads to bound quanteda's parallelism for this call (for example on shared servers or inside your own outer parallel loops); leave it NULL to respect the session-wide quanteda::quanteda_options("threads") setting. See vignette("performance-and-threading", package = "NLPstudio") for how quanteda's TBB pool, BLAS, and data.table threads compose.
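
The scoping behaviour described for threads can be approximated with quanteda_options() and on.exit(). The helper below, with_quanteda_threads(), is hypothetical and belongs to neither NLPstudio nor quanteda. It is a sketch of the set-then-restore pattern, not the wrapper's actual implementation.

```r
library(quanteda)

# Hypothetical helper: run `expr` with a temporary quanteda thread count.
with_quanteda_threads <- function(threads, expr) {
  old <- quanteda_options("threads")
  # Restore the previous setting even if `expr` fails.
  on.exit(quanteda_options(threads = old), add = TRUE)
  quanteda_options(threads = threads)
  # `expr` is a lazy promise, so it is evaluated only here,
  # after the new thread count is in place.
  force(expr)
}

before <- quanteda_options("threads")
toks <- with_quanteda_threads(1L, tokens(corpus("Cats are running")))
after <- quanteda_options("threads")
```

Here before and after should hold the same value, because the session-wide setting is changed only while the expression is being evaluated.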

Examples

corp <- quanteda::corpus(
  c("Cats are running", "Dogs were barking")
)

toks <- tokenize_corpus(corp, remove_punct = TRUE)
#> 
#> ── Tokenizing corpus ──
#> 
#> ℹ quanteda::tokens() has been called with user parameters
#> → remove_punct = TRUE
#> ✔ Corpus successfully tokenized
toks
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "Cats"    "are"     "running"
#> 
#> text2 :
#> [1] "Dogs"    "were"    "barking"
#>