
Singularize tokens from a quanteda tokens object using a parallel hashing strategy and an internal English singularization rule set. Short tokens can optionally be removed.

Usage

singularize_tokens(
  x,
  ncores = 1,
  nchunks = ncores,
  socket = c("PSOCK", "FORK"),
  remove_numbers = TRUE,
  min_char = 1
)

Arguments

x

A quanteda::tokens object containing tokenized text.

ncores

Integer. Number of CPU cores to use for parallel processing. Defaults to 1 (sequential).

nchunks

Integer. Number of chunks to split the corpus into. Defaults to ncores. Setting nchunks > ncores can improve load balancing when documents vary in size. See Details.

socket

Character. Parallel backend to use. One of "PSOCK" (default, recommended) or "FORK". On Windows, "FORK" is not supported and will trigger an error.

remove_numbers

Logical. If TRUE (default), removes tokens that contain any digits. This avoids producing incorrect singular forms (e.g., "000s" → "000").

min_char

Integer. Minimum number of characters a token must have to be retained. Tokens shorter than this threshold are removed entirely. Defaults to 1.
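
As a rough illustration of how remove_numbers and min_char interact, the following base-R sketch applies both filters to a plain character vector. It is an approximation under stated assumptions: filter_tokens() is a hypothetical helper written for this example, not a function from the package, and the package's internal implementation may differ.

```r
# Hypothetical helper, for illustration only; not part of the package.
filter_tokens <- function(tokens, remove_numbers = TRUE, min_char = 1) {
  if (remove_numbers) {
    # Drop every token that contains at least one digit
    tokens <- tokens[!grepl("[0-9]", tokens)]
  }
  # Keep only tokens with at least min_char characters
  tokens[nchar(tokens) >= min_char]
}

kept <- filter_tokens(
  c("cats", "000s", "a", "b2b", "of", "houses"),
  min_char = 3
)
print(kept)
```

Here "000s" and "b2b" are dropped because they contain digits, and "a" and "of" are dropped because they are shorter than three characters.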

Value

A quanteda::tokens object with singularized tokens.

Details

The parallel strategy is described in more detail in tokenize_corpus().
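
To make the role of nchunks more concrete, the sketch below splits a small set of documents into chunks with parallel::splitIndices() and processes each chunk in turn. This only illustrates chunking in general and assumes nothing about the package internals: the docs list is invented toy data, and toupper() is a placeholder for the real per-chunk work.

```r
# Toy data standing in for tokenized documents (illustrative only)
docs <- list(
  doc1 = c("cats", "chase", "birds"),
  doc2 = c("cars", "pass", "houses"),
  doc3 = c("companies", "file", "reports"),
  doc4 = c("managers", "review", "numbers"),
  doc5 = c("dogs", "bark")
)

nchunks <- 3

# Split document indices into nchunks contiguous groups
chunks <- parallel::splitIndices(length(docs), nchunks)

# Process each chunk; toupper() is a placeholder for the real work.
# A parallel backend would map over `chunks` instead of using lapply().
processed <- lapply(chunks, function(idx) lapply(docs[idx], toupper))

# Recombine the chunks into one list, preserving document order and names
result <- do.call(c, processed)
print(names(result))
```

With more chunks than workers, a worker that finishes a small chunk early can pick up another one, which is the usual motivation for setting nchunks > ncores when a dynamic scheduler is used.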

Note

On Linux/macOS, "FORK" may be faster but can be unstable with quanteda’s C++/OpenMP internals. Use "PSOCK" for maximum stability. On Windows, "FORK" is not available.
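
One way to follow this advice in portable scripts is to select the backend based on the platform. The sketch below is only a suggestion: choose_socket() is a hypothetical helper, not part of the package, and it defaults to "PSOCK" unless "FORK" is explicitly requested on a non-Windows system.

```r
# Hypothetical helper, for illustration only; not part of the package.
choose_socket <- function(prefer_fork = FALSE) {
  on_windows <- .Platform$OS.type == "windows"
  # "FORK" is only an option off Windows, and only when explicitly requested
  if (prefer_fork && !on_windows) "FORK" else "PSOCK"
}

socket <- choose_socket()
print(socket)

# Possible usage, assuming `toks` is a quanteda tokens object:
# singularize_tokens(toks, ncores = 2, socket = choose_socket())
```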

Examples

if (interactive()) {
# Build a small two-document corpus
corp <- quanteda::corpus(c(
  doc1 = "Cats chase birds and cars pass houses.",
  doc2 = "Companies file reports and managers review numbers."
))
toks <- tokenize_corpus(corp)

# Singularize tokens and drop those shorter than three characters
singularize_tokens(toks, min_char = 3)
}