Fast Tokens Singularization — singularize

Singularize the tokens in a quanteda tokens object. An internal English singularization rule set is applied once per unique token type, and the results are mapped back with quanteda::tokens_replace(). By default, tokens that contain digits are removed; tokens shorter than min_char can optionally be removed as well.

Usage

singularize_tokens(
  x,
  remove_numbers = TRUE,
  min_char = 1,
  ncores = NULL,
  nchunks = NULL,
  socket = NULL
)

Arguments

x: A quanteda::tokens object containing tokenized text.
remove_numbers: Logical. If TRUE (default), removes tokens that contain any digits. This avoids producing incorrect singular forms (e.g., "000s" → "000").
min_char: Integer. Minimum number of characters a token must have to be retained. Tokens shorter than this threshold are removed entirely. Defaults to 1.
ncores, nchunks, socket: Deprecated since NLPstudio 1.2.0 and ignored: singularization is vocabulary-level (one rule pass over the unique token types), for which worker processes cost more than they save. Supplying any of them raises a warning of class NLPstudio_deprecated.
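
The argument behavior described above can be illustrated with a minimal base R sketch. filter_stub() is a hypothetical stand-in written for this illustration, not the package's implementation: it works on a plain character vector, performs no singularization, and assumes a single warning is raised when any deprecated argument is supplied. Only the condition class NLPstudio_deprecated is taken from the documentation.

```r
# Hypothetical stand-in that mimics the documented argument handling only.
# It filters a plain character vector and does not singularize anything.
filter_stub <- function(tokens, remove_numbers = TRUE, min_char = 1,
                        ncores = NULL, nchunks = NULL, socket = NULL) {
  # Assumption: one classed warning if any deprecated argument is supplied.
  if (!is.null(ncores) || !is.null(nchunks) || !is.null(socket)) {
    warning(warningCondition(
      "`ncores`, `nchunks` and `socket` are deprecated and ignored.",
      class = "NLPstudio_deprecated"
    ))
  }
  # remove_numbers = TRUE drops every token that contains a digit.
  if (remove_numbers) {
    tokens <- tokens[!grepl("[0-9]", tokens)]
  }
  # min_char drops tokens shorter than the threshold.
  tokens[nchar(tokens) >= min_char]
}

tokens <- c("000s", "a", "of", "cats", "mp3s", "houses")
kept <- filter_stub(tokens, min_char = 3)

# Legacy calls that still pass ncores can record and muffle the classed warning.
messages <- character()
same <- withCallingHandlers(
  filter_stub(tokens, min_char = 3, ncores = 2),
  NLPstudio_deprecated = function(w) {
    messages <<- c(messages, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

Handling the warning by class rather than by message text keeps the handler working if the wording of the message changes.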

Value

A quanteda::tokens object with singularized tokens. Tokens removed by remove_numbers or min_char are absent from the result.

Details

The rule set is applied to the vocabulary (unique token types), not to every token occurrence, so the cost scales with vocabulary size rather than with the number of token occurrences. Since NLPstudio 1.2.0 the vocabulary is read directly from the tokens object with quanteda::types(); earlier versions built a full document-feature matrix just to list it. The rules are evaluated as vectorized passes over the whole vocabulary. Because quanteda::types() preserves case, mixed-case plurals (e.g., "Companies") are now matched and singularized with their case shape restored; the previous implementation lower-cased the vocabulary and silently skipped them.
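
The vocabulary-level approach can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: singularize_types() and its two toy rules are hypothetical (the internal rule set is not listed on this page), and plain character vectors stand in for a tokens object. With quanteda, the vocabulary would come from quanteda::types() and the mapping back would be done with quanteda::tokens_replace().

```r
# Hypothetical toy rule set; the package's internal rules are not shown here.
singularize_types <- function(types) {
  lower <- tolower(types)
  singular <- lower

  # Each rule is a single vectorized pass over the whole vocabulary.
  ies <- grepl("[^aeiou]ies$", lower)            # companies -> company
  singular[ies] <- sub("ies$", "y", lower[ies])
  s <- !ies & grepl("[^s]s$", lower)             # cats -> cat, but not pass
  singular[s] <- sub("s$", "", lower[s])

  # Restore the case shape of changed types; unchanged types stay as they are.
  changed  <- singular != lower
  is_upper <- changed & types == toupper(types)
  is_title <- changed & !is_upper &
    substr(types, 1, 1) != substr(lower, 1, 1)
  is_other <- changed & !is_upper & !is_title

  out <- types
  out[is_upper] <- toupper(singular[is_upper])
  out[is_title] <- paste0(
    toupper(substr(singular[is_title], 1, 1)),
    substring(singular[is_title], 2)
  )
  out[is_other] <- singular[is_other]
  out
}

# Plain character vectors stand in for tokenized documents.
docs <- list(
  doc1 = c("Cats", "chase", "birds", "and", "cars", "pass", "houses"),
  doc2 = c("Companies", "file", "reports", "and", "managers", "review", "numbers")
)

# The rules run once per unique type, then results are mapped back by lookup.
vocab <- unique(unlist(docs, use.names = FALSE))
singular <- singularize_types(vocab)
result <- lapply(docs, function(tok) singular[match(tok, vocab)])
```

The rule passes touch each unique type once; only the final lookup depends on how many token occurrences the documents contain.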

Examples

if (FALSE) { # interactive()
corp <- quanteda::corpus(c(
  doc1 = "Cats chase birds and cars pass houses.",
  doc2 = "Companies file reports and managers review numbers."
))
toks <- tokenize_corpus(corp)

singularize_tokens(toks, min_char = 3)

}