Singularize tokens from a quanteda tokens object using a parallel hashing strategy and an internal English singularization rule set. Short tokens can optionally be removed.
Usage
singularize_tokens(
  x,
  ncores = 1,
  nchunks = ncores,
  socket = c("PSOCK", "FORK"),
  remove_numbers = TRUE,
  min_char = 1
)
Arguments
- x
A quanteda::tokens object containing tokenized text.
- ncores
Integer. Number of CPU cores to use for parallel processing. Defaults to 1 (sequential).
- nchunks
Integer. Number of chunks to split the corpus into. Defaults to ncores. Setting nchunks > ncores can improve load balancing when documents vary in size. See Details.
- socket
Character. Parallel backend to use. One of "PSOCK" (default, recommended) or "FORK". On Windows, "FORK" is not supported and will trigger an error.
- remove_numbers
Logical. If TRUE (default), removes tokens that contain any digits. This avoids producing incorrect singular forms (e.g., "000s" → "000").
- min_char
Integer. Minimum number of characters a token must have to be retained. Tokens shorter than this threshold are removed entirely. Defaults to 1.
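The effect of remove_numbers and min_char can be illustrated on a plain character vector. The following is a minimal base R sketch, not the package's implementation: filter_tokens_sketch is a hypothetical helper, and it assumes "contains any digits" means a match on [0-9] and that token length is counted with nchar(). It only filters and does not singularize.

```r
# Hypothetical helper (not part of the package): filters a plain character
# vector the way remove_numbers and min_char are described in Arguments.
filter_tokens_sketch <- function(tokens, remove_numbers = TRUE, min_char = 1L) {
  # Keep tokens with at least min_char characters
  keep <- nchar(tokens) >= min_char
  if (remove_numbers) {
    # Drop any token containing at least one digit (assumption: ASCII 0-9)
    keep <- keep & !grepl("[0-9]", tokens)
  }
  tokens[keep]
}

tokens <- c("cats", "a", "000s", "covid19", "of", "houses")
filtered <- filter_tokens_sketch(tokens, remove_numbers = TRUE, min_char = 3L)
print(filtered)
```

With min_char = 3, "a" and "of" are dropped for length, and "000s" and "covid19" are dropped because they contain digits.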
Value
A quanteda::tokens object with singularized tokens.
Details
See tokenize_corpus() for details on the parallel strategy.
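As a rough illustration of why nchunks > ncores can help, the sketch below splits document indices into more chunks than workers and hands them out dynamically. It is an assumption about the general technique, not a description of the package internals: doc_lengths is made-up illustrative data, and the sketch uses only the base parallel package (splitIndices, makeCluster, clusterApplyLB).

```r
library(parallel)

# Illustrative document sizes (made-up values), deliberately uneven
doc_lengths <- c(5L, 120L, 8L, 300L, 12L, 40L, 7L, 90L)
ncores  <- 2L
nchunks <- 4L  # more chunks than workers

# Split document indices into nchunks contiguous groups
chunks <- splitIndices(length(doc_lengths), nchunks)

# PSOCK cluster; clusterApplyLB gives the next chunk to whichever worker
# becomes free first, so smaller chunks even out uneven workloads
cl <- makeCluster(ncores, type = "PSOCK")
chunk_totals <- tryCatch(
  clusterApplyLB(cl, chunks, function(idx, lens) sum(lens[idx]), lens = doc_lengths),
  finally = stopCluster(cl)
)
print(unlist(chunk_totals))
```

With nchunks equal to ncores, each worker receives exactly one chunk, so a worker that draws the largest documents finishes last while the other sits idle.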
Note
On Linux/macOS, "FORK" may be faster but can be unstable with
quanteda’s C++/OpenMP internals. Use "PSOCK" for maximum stability. On
Windows, "FORK" is not available.
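One way to avoid the Windows error is to pick the backend from the platform before calling the function. This is a minimal sketch: choose_socket is a hypothetical helper that is not part of the package, and it relies only on base R's .Platform$OS.type.

```r
# Hypothetical helper (not part of the package): returns "FORK" only when
# explicitly requested and the platform is not Windows, otherwise "PSOCK".
choose_socket <- function(prefer_fork = FALSE) {
  on_windows <- .Platform$OS.type == "windows"
  if (prefer_fork && !on_windows) "FORK" else "PSOCK"
}

socket <- choose_socket()
print(socket)

# The result could then be passed on, for example:
# singularize_tokens(toks, ncores = 2, socket = socket)
```

Defaulting to "PSOCK" follows the stability recommendation in the note above; "FORK" stays an explicit opt-in.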
Examples
if (interactive()) {
  # Build a small two-document corpus
  corp <- quanteda::corpus(c(
    doc1 = "Cats chase birds and cars pass houses.",
    doc2 = "Companies file reports and managers review numbers."
  ))
  # Tokenize, then singularize, dropping tokens shorter than 3 characters
  toks <- tokenize_corpus(corp)
  singularize_tokens(toks, min_char = 3)
}
