Fast Corpus Parsing via spaCy

Parse a corpus in parallel using backends from the parallel package. This function is a wrapper around spacy_parse(), so a working installation of the spacyr package is required. Refer to the installation guide to troubleshoot issues.

Usage

parse_corpus(x, ncores = 1, nchunks = ncores, socket = c("PSOCK", "FORK"), ...)

Arguments

x: A quanteda corpus.
ncores: Integer. Number of worker processes used to parse chunks. Defaults to 1 (sequential). Unlike the quanteda-backed verbs, which now thread internally (see tokenize_corpus()), spaCy runs as an external single-threaded pipeline, so process-level parallelism is retained here as the appropriate strategy.
nchunks: Integer. Number of chunks to split the corpus into. Defaults to ncores. Setting nchunks > ncores can improve load balancing when documents vary in size.
socket: Character. Parallel backend to use. One of "PSOCK" (default, recommended) or "FORK". "FORK" is not supported on Windows and triggers an error there.
...: Additional arguments passed to spacy_parse().
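
The interplay of ncores, nchunks, and socket can be illustrated with base R alone. The following is a minimal sketch, not the implementation of parse_corpus(): the toy documents and the count_tokens() helper are hypothetical stand-ins for a quanteda corpus and for spacy_parse(), and the use of parallel::splitIndices() and parallel::clusterApplyLB() is an assumption about one reasonable way to chunk and dispatch work.

```r
library(parallel)

# Hypothetical stand-in for a corpus: ten named documents of uneven length.
docs <- vapply(1:10, function(i) paste(rep("word", i * 5), collapse = " "),
               character(1))
names(docs) <- paste0("doc", seq_along(docs))

ncores  <- 2L
nchunks <- 4L  # more chunks than workers, so a free worker can pick up the next chunk

# Split the document indices into nchunks roughly equal, contiguous groups.
idx    <- splitIndices(length(docs), nchunks)
chunks <- lapply(idx, function(i) docs[i])

# Hypothetical per-chunk worker (NOT spacy_parse()): counts the
# space-separated tokens of each document in the chunk.
count_tokens <- function(chunk) {
  data.frame(
    doc_id   = names(chunk),
    n_tokens = unname(lengths(strsplit(chunk, " ", fixed = TRUE))),
    stringsAsFactors = FALSE
  )
}

# "PSOCK" launches fresh R sessions and is available on every platform;
# "FORK" is only available on Unix-alikes.
cl  <- makeCluster(ncores, type = "PSOCK")
res <- tryCatch(
  clusterApplyLB(cl, chunks, count_tokens),  # load-balanced dispatch
  finally = stopCluster(cl)                  # always release the workers
)

# Results come back in chunk order and are bound into a single table.
out <- do.call(rbind, res)
```

With four chunks and two workers, a worker that finishes a short chunk early can start on the next one instead of idling, which is the load-balancing effect described for nchunks > ncores.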

Value

A data.table with one row per token, carrying the annotations produced by spacy_parse().

Details

parse_corpus() delegates the actual parsing to spacy_parse(), so all of its usual arguments can be passed to parse_corpus() as well. A proper installation of the spaCy library and all of its components is required. parse_corpus() does not initialize a spaCy session itself, so call spacy_initialize() beforehand.

In particular, any language model currently supported by version 3.7 can be selected via the model argument of spacy_initialize(). By default, spacy_install() downloads and uses the smallest English model, en_core_web_sm. It is recommended to use spacy_download_langmodel() to download and activate the desired model.

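On exit, parse_corpus() finalizes the active spaCy session by calling spacy_finalize() through on.exit(). If no session is active, parse_corpus() errors on exit. Because the session is closed when the function returns, call spacy_initialize() again before any subsequent call.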
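
The exit behaviour described here follows the standard on.exit() cleanup pattern. The sketch below reproduces that pattern in base R under stated assumptions: the session environment, start_session(), end_session(), and parse_with_cleanup() are all hypothetical names that stand in for the spaCy session, spacy_initialize(), spacy_finalize(), and parse_corpus(); none of them is part of spacyr or of this package.

```r
# Hypothetical session registry standing in for the spaCy session state.
session <- new.env()
session$active <- FALSE

# Hypothetical analogue of an initializer.
start_session <- function() session$active <- TRUE

# Hypothetical analogue of a finalizer that refuses to run without a session.
end_session <- function() {
  if (!session$active) stop("Nothing to finalize: no active session.")
  session$active <- FALSE
}

# Hypothetical analogue of the parsing function: cleanup is registered
# first, so it runs whether the body succeeds or fails.
parse_with_cleanup <- function(x) {
  on.exit(end_session(), add = TRUE)
  toupper(x)  # placeholder for the real work
}

start_session()
result <- parse_with_cleanup(c("a", "b"))  # succeeds, then closes the session

# A second call without a new session fails in the exit handler.
second <- tryCatch(
  parse_with_cleanup("c"),
  error = function(e) conditionMessage(e)
)
```

In this sketch the second call fails only because no session was started again, which mirrors the need to initialize a session before each call.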

Note

Although parsing can be parallelized across multiple CPU cores, memory usage grows quickly with both the number of cores and the size of the corpus. On large corpora, allocating too many workers may exhaust available RAM and significantly slow down or even terminate the process. It is recommended to increase ncores gradually and monitor memory consumption.
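
One way to follow this advice is to fix a short ladder of worker counts in advance and move up one step at a time while watching memory. The sketch below is only an illustration: the doubling schedule, the choice to leave one core free, and the toy docs vector are assumptions, not recommendations made by this package.

```r
library(parallel)

# Number of physical cores; detectCores() can return NA on some platforms.
physical <- detectCores(logical = FALSE)
if (is.na(physical)) physical <- 1L

# Assumption: keep one core free for the main R session.
max_workers <- max(1L, physical - 1L)

# Doubling ladder of candidate ncores values, capped at max_workers.
ladder <- unique(as.integer(pmin(2^(0:10), max_workers)))

# Hypothetical corpus stand-in, used to get a rough idea of the input size
# before deciding how far up the ladder to go.
docs    <- rep("some text to be parsed", 1000)
size_mb <- as.numeric(object.size(docs)) / 1024^2
```

Each value in ladder can then be tried as ncores in turn, stopping at the last step where memory consumption stays comfortable.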

The returned data.table can contain a very large number of rows when x is large, which also affects memory usage and downstream processing.

Author

Francesco Grossetti francesco.grossetti@unibocconi.it

Examples

if (FALSE) { # interactive()
# Requires the optional spacyr package, a local spaCy installation,
# and an initialized language model.
corp <- quanteda::corpus(c(
  doc1 = "This is a simple example sentence.",
  doc2 = "NLPstudio can parse corpora in parallel."
))

spacyr::spacy_initialize()
parsed <- parse_corpus(corp, lemma = TRUE, pos = TRUE)
head(parsed)
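# No explicit spacyr::spacy_finalize() is needed here: as described in
# Details, parse_corpus() finalizes the session on exit.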
}