Reshape a quanteda::corpus() into smaller units (typically sentences or
paragraphs) using parallel backends from the parallel package.
By default, reshaping is parallelized with PSOCK clusters
(parallel::clusterApplyLB()) for cross-platform stability and dynamic
load balancing. Optionally, FORK-based parallelism
(parallel::mclapply()) may be requested on Linux/macOS, but this can
lead to instability with quanteda (see Note).
Usage
reshape_corpus(
  x,
  to = "sentences",
  ncores = 1,
  nchunks = ncores,
  socket = c("PSOCK", "FORK"),
  ...
)
Arguments
- x
A quanteda::corpus() object.
- to
Character. Reshape target, passed to quanteda::corpus_reshape(). Defaults to "sentences".
- ncores
Integer. Number of CPU cores to use for parallel processing. Defaults to 1 (sequential).
- nchunks
Integer. Number of chunks to split the corpus into. Defaults to ncores. Setting nchunks > ncores can improve load balancing when documents vary in size. See Details.
- socket
Character. Parallel backend to use. One of "PSOCK" (default, recommended) or "FORK". On Windows, "FORK" is not supported and will trigger an error.
- ...
Additional arguments passed to quanteda::corpus_reshape().
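To illustrate why nchunks > ncores can help when documents vary in size, here is a simplified base R simulation. It is only a sketch: the helpers chunk_loads() and simulate_makespan() are hypothetical, the contiguous chunk assignment is an assumption rather than the package's documented behavior, and document sizes stand in for processing time.

```r
# Hypothetical helper: sum document sizes per chunk, assigning documents
# to chunks in contiguous blocks (an assumption for this illustration).
chunk_loads <- function(sizes, nchunks) {
  id <- sort(rep_len(seq_len(nchunks), length(sizes)))
  as.numeric(tapply(sizes, id, sum))
}

# Hypothetical helper: simulate dynamic load balancing, where each chunk
# goes to whichever worker becomes free first.
simulate_makespan <- function(loads, ncores) {
  busy_until <- numeric(ncores)  # time at which each worker becomes free
  for (load in loads) {
    w <- which.min(busy_until)
    busy_until[w] <- busy_until[w] + load
  }
  max(busy_until)                # total wall time of the simulated run
}

doc_sizes <- c(60, 50, 5, 5)     # two large and two small documents

coarse <- simulate_makespan(chunk_loads(doc_sizes, nchunks = 2), ncores = 2)
fine   <- simulate_makespan(chunk_loads(doc_sizes, nchunks = 4), ncores = 2)
c(coarse = coarse, fine = fine)
```

With two chunks, both large documents land in the same chunk and one worker does almost all the work. With four chunks, the large documents are processed on separate workers and the simulated wall time drops.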
Value
A reshaped quanteda::corpus() with the same document variables
and reshaped text units as defined by to.
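As a rough illustration of the return value, the sketch below calls quanteda::corpus_reshape() directly, on the assumption that reshape_corpus() returns the same kind of object. The author document variable is made up for the example.

```r
library(quanteda)

# Two documents with a hypothetical document variable "author".
corp <- corpus(
  c(doc1 = "First sentence. Second sentence.",
    doc2 = "Another document. With two parts."),
  docvars = data.frame(author = c("A", "B"))
)

sent <- corpus_reshape(corp, to = "sentences")

ndoc(sent)               # number of sentence-level documents
docvars(sent, "author")  # each sentence inherits its parent document's value
```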
Details
See tokenize_corpus() for a more detailed discussion of the parallel strategy.
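For orientation, a chunked PSOCK strategy with dynamic load balancing could look roughly like the sketch below. This is not the package's actual implementation: the chunking scheme, the variable names and the way results are recombined are all assumptions, and only functions from parallel and quanteda are used.

```r
library(parallel)
library(quanteda)

# Stand-in input: a named character vector of documents.
texts <- c(
  doc1 = "First sentence. Second sentence.",
  doc2 = "Another document. With two parts.",
  doc3 = "A third one.",
  doc4 = "And a fourth. It has two sentences."
)

ncores  <- 2L
nchunks <- 4L  # more chunks than cores, so free workers can pick up remaining chunks

# Assign documents to chunks in contiguous blocks (an assumption).
chunk_id <- sort(rep_len(seq_len(nchunks), length(texts)))
chunks   <- split(texts, chunk_id)

cl <- makeCluster(ncores, type = "PSOCK")
# clusterApplyLB() hands the next chunk to whichever worker finishes first
# and returns the results in the order of the input chunks.
res <- clusterApplyLB(cl, chunks, function(chunk) {
  quanteda::corpus_reshape(quanteda::corpus(chunk), to = "sentences")
})
stopCluster(cl)

# Recombine the per-chunk corpora into a single corpus.
reshaped <- do.call(c, unname(res))
ndoc(reshaped)
```

PSOCK workers start as fresh R sessions, which is why the worker function refers to quanteda through its namespace instead of relying on packages attached in the main session.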
Note
By default, socket = "PSOCK". Using socket = "FORK" on Linux/macOS
may be faster but is discouraged when processing large corpora with
quanteda, as it can lead to undefined behavior. If you insist on using
socket = "FORK", consider setting the environment variable
OMP_NUM_THREADS=1 and/or calling quanteda::quanteda_options(threads = 1)
to reduce conflicts.
On Windows, setting socket = "FORK" will result in an error.
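The thread-limiting settings mentioned above could be applied as in the following sketch before requesting FORK-based parallelism. The Windows check is a hypothetical guard written for this example, not the package's own code, and the final call is left commented out because it needs the package that provides reshape_corpus() to be attached.

```r
# Limit OpenMP threading in this session and in any forked children.
Sys.setenv(OMP_NUM_THREADS = "1")

# Limit quanteda's own multithreading.
quanteda::quanteda_options(threads = 1)

# Hypothetical guard: forking is not available on Windows.
fork_available <- .Platform$OS.type != "windows"

if (fork_available) {
  # reshape_corpus(corp, to = "sentences", ncores = 2, socket = "FORK")
  message("FORK backend can be requested on this platform")
} else {
  message("Use socket = \"PSOCK\" on Windows")
}
```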
Examples
corp <- quanteda::corpus(c(
  doc1 = "First sentence. Second sentence.",
  doc2 = "Another document. With two parts."
))
reshape_corpus(corp, to = "sentences")
#>
#> ── Reshaping corpus ──
#>
#> ℹ Reshaping sequentially
#> ✔ Corpus successfully reshaped
#> Corpus consisting of 4 documents.
#> doc1.1 :
#> "First sentence."
#>
#> doc1.2 :
#> "Second sentence."
#>
#> doc2.1 :
#> "Another document."
#>
#> doc2.2 :
#> "With two parts."
#>
