
Reshape a quanteda::corpus() into smaller units (typically sentences or paragraphs) using parallel backends from the parallel package. By default, reshaping is parallelized with PSOCK clusters (parallel::clusterApplyLB()) for cross-platform stability and dynamic load balancing. Optionally, FORK-based parallelism (parallel::mclapply()) may be requested on Linux/macOS, but this can lead to instability with quanteda (see Note).

Usage

reshape_corpus(
  x,
  to = "sentences",
  ncores = 1,
  nchunks = ncores,
  socket = c("PSOCK", "FORK"),
  ...
)

Arguments

x

A quanteda::corpus() object.

to

Character. Reshape target, passed to quanteda::corpus_reshape(). Defaults to "sentences".

ncores

Integer. Number of CPU cores to use for parallel processing. Defaults to 1 (sequential).

nchunks

Integer. Number of chunks to split the corpus into. Defaults to ncores. Setting nchunks > ncores can improve load balancing when documents vary in size. See Details.

socket

Character. Parallel backend to use. One of "PSOCK" (default, recommended) or "FORK". On Windows, "FORK" is not supported and will trigger an error.

...

Additional arguments passed to quanteda::corpus_reshape().
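
The socket argument's behaviour can be illustrated with a small sketch. The helper name check_socket() is hypothetical and is not part of the package; it only shows one plausible way to resolve the default and reject "FORK" on Windows, assuming the usual match.arg() idiom implied by the signature in Usage.

```r
# Hypothetical helper (not the package's actual code): resolve and validate
# the `socket` argument as described in the Arguments section.
check_socket <- function(socket = c("PSOCK", "FORK")) {
  # match.arg() picks the first choice ("PSOCK") when the argument is missing
  # and raises an error for values outside the allowed set.
  socket <- match.arg(socket)

  # FORK relies on process forking, which is unavailable on Windows.
  if (socket == "FORK" && .Platform$OS.type == "windows") {
    stop("socket = \"FORK\" is not supported on Windows; use \"PSOCK\".",
         call. = FALSE)
  }
  socket
}

check_socket()
```

Called without arguments, the helper falls back to "PSOCK", matching the documented default.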

Value

A reshaped quanteda::corpus() with the same document variables and reshaped text units as defined by to.

Details

The parallel strategy is described in more detail in tokenize_corpus().
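
As a rough illustration of the chunked PSOCK strategy, the sketch below splits a toy corpus into more chunks than workers and reshapes each chunk with parallel::clusterApplyLB(). It is a minimal sketch under stated assumptions, not the function's actual implementation: the contiguous chunk assignment and the variable names (chunk_id, chunks, parts) are illustrative choices, and the quanteda package must be installed on the workers.

```r
library(parallel)

# Toy corpus standing in for the `x` argument.
corp <- quanteda::corpus(c(
  doc1 = "First sentence. Second sentence.",
  doc2 = "Another document. With two parts.",
  doc3 = "A third one. It has three sentences. Here is the last."
))

ncores  <- 2
nchunks <- 3  # more chunks than workers, so a free worker can take the next chunk

# Assumed chunking scheme: contiguous blocks of near-equal document counts.
n        <- quanteda::ndoc(corp)
chunk_id <- sort(rep_len(seq_len(nchunks), n))
chunks   <- lapply(split(seq_len(n), chunk_id), function(i) corp[i])

# PSOCK workers are separate R sessions, so quanteda is called with `::`.
cl <- makeCluster(ncores, type = "PSOCK")
parts <- tryCatch(
  clusterApplyLB(cl, chunks, function(chunk) {
    quanteda::corpus_reshape(chunk, to = "sentences")
  }),
  finally = stopCluster(cl)  # always release the workers
)

# Combine the reshaped chunks back into a single corpus.
reshaped <- do.call(c, unname(parts))
quanteda::ndoc(reshaped)
```

With nchunks greater than ncores, a worker that finishes a small chunk early is handed another one, which is the load-balancing effect the nchunks argument is meant to exploit.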

Note

By default, socket = "PSOCK". Using socket = "FORK" on Linux/macOS may be faster but is discouraged when processing large corpora with quanteda, as it can lead to undefined behavior. If you insist on using socket = "FORK", consider setting the environment variable OMP_NUM_THREADS=1 and/or calling quanteda_options(threads = 1) to reduce conflicts. On Windows, setting socket = "FORK" will result in an error.
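
The thread settings mentioned above could be applied as follows. This is a hedged sketch: it assumes the settings are made in the main R session before any forked workers are started, and whether they fully avoid the instability is not guaranteed.

```r
# Restrict OpenMP to a single thread; forked workers inherit the environment.
Sys.setenv(OMP_NUM_THREADS = "1")

# Ask quanteda to use a single thread for its own multithreaded routines.
quanteda::quanteda_options(threads = 1)

# Inspect the values currently in effect.
Sys.getenv("OMP_NUM_THREADS")
quanteda::quanteda_options("threads")
```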

Examples

corp <- quanteda::corpus(c(
  doc1 = "First sentence. Second sentence.",
  doc2 = "Another document. With two parts."
))

reshape_corpus(corp, to = "sentences")
#> 
#> ── Reshaping corpus ──
#> 
#>  Reshaping sequentially
#>  Corpus successfully reshaped
#> Corpus consisting of 4 documents.
#> doc1.1 :
#> "First sentence."
#> 
#> doc1.2 :
#> "Second sentence."
#> 
#> doc2.1 :
#> "Another document."
#> 
#> doc2.2 :
#> "With two parts."
#>