Fast Corpus Reshape — reshape_corpus • NLPstudio

Reshape a quanteda::corpus() into smaller units (typically sentences or paragraphs) in a single call to quanteda::corpus_reshape(). On top of the quanteda core, this wrapper validates its input, logs progress, and scopes quanteda's internal multithreading to the call via the threads argument.

Usage

reshape_corpus(
  x,
  to = "sentences",
  threads = NULL,
  ncores = NULL,
  nchunks = NULL,
  socket = NULL,
  ...
)

Arguments

x: A quanteda::corpus() object.
to: Character. Reshape target, passed to quanteda::corpus_reshape(). Defaults to "sentences".
threads: Integer or NULL. Number of threads quanteda's internal (TBB) pool may use for this call. The previous setting is restored on exit. NULL (default) leaves quanteda::quanteda_options("threads") untouched; quanteda itself defaults to all available cores, so the default is already parallel. See the Threading section.
ncores, nchunks, socket: Deprecated since NLPstudio 1.2.0. The chunked process-level (PSOCK/FORK) backend has been removed because it duplicated quanteda's internal multithreading while adding cluster startup, serialization, and peak-memory overhead. nchunks and socket are ignored; ncores is mapped to threads when threads is not given. Supplying any of the three raises a warning of class NLPstudio_deprecated. These arguments will be removed in a future release.
...: Additional arguments passed to quanteda::corpus_reshape().
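
The scoping behaviour described for threads (set for the call, restored on exit) can be sketched with quanteda alone. This is a minimal illustration, not NLPstudio's actual implementation: with_quanteda_threads() is a hypothetical helper introduced here, and it assumes the quanteda package is installed.

```r
library(quanteda)

# Hypothetical helper (not part of NLPstudio or quanteda): evaluate `expr`
# with quanteda's thread option set to `threads`, then restore the old value.
with_quanteda_threads <- function(threads, expr) {
  if (!is.null(threads)) {
    old <- quanteda_options("threads")
    # Restore the previous setting even if `expr` fails.
    on.exit(quanteda_options(threads = old), add = TRUE)
    quanteda_options(threads = threads)
  }
  # `expr` is a lazy promise, so it is evaluated here, under the new setting.
  force(expr)
}

corp <- corpus(c(
  doc1 = "First sentence. Second sentence.",
  doc2 = "Another document. With two parts."
))

before <- quanteda_options("threads")
sents  <- with_quanteda_threads(1L, corpus_reshape(corp, to = "sentences"))
after  <- quanteda_options("threads")

ndoc(sents)      # number of sentence-level documents
before == after  # the previous thread setting is back in place
```

Passing NULL skips the option change entirely, which mirrors the documented default of leaving quanteda_options("threads") untouched.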

Value

A quanteda::corpus() whose text units are reshaped as defined by to, with the original document variables retained.

Details

The chunked process-level backend was removed in NLPstudio 1.2.0; see the Threading section of tokenize_corpus() and vignette("performance-and-threading", package = "NLPstudio"). Reshaping in one call also guarantees globally consistent segment numbering, which chunked reshaping could not.
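
Because the deprecation warning carries the class NLPstudio_deprecated, callers migrating old code can handle it selectively. The sketch below uses base R only. legacy_call() is a hypothetical stand-in for a call such as reshape_corpus(corp, ncores = 2), and the warning message text is made up for illustration; only the class name comes from this page.

```r
# Hypothetical stand-in for a call that still passes a deprecated argument:
# it signals a warning of class NLPstudio_deprecated, then returns a value.
legacy_call <- function() {
  warning(structure(
    class = c("NLPstudio_deprecated", "warning", "condition"),
    list(message = "`ncores` is deprecated; use `threads`.", call = NULL)
  ))
  "result"
}

seen <- character()
out <- withCallingHandlers(
  legacy_call(),
  NLPstudio_deprecated = function(w) {
    seen <<- c(seen, conditionMessage(w))  # record the message for later review
    invokeRestart("muffleWarning")         # silence this class of warning only
  }
)
```

Other warnings raised during the call are not matched by the handler and still surface as usual, so this approach is narrower than wrapping the call in suppressWarnings().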

Examples

corp <- quanteda::corpus(c(
  doc1 = "First sentence. Second sentence.",
  doc2 = "Another document. With two parts."
))

reshape_corpus(corp, to = "sentences")
#> 
#> ── Reshaping corpus ──
#> 
#> ✔ Corpus successfully reshaped
#> Corpus consisting of 4 documents.
#> doc1.1 :
#> "First sentence."
#> 
#> doc1.2 :
#> "Second sentence."
#> 
#> doc2.1 :
#> "Another document."
#> 
#> doc2.2 :
#> "With two parts."
#>