Fast Calculation of Similarity and Distance Measures — calculate

Compute similarity and distance measures in a single call to textstat_simil or textstat_dist, whose pairwise kernels run in quanteda's multithreaded C++ core. On top of that core, these wrappers validate their input, log the call, and scope quanteda's internal multithreading via the threads argument.

Usage

calculate_similarity(x, threads = NULL, ncores = NULL, ...)

calculate_distance(x, threads = NULL, ncores = NULL, ...)

Arguments

x: A quanteda dfm object.
threads: Integer or NULL. Number of threads quanteda's internal (TBB) pool may use for this call. The previous setting is restored on exit. NULL (default) leaves quanteda::quanteda_options("threads") untouched; because quanteda itself defaults to all available cores, the default is already parallel. See Details.
ncores: Deprecated since NLPstudio 1.2.0: it was an alias for what is now threads. Supplying it raises a warning of class NLPstudio_deprecated and maps its value to threads when threads is not given.
...: Additional arguments passed to textstat_simil or textstat_dist.
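
The interplay between threads and the deprecated ncores can be sketched in base R. This is only an illustration of the behaviour described above, not NLPstudio's actual code: the helper name resolve_threads and the warning text are assumptions, and only the condition class NLPstudio_deprecated comes from this page.

```r
# Hypothetical helper (not part of NLPstudio): illustrates how `ncores`
# could be mapped onto `threads` while signalling a classed warning.
resolve_threads <- function(threads = NULL, ncores = NULL) {
  if (!is.null(ncores)) {
    # A classed warning lets callers catch this deprecation selectively.
    warning(warningCondition(
      "`ncores` is deprecated; use `threads` instead.",  # assumed wording
      class = "NLPstudio_deprecated"
    ))
    # `ncores` is only used when `threads` was not supplied.
    if (is.null(threads)) threads <- ncores
  }
  if (is.null(threads)) return(NULL)  # NULL means "leave the session setting"
  as.integer(threads)
}

# Handle the deprecation by class while still receiving the mapped value.
seen <- character()
resolved <- withCallingHandlers(
  resolve_threads(ncores = 2),
  NLPstudio_deprecated = function(w) {
    seen <<- c(seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
```

Catching by class rather than by message text keeps calling code robust if the wording of the warning changes.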

Value

An S4 object of class textstat_simil or textstat_dist, as returned by the underlying quanteda.textstats function; these classes extend matrix classes from the Matrix package.

Details

Parallelism for all-pairs similarity/distance lives at the linear-algebra level inside quanteda's C++ core; splitting documents across worker processes cannot work (cross-chunk pairs would never be evaluated). threads bounds quanteda's thread pool for the duration of the call; NULL (default) respects the session-wide quanteda::quanteda_options("threads") setting, which quanteda itself defaults to all available cores. Note that up to NLPstudio 1.1.1 the old ncores argument defaulted to 1, silently throttling quanteda to a single thread; the new default inherits the (parallel) session setting.
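
A minimal sketch of the scoping idea, assuming it is done with quanteda::quanteda_options() and on.exit(). The helper with_quanteda_threads is hypothetical and only illustrates the "set for the call, restore on exit" pattern; the package's internal implementation may differ.

```r
# Hypothetical helper (not part of NLPstudio or quanteda): evaluate `code`
# with quanteda's thread pool limited to `threads`, then restore the
# previous session-wide setting.
with_quanteda_threads <- function(threads, code) {
  if (!is.null(threads)) {
    old <- quanteda::quanteda_options("threads")
    # on.exit() also runs when `code` fails, so the old value is restored.
    on.exit(quanteda::quanteda_options(threads = old), add = TRUE)
    quanteda::quanteda_options(threads = threads)
  }
  # `code` is a lazy promise; it is evaluated here, after the option is set.
  force(code)
}

before <- quanteda::quanteda_options("threads")
inside <- with_quanteda_threads(1L, quanteda::quanteda_options("threads"))
after  <- quanteda::quanteda_options("threads")
```

With threads = NULL the helper touches nothing, which mirrors the documented default of inheriting the session setting.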

When no second matrix y is supplied, the quanteda.textstats result is already a symmetric packed matrix (textstat_simil or textstat_dist) in input document order and is returned as is. Earlier versions round-tripped this object through a dense base matrix to reorder and re-pack it, which allocated a full dense N x N copy without changing the result; that step has been removed.
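
The properties mentioned above can be checked on a small input by calling quanteda.textstats directly. This is a sketch for illustration: the densification via as.matrix() is acceptable only because the example has four documents, and dense_gib is a hypothetical helper that estimates the size of a dense double-precision N x N copy.

```r
dfmat <- quanteda::dfm(quanteda::tokens(c(
  "this is a test", "another document",
  "more text here", "testing similarity"
)))

sim <- quanteda.textstats::textstat_simil(dfmat,
                                          margin = "documents",
                                          method = "cosine")

# Densify only for this tiny check; as.matrix() allocates a full N x N copy.
m <- as.matrix(sim)
is_symmetric <- isSymmetric(m)
same_order   <- identical(rownames(m), quanteda::docnames(dfmat))

# Rough size in GiB of a dense double matrix for n documents (8 bytes/cell).
dense_gib <- function(n) n^2 * 8 / 1024^3
```

For 50,000 documents, dense_gib(50000) comes to roughly 18.6 GiB, which gives a sense of what a dense round trip costs at scale.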

Examples

dfmat <- quanteda::dfm(quanteda::tokens(c(
  "this is a test", "another document",
  "more text here", "testing similarity"
)))

result_simil <- calculate_similarity(dfmat,
                                     margin = "documents",
                                     method = "cosine")
#> 
#> ── Calculating similarity ──
#> 
#> ℹ textstat_simil() called with the following parameters
#> → margin = documents
#> → method = cosine
#> ✔ Done

result_dist <- calculate_distance(dfmat,
                                  margin = "documents",
                                  method = "euclidean")
#> 
#> ── Calculating distance ──
#> 
#> ℹ textstat_dist() called with the following parameters
#> → margin = documents
#> → method = euclidean
#> ✔ Done