Compute similarity and distance measures using textstat_simil or textstat_dist, optionally using additional CPU threads through quanteda's built-in multithreading backend.

Usage

calculate_similarity(x, ncores = 1, ...)

calculate_distance(x, ncores = 1, ...)

Arguments

x

A quanteda dfm object.

ncores

Integer. Number of threads to set through quanteda::quanteda_options(threads = ) for the duration of the call. Defaults to 1. The previous setting is restored on exit, so the caller's global options are not affected.

...

Additional arguments passed to textstat_simil or textstat_dist.

Value

A matrix of S4 class textstat_simil or textstat_dist, as returned by the corresponding quanteda.textstats function. These classes build on matrix classes from the Matrix package.

Details

Earlier versions of these functions attempted to parallelise by splitting the input dfm by rows and computing partial matrices on separate workers. That approach is fundamentally incorrect: each chunk only sees its own documents, so cross-chunk pairs are never evaluated and the merged result is block-diagonal rather than a true full pairwise matrix.
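The block-diagonal problem can be reproduced with a toy example in base R. This is only an illustrative sketch: the matrix x, the helper cosine_rows() and the two-chunk split are made up for the demonstration and are not the code of the earlier implementation.

```r
# Toy document-feature counts: 4 documents x 3 features (made-up data).
x <- matrix(c(1, 0, 2,
              0, 1, 1,
              3, 1, 0,
              1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), paste0("f", 1:3)))

# Hypothetical helper: cosine similarity between all rows of a matrix.
cosine_rows <- function(m) {
  unit <- m / sqrt(rowSums(m^2))  # scale every row to unit length
  tcrossprod(unit)                # dot products of all row pairs
}

# Correct: every document is compared with every other document.
full <- cosine_rows(x)

# Flawed: split the rows into chunks and only compare within each chunk.
chunks <- list(1:2, 3:4)
merged <- matrix(0, nrow = 4, ncol = 4, dimnames = dimnames(full))
for (idx in chunks) {
  merged[idx, idx] <- cosine_rows(x[idx, , drop = FALSE])
}

full["doc1", "doc3"]    # non-zero: doc1 and doc3 share feature f1
merged["doc1", "doc3"]  # 0: this pair was never evaluated
```

Within each chunk the two results agree, but every pair that spans two chunks is left at zero in merged, which is why the row-splitting approach cannot be repaired by simply merging the partial results.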

The correct place to parallelise all-pairs similarity or distance is the underlying linear algebra, which quanteda already multithreads internally. Setting ncores > 1 therefore exposes that mechanism rather than introducing an external worker pool.
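The set-and-restore behaviour of ncores can be sketched as follows. The helper name with_quanteda_threads() is hypothetical and the package's actual implementation may differ; the sketch only assumes that quanteda is installed and relies on quanteda::quanteda_options() for reading and setting the threads option.

```r
# Hypothetical helper (not part of this package): evaluate an expression
# with a temporary quanteda thread count, then restore the old value.
with_quanteda_threads <- function(ncores, expr) {
  old <- quanteda::quanteda_options("threads")
  # Registered before the change so the old value is restored even on error.
  on.exit(quanteda::quanteda_options(threads = old), add = TRUE)
  quanteda::quanteda_options(threads = ncores)
  force(expr)  # expr is lazy, so it is evaluated under the new setting
}

before <- quanteda::quanteda_options("threads")
inside <- with_quanteda_threads(1L, quanteda::quanteda_options("threads"))
after  <- quanteda::quanteda_options("threads")
```

Because the restore is registered with on.exit(), the caller's thread setting is the same before and after the call regardless of whether the computation succeeds.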

If a second matrix (y, passed through ...) is not provided, the output is forced into a symmetric structure using Matrix::forceSymmetric and packed into a dspMatrix, which stores only one triangle of the matrix, for memory efficiency. The result is then wrapped in the appropriate quanteda.textstats S4 class (textstat_simil or textstat_dist).
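The symmetrise-and-pack step can be illustrated on its own with the Matrix package. This is a minimal sketch on a made-up 3 x 3 matrix, assuming Matrix::pack() is used for the packing; it does not show how the result is wrapped in the quanteda.textstats classes.

```r
library(Matrix)

# Made-up pairwise scores; the lower triangle carries a tiny rounding
# difference, so the matrix is not exactly symmetric.
m <- matrix(c(1.0, 0.5, 0.2,
              0.5, 1.0, 0.3,
              0.2, 0.3, 1.0),
            nrow = 3, byrow = TRUE)
m[2, 1] <- m[2, 1] + 1e-12

dense  <- Matrix(m, sparse = FALSE)          # general dense matrix
symm   <- forceSymmetric(dense, uplo = "U")  # keep the upper triangle only
packed <- pack(symm)                         # packed symmetric storage

class(packed)
length(packed@x)  # n * (n + 1) / 2 stored values instead of n * n
```

For n documents the packed form keeps n * (n + 1) / 2 values, which is where the memory saving over a full n x n matrix comes from.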

Examples

dfmat <- quanteda::dfm(quanteda::tokens(c(
  "this is a test", "another document",
  "more text here", "testing similarity"
)))

result_simil <- calculate_similarity(dfmat,
                                     margin = "documents",
                                     method = "cosine")
#> 
#> ── Calculating similarity ──
#> 
#>  textstat_simil() called with the following parameters
#> → margin = documents
#> → method = cosine
#>  Using 1 thread(s) via quanteda
#>  Done

result_dist <- calculate_distance(dfmat,
                                  margin = "documents",
                                  method = "euclidean")
#> 
#> ── Calculating distance ──
#> 
#>  textstat_dist() called with the following parameters
#> → margin = documents
#> → method = euclidean
#>  Using 1 thread(s) via quanteda
#>  Done