
Fast Calculation of Similarity and Distance Measures
Source: R/calculate_simil_dist.R
calculate_similarity.Rd
Compute similarity and distance measures using textstat_simil or textstat_dist, optionally routing additional CPU threads through quanteda's built-in OpenMP backend.
Arguments
- x
A quanteda dfm object.
- ncores
Integer. Number of threads to pass to quanteda::quanteda_options() for the duration of the call. Defaults to 1 (quanteda's own default). The setting is restored on exit so it does not affect the caller's global state.
- ...
Additional arguments passed to textstat_simil or textstat_dist.
Value
A sparse matrix, returned as an S4 object that follows the textstat_simil or textstat_dist class and builds on matrix classes from the Matrix package.
Details
Earlier versions of these functions attempted to parallelise by splitting the input dfm by rows and computing partial matrices on separate workers. That approach is fundamentally incorrect: each chunk only sees its own documents, so cross-chunk pairs are never evaluated and the merged result is block-diagonal rather than a true full pairwise matrix.
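The block-diagonal failure described above can be reproduced with a toy example in base R. This is only an illustrative sketch: cosine_rows() is a hypothetical helper written for this demonstration, not a function of this package or of quanteda, and the two-chunk split stands in for whatever worker layout the earlier versions used.

```r
# Hypothetical helper: cosine similarity between all rows of a dense matrix.
cosine_rows <- function(m) {
  unit <- m / sqrt(rowSums(m^2))  # scale each row to unit length
  unit %*% t(unit)
}

# Toy document-feature counts: 4 documents, 3 features.
x <- matrix(c(1, 1, 0,
              1, 0, 1,
              0, 1, 1,
              1, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), paste0("f", 1:3)))

# Reference result: every document compared with every other document.
full <- cosine_rows(x)

# Row-chunked version: each chunk only sees its own documents.
chunks <- list(1:2, 3:4)
merged <- matrix(0, 4, 4, dimnames = dimnames(full))
for (idx in chunks) {
  merged[idx, idx] <- cosine_rows(x[idx, , drop = FALSE])
}

# Within-chunk blocks agree with the reference result ...
all.equal(merged[1:2, 1:2], full[1:2, 1:2])
# ... but cross-chunk pairs were never computed and keep the fill value.
merged["doc1", "doc3"]
full["doc1", "doc3"]
```

Here doc1 and doc3 share one feature, so their true cosine similarity is non-zero, yet the merged matrix reports the fill value because the two documents landed in different chunks.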
The correct parallelism for all-pairs similarity/distance is at the
linear-algebra level, which quanteda already implements internally via
OpenMP. Setting ncores > 1 therefore exposes that mechanism rather
than introducing an external worker pool.
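A minimal sketch of how a thread setting can be applied for one call and then restored, assuming the threads option of quanteda::quanteda_options(). The helper name with_quanteda_threads() is hypothetical and is not necessarily how this package implements ncores internally.

```r
# Hypothetical helper: evaluate `expr` with a temporary quanteda thread count.
with_quanteda_threads <- function(ncores, expr) {
  # Remember the caller's current setting ...
  old <- quanteda::quanteda_options("threads")
  # ... and restore it when this function exits, even if `expr` fails.
  on.exit(quanteda::quanteda_options(threads = old), add = TRUE)

  quanteda::quanteda_options(threads = ncores)
  # `expr` is a lazily evaluated promise, so it runs only now,
  # after the thread count has been changed.
  force(expr)
}

# Usage: the option is 1 inside the call and unchanged afterwards.
before <- quanteda::quanteda_options("threads")
inside <- with_quanteda_threads(1L, quanteda::quanteda_options("threads"))
after  <- quanteda::quanteda_options("threads")
```

Registering the restore step with on.exit() before changing the option is what keeps the caller's global state intact if the computation stops with an error.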
If a second matrix (y) is not provided, the output is forced into a
symmetric structure using forceSymmetric and packed into a
dspMatrix for memory efficiency. The result is wrapped into
the appropriate quanteda.textstats S4 class
(textstat_simil or textstat_dist).
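The symmetrise-and-pack step can be sketched with the Matrix package alone. This is an illustration under assumptions: the use of Matrix::pack() and the choice of the upper triangle are guesses at one reasonable way to reach a dspMatrix, and the final wrapping into the quanteda.textstats class is left out.

```r
library(Matrix)

# A small dense result whose two triangles disagree slightly
# (0.51 above the diagonal, 0.50 below), as a stand-in for a raw
# pairwise matrix.
raw <- Matrix(matrix(c(1.00, 0.50, 0.20,
                       0.51, 1.00, 0.30,
                       0.20, 0.30, 1.00), nrow = 3),
              sparse = FALSE)

# Keep only the upper triangle and treat the matrix as symmetric.
sym <- forceSymmetric(raw, uplo = "U")

# Packed storage keeps one triangle only: n * (n + 1) / 2 values
# instead of n * n.
packed <- pack(sym)

class(packed)       # expected to be "dspMatrix"
length(packed@x)    # number of stored values
packed[2, 1]        # mirrored from the upper triangle
```

For n documents, packed storage holds n(n + 1)/2 values rather than n^2, which is where the memory saving for a symmetric result comes from.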
Examples
dfmat <- quanteda::dfm(quanteda::tokens(c(
"this is a test", "another document",
"more text here", "testing similarity"
)))
result_simil <- calculate_similarity(dfmat,
margin = "documents",
method = "cosine")
#>
#> ── Calculating similarity ──
#>
#> ℹ textstat_simil() called with the following parameters
#> → margin = documents
#> → method = cosine
#> ℹ Using 1 thread(s) via quanteda
#> ✔ Done
result_dist <- calculate_distance(dfmat,
margin = "documents",
method = "euclidean")
#>
#> ── Calculating distance ──
#>
#> ℹ textstat_dist() called with the following parameters
#> → margin = documents
#> → method = euclidean
#> ℹ Using 1 thread(s) via quanteda
#> ✔ Done