Changelog • NLPstudio

NLPstudio 1.2.0 (2026-07-23)

This is a performance release: same results, less time, less memory. A full audit of CPU and memory behavior reorganized how the package uses quanteda’s internal multithreading, BLAS, worker processes, and dense matrices. A benchmark harness (under bench/, with a committed v1.1.1 baseline) now documents every claim; the new performance-and-threading vignette explains the complete threading model.

NEW FEATURES

All quanteda-backed corpus verbs (tokenize_corpus(), reshape_corpus(), summarize_corpus(), lookup_tokens(), ngram_tokens(), compound_tokens(), calculate_readability(), calculate_similarity(), calculate_distance()) gained a threads argument that scopes quanteda’s internal (TBB) thread pool for the duration of the call and restores the previous setting on exit. threads = NULL (default) respects the session-wide quanteda::quanteda_options("threads") setting, which quanteda itself defaults to all cores.
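The scoping described in this entry can be pictured with a small wrapper. This is a minimal sketch, not the package's implementation: with_quanteda_threads() is a hypothetical helper, and the demonstration only runs when quanteda is installed. The only real API it touches is quanteda::quanteda_options().

```r
# Hypothetical helper (not part of NLPstudio): evaluate `code` with a temporary
# quanteda thread count and restore the previous value when the call exits.
with_quanteda_threads <- function(threads, code) {
  if (is.null(threads)) {
    return(force(code))  # NULL: leave the session-wide setting untouched
  }
  old <- quanteda::quanteda_options("threads")
  on.exit(quanteda::quanteda_options(threads = old), add = TRUE)
  quanteda::quanteda_options(threads = threads)
  force(code)  # `code` is a lazy promise, so it runs under the new setting
}

if (requireNamespace("quanteda", quietly = TRUE)) {
  before <- quanteda::quanteda_options("threads")
  inside <- with_quanteda_threads(1L, quanteda::quanteda_options("threads"))
  after  <- quanteda::quanteda_options("threads")
}
```

Because the restore is registered with on.exit(), the previous setting comes back even if the wrapped code throws an error.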
fit_topic_model() and as_nlp_topic_fit() gained keep_backend_data (default FALSE): seededlda fits no longer retain a full copy of the input dfm inside model_object$data; a zero-count dfm with identical dimensions, dimnames, and docvars replaces it, freeing O(nonzeros) per fit while every predict/print/extraction path keeps working. Set to TRUE to retain the counts.
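To illustrate the idea of a zero-count placeholder, the sketch below uses a plain sparse matrix from the Matrix package as a stand-in for a quanteda dfm (an assumption made to keep the example dependency-light; docvars are not modeled). The placeholder keeps dimensions and dimnames but stores no entries.

```r
library(Matrix)  # recommended package shipped with R

# Toy document-feature counts; a sparse matrix stands in for a dfm here.
counts <- sparseMatrix(
  i = c(1, 1, 2, 3), j = c(1, 3, 2, 3), x = c(2, 1, 4, 5),
  dims = c(3, 3),
  dimnames = list(c("doc1", "doc2", "doc3"), c("tax", "risk", "debt"))
)

# Same shape and names, but no stored non-zero entries.
skeleton <- sparseMatrix(
  i = integer(0), j = integer(0), x = numeric(0),
  dims = dim(counts), dimnames = dimnames(counts)
)

n_stored_counts   <- length(counts@x)    # stored non-zeros in the original
n_stored_skeleton <- length(skeleton@x)  # stored non-zeros in the placeholder
```

Anything that only needs the shape, document names, or feature names can read them from the placeholder; anything that needs the counts cannot, which is the trade-off keep_backend_data = TRUE is there to reverse.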
select_k_topics() gained stability_reuse_fit (default FALSE) and assess_topic_stability() gained initial_fit: opt in to reuse each candidate K’s evaluation fit as the first stability run, saving one refit per K (the reference run then differs from the default).
New vignette performance-and-threading: the three thread pools of an R session (quanteda TBB, BLAS, data.table OpenMP), which knob controls each, the worker-cluster hygiene contract, HPC guidance, and the memory levers for topic-model fits.

CHANGES

The corpus verbs above now run as a single call inside quanteda’s multithreaded C++ core. The previous chunked PSOCK backend duplicated quanteda’s own parallelism while paying cluster startup, per-chunk serialization of constant arguments (dictionaries, phrase patterns), O(n^2) re-concatenation, and an eager chunk copy even on the default sequential path; against the committed baseline the old ncores = 4 paths were 3-16x slower than sequential, and the new single calls match or beat sequential with 28-50% fewer allocations. Process-level parallelism remains where it genuinely helps: parse_corpus() (external spaCy pipelines), from_json_to_df() (I/O-bound ingestion), and the model-selection layer.
calculate_similarity()/calculate_distance() no longer round-trip the symmetric result through a dense N x N base matrix (the packed sparse result is returned directly), and their old ncores default of 1 no longer silently throttles quanteda to a single thread: the new threads default respects the session setting. The dense round-trip’s removal also fixes margin = "features" without y, which previously errored.
evaluate_topic_model() builds the dense topics-by-vocabulary matrix at most once per call (previously up to four times on return_tww = FALSE fits) and shares one top-N term index across coherence, diversity, and exclusivity; top-N ranking uses partial sorts instead of full O(V log V) sorts per topic; exclusivity no longer materializes a full K x V intermediate; likelihood metrics subset before transposing and process sparse triplets in bounded blocks. Benchmarks with a 50k-term vocabulary: get_top_terms() 29-36x faster; evaluate_topic_model() 16x (cached TWW) to 74x (lean fits) faster with peak allocations down 33-60%. Lean fits now evaluate at cached-fit speed, making return_tww = FALSE a pure memory saving.
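The partial-sort idea behind the top-N ranking change can be sketched in base R. top_n_idx() is a hypothetical helper written for this illustration, not the package's internal code: it finds the N-th largest weight with sort(partial = ), then fully orders only the few candidates at or above that threshold.

```r
# Hypothetical helper: indices of the n largest values of x, best first.
top_n_idx <- function(x, n) {
  n <- min(n, length(x))
  cut <- length(x) - n + 1L
  # Partial sort: only position `cut` is guaranteed to hold its sorted value.
  threshold <- sort(x, partial = cut)[cut]
  candidates <- which(x >= threshold)
  # Full ordering is applied to the candidates only, then trimmed to n.
  candidates[order(x[candidates], decreasing = TRUE)][seq_len(n)]
}

set.seed(42)
weights <- runif(50000)          # one topic's weights over a large vocabulary
top10   <- top_n_idx(weights, 10)
```

The result matches order(weights, decreasing = TRUE)[1:10], but the expensive full sort over the whole vocabulary is avoided.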
select_k_topics() and assess_topic_stability() ship the training matrix to each parallel worker exactly once; per-task payloads collapse to the (k, seed) pair. Previously the worker closures captured the calling frame and the full corpus was re-serialized with every K/seed task. Every package-created PSOCK cluster now caps each worker’s thread pools to one thread (quanteda TBB, data.table, and BLAS - the latter via environment variables inherited at spawn, plus RhpcBLASctl when installed), eliminating the ncores x threads oversubscription that could saturate every core during parallel grid searches.
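A rough base-R picture of the ship-once pattern follows. It is a sketch under stated assumptions, not the package's code: the environment variable names are common BLAS/OpenMP knobs chosen for illustration, score_task() is a placeholder for fitting and scoring one model, and the training matrix is random data.

```r
library(parallel)

# Cap common BLAS/OpenMP pools before spawning; workers inherit the environment.
# (Assumed variable names; this also changes the current session's environment.)
Sys.setenv(OMP_NUM_THREADS = "1", OPENBLAS_NUM_THREADS = "1", MKL_NUM_THREADS = "1")

cl <- makeCluster(2L)

training <- matrix(rpois(200, 2), nrow = 20)           # stand-in training matrix
clusterExport(cl, "training", envir = environment())   # shipped once per worker

# Placeholder worker: reads `training` from the worker's global environment.
score_task <- function(task) {
  set.seed(task$seed)
  sum(training) / task$k + runif(1)
}
# Detach the function from the calling frame so that frame is not serialized
# along with every task.
environment(score_task) <- globalenv()

# Per-task payloads are just the (k, seed) pair.
tasks  <- list(list(k = 2, seed = 1), list(k = 3, seed = 1), list(k = 2, seed = 2))
scores <- parLapply(cl, tasks, score_task)
stopCluster(cl)
```

The large object crosses the process boundary once per worker, while each task carries only two numbers.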
Grid evaluation reuses invariant work: the coherence training preparation and the likelihood vocabulary alignment are computed once per grid (or once per worker) instead of once per K. The STM documents builder is O(nonzeros) instead of O(documents x nonzeros) and is no longer invoked twice per prediction, which also speeds up STM fitting. Stability assessment streams topic alignment (two aligned matrices in memory instead of all runs twice) and reads each fit’s topic-word matrix without a wide-table round trip.
singularize_tokens() reads the vocabulary via quanteda::types() instead of building a full document-feature matrix, applies min_char before vocabulary extraction, and evaluates its rule set as vectorized passes over the vocabulary. Because types() preserves case, mixed-case plurals (e.g. "Companies") are now singularized with their case shape restored; the previous lower-cased vocabulary silently skipped them. stem_tokens() stems the vocabulary in one char_wordstem() call. Both verbs’ token-type replacements are now exact-match (case_insensitive = FALSE), so each cased variant maps to its own form.
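The case-shape restoration can be illustrated with a toy vocabulary pass. This sketch is an assumption-laden simplification: singularize_vocab() is a hypothetical helper, and the single "ies" to "y" rule stands in for the package's full rule set.

```r
# Hypothetical helper: map each cased type to a singular form with the same
# case shape (lower, Title, or ALL CAPS). One toy rule only.
singularize_vocab <- function(types) {
  lower    <- tolower(types)
  singular <- sub("ies$", "y", lower)                 # toy rule: ...ies -> ...y
  all_caps <- types == toupper(types) & types != lower
  title    <- !all_caps & substr(types, 1, 1) != substr(lower, 1, 1)
  singular[all_caps] <- toupper(singular[all_caps])
  singular[title] <- paste0(
    toupper(substr(singular[title], 1, 1)),
    substring(singular[title], 2)
  )
  stats::setNames(singular, types)                    # exact-match lookup table
}

vocab_map <- singularize_vocab(c("companies", "Companies", "COMPANIES", "data"))
```

Each cased variant keeps its own entry in the lookup table, which is what an exact-match replacement needs.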
from_json_to_df() parses each batch with a single vectorized RcppSimdJson::fload() call, preallocates its chunk list, drops a forced per-chunk gc(), and no longer copies every parsed table. Output on the bundled 10-K fixtures is byte-identical to 1.1.1.
summarize_topics() computes topic ids and prevalence from compact matrices instead of materializing wide DTW/TWW tables; estimate_stm_topic_effects() tidying calls summary() once for all topics instead of once per topic.
The OpTop bridge targets the current OpTop API (>= 0.19, where optimal_topic() became optop_select() and the grid argument became topic_models). Because OpTop now consumes nlp_topic_fit objects natively - the test touches each model only through its fitted word probabilities - as_optop_input() passes fits through as-is instead of unwrapping raw LDA_VEM objects, and the former VEM-only restriction is lifted: grids may mix fitting methods (and supported classes) provided every model was fitted on the same corpus and vocabulary. The returned object’s grid element is now named topic_models, matching OpTop’s argument; lda_models remains as a deprecated alias for existing scripts.

BUG FIXES

plot_top_terms() no longer mutates the caller’s top_terms table by reference.
define_corpus() anchors the filename-extension strip (\\.(htm|txt)$), so filenames like report.html are no longer mangled by a mid-name match.
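The effect of the anchor is easy to reproduce in base R. The unanchored pattern below is an assumption about what the earlier code looked like, used only to show the failure mode.

```r
files <- c("report.html", "filing.htm", "notes.txt")

# Unanchored (assumed previous form): ".htm" matches inside ".html".
unanchored <- sub("\\.(htm|txt)", "", files)
#> "reportl" "filing" "notes"

# Anchored: only a trailing ".htm" or ".txt" is stripped.
anchored <- sub("\\.(htm|txt)$", "", files)
#> "report.html" "filing" "notes"
```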

DEPRECATIONS

The ncores, nchunks, and socket arguments of the quanteda-backed corpus verbs are deprecated: supplying any of them raises a warning of class NLPstudio_deprecated. nchunks and socket are ignored, while ncores is mapped to threads when threads is not given. All three remain fully functional on the process-parallel tier (parse_corpus(), from_json_to_df()) and on the model-selection layer (select_k_topics(), assess_topic_stability()), where worker processes are the right tool. Removal of the deprecated arguments is planned for a future release (not before 1.4.0).
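Because the warning carries its own class, existing scripts can silence it selectively while migrating. The sketch below uses a toy function, toy_verb(), as a stand-in for a corpus verb (it is hypothetical and does no text processing); only the class name NLPstudio_deprecated comes from this changelog.

```r
# Toy stand-in for a corpus verb: warns with a classed condition and maps
# `ncores` to `threads` when `threads` is not supplied.
toy_verb <- function(x, threads = NULL, ncores = NULL) {
  if (!is.null(ncores)) {
    warning(warningCondition(
      "`ncores` is deprecated; use `threads` instead.",
      class = "NLPstudio_deprecated"
    ))
    if (is.null(threads)) threads <- ncores
  }
  threads
}

# Muffle only the deprecation class; any other warning still surfaces.
threads_used <- withCallingHandlers(
  toy_verb("some text", ncores = 2L),
  NLPstudio_deprecated = function(w) invokeRestart("muffleWarning")
)
```

Handling the class rather than wrapping calls in suppressWarnings() keeps unrelated warnings visible.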

NLPstudio 1.1.1 (2026-06-18)

NEW FEATURES

as_nlp_topic_fit() gained return_dtw and return_tww arguments (both default TRUE), mirroring fit_topic_model(). Setting return_tww = FALSE skips materializing the standardized topic-word matrix at adoption, which avoids large memory spikes when converting models with very large vocabularies (for example n-gram models). get_tww() and get_dtw() reconstruct the tables on demand from the retained model object. The flags apply to every adopted backend: topicmodels, seededlda, stm, and text2vec.
get_representative_candidates() gained a top_n argument. When supplied, it keeps only the top_n highest-ranked documents within each dominant topic (default NULL returns every document).

CHANGES

Reduced peak memory when standardizing document-topic (DTW) and topic-word (TWW) matrices during conversion and extraction: the builders now set dimnames in a single copy-on-modify step and coerce storage only when needed, instead of triggering several full copies of large matrices. This also benefits fit_topic_model(), which shares the same builders.
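The pattern reads roughly as follows in base R. standardize_matrix() is a hypothetical helper for illustration, and the zero-padded Topic### labels are an assumed rendering of the package's naming convention.

```r
# Hypothetical builder step: coerce storage only when necessary and set both
# dimnames in one assignment rather than via separate rownames<-/colnames<-.
standardize_matrix <- function(m, row_ids, col_ids) {
  if (!is.double(m)) storage.mode(m) <- "double"
  dimnames(m) <- list(row_ids, col_ids)
  m
}

dtw <- standardize_matrix(
  matrix(1:6, nrow = 2),
  row_ids = c("doc1", "doc2"),
  col_ids = sprintf("Topic%03d", 1:3)
)
```

Each replacement call on a large matrix can trigger a duplication, so fewer and conditional replacement calls mean fewer opportunities for a full copy.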
Removed a redundant matrix copy in the ETM topic-word extractor.
get_representative_candidates() now derives each document’s dominant topic directly from the document-topic matrix in O(documents) memory, instead of materializing and copying the full document-by-topic table. Its output is now compact: the documented doc_id, topic_max_*, topic_rank, and candidate_band columns (plus requested metadata/text), without the per-document Topic### distribution columns that were previously included as an undocumented byproduct. This unblocks representative-candidate extraction for fits with very large document counts (e.g. n-gram sequential LDA).
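Deriving the dominant topic straight from the matrix can be sketched with max.col(). This is an illustration on toy data, not the package's code; topic_max_weight is a hypothetical column name standing in for the documented topic_max_* family.

```r
# Toy document-topic weights (rows: documents, columns: topics).
dtw <- matrix(
  c(0.70, 0.20, 0.10,
    0.10, 0.30, 0.60,
    0.25, 0.50, 0.25),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), sprintf("Topic%03d", 1:3))
)

# One integer per document; no wide document-by-topic table is built.
topic_max_int <- max.col(dtw, ties.method = "first")

dominant <- data.frame(
  doc_id           = rownames(dtw),
  topic_max_int    = topic_max_int,
  topic_max_id     = colnames(dtw)[topic_max_int],
  # Matrix indexing by (row, column) pairs picks each document's top weight.
  topic_max_weight = dtw[cbind(seq_len(nrow(dtw)), topic_max_int)]
)
```

The extra memory is a handful of length-N vectors, regardless of the number of topics.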

NLPstudio 1.1.0 (2026-06-18)

NEW FEATURES

Added a token-preprocessing layer so common steps stay inside the NLPstudio API instead of dropping down to raw quanteda. All helpers are parallel-aware and follow the package’s ncores/nchunks/socket conventions:
- ngram_tokens() builds n-grams and skip-grams.
- detect_collocations() scores candidate multiword expressions and returns an export-ready table.
- compound_tokens() binds phrases or detected collocations into single tokens, and composes with detect_collocations().
- stem_tokens() applies the Snowball stemmer, with multilingual support.
- lemmatize_tokens() maps tokens to lemmas through a dependency-free lookup map or the optional spacyr backend.
- weight_dfm() applies TF-IDF and the other quanteda weighting schemes.

DOCUMENTATION

Extended the corpus-workflow vignette with a “Preprocessing the Tokens” section demonstrating the new helpers end to end.
Clarified the structural topic model (STM) limitations in the topic-model-api vignette: prevalence covariates are supported, while content covariates and automatic prediction for prevalence fits are intentionally not.

NLPstudio 1.0.2 (2026-06-07)

DOCUMENTATION

Added a new vignette, choosing-k (“Choosing the Number of Topics”): a dedicated guide to selecting the number of topics, covering select_k_topics(), summarize_k_selection(), the selection plot() method, in-depth evaluate_topic_model(), assess_topic_stability(), and the OpTop chi-square test of Lewis and Grossetti (2022). It is worked end-to-end on a real corpus (US presidential inaugural addresses).
Fixed the rendered order of the pkgdown articles so that corpus preparation precedes the topic-model API, followed by the new model-selection vignette.
Trimmed the README so the release-status section no longer duplicates the topic-model output-schema contract already documented in the topic-model-api vignette and in NEWS.
Consolidated the OpTop section of the topic-model-api vignette into a pointer to the new choosing-k vignette, removing the duplicated worked example, and aligned the package-level description wording with DESCRIPTION.

NOTES

Added CITATION.cff to .Rbuildignore so the GitHub/Zenodo citation file is excluded from the build, clearing the two R CMD check NOTEs about a non-standard top-level CITATION file. R’s citation() continues to use inst/CITATION.

NLPstudio 1.0.1 (2026-06-05)

DOCUMENTATION

Audited and corrected inline code formatting across the documentation so that code-like tokens (function calls, argument names, argument values, class names, and column/field names) render as inline code, while narrative prose and conceptual abbreviations are left as regular text. This improves readability of the pkgdown site, particularly on mobile.
Applied the pass consistently to roxygen documentation, the corpus-workflow and topic-model-api vignettes, and the README.

NLPstudio 1.0.0 (2026-06-04)

PUBLIC RELEASE

Released NLPstudio as a stable public toolkit for social science text analysis, corpus management, and topic modeling.
Finalized the public API-stability commitment for the core topic-model output surfaces, including nlp_topic_fit, nlp_k_selection, nlp_k_selection_summary, nlp_topic_stability, topic summaries, STM topic summaries, and STM topic-effect tables.
Updated package metadata, README release language, and vignette stability wording from the v1.0.0-rc1 release candidate to the final v1.0.0 public release.
Confirmed the pkgdown site and citation metadata as part of the public documentation surface. DOI minting remains tied to the GitHub release and Zenodo archive workflow.

NLPstudio 0.99.0 (2026-06-04)

RELEASE CANDIDATE

Prepared the v1.0.0-rc1 pre-release branch. This is not the final public launch; DOI minting and broad release announcements remain reserved for v1.0.0.
Added pkgdown site configuration and a GitHub Pages workflow so the public documentation site can be built and deployed from main.
Rewrote the README as a shorter landing page that points readers to the pkgdown reference and workflow vignettes for function-level detail.
Updated GitHub installation instructions to use pak with pak::pkg_install("contefranz/NLPstudio").
Added package citation metadata for citation("NLPstudio").

NLPstudio 0.9.7 (2026-06-03)

CHANGES

Started the pre-v1.0.0 API-freeze pass with explicit output-contract tests for nlp_topic_fit, nlp_k_selection, nlp_k_selection_summary, nlp_topic_stability, summarize_topics(), summarize_stm_topics(), and estimate_stm_topic_effects().
Resolved issue #17 by retaining standard evaluation and selection columns after post-evaluation workflows. Long-format outputs keep metric, level, topic_id, value, and supported; aggregate rows retain topic_id with NA values instead of dropping the column.
Removed the deprecated metrics = "perplexity" alias from evaluate_topic_model() and select_k_topics() validation. Use the final metric name held_out_perplexity instead.
Triaged issue #14 as a sequential chunked corpus-construction enhancement and closed it as non-blocking for the public API freeze.

DOCUMENTATION

Added public API stability guidance to the README and topic-model API vignette, including the stable output surfaces expected to carry into v1.0.0.

NLPstudio 0.9.6 (2026-06-03)

NEW FEATURES

Added get_stm_topic_labels() to expose STM-native probability, FREX, lift, score, and optional SAGE topic labels as standardized long tables with NLPstudio Topic### identifiers.
Added summarize_stm_topics() to extend summarize_topics() with collapsed STM-native label columns for interpretation and export workflows.
Added estimate_stm_topic_effects() to wrap stm::estimateEffect() and return tidy prevalence-effect coefficient tables while preserving the raw STM effect object as an attribute.

DOCUMENTATION

Updated README, manuals, and the topic-model API vignette with STM interpretation examples and clearer guidance that STM content covariates remain unsupported in v0.9.6.

NLPstudio 0.9.5 (2026-05-22)

NEW FEATURES

Added summarize_k_selection() to convert select_k_topics() output into a wide, export-ready table with one row per candidate topic count.
Added optional OpTop result parsing so externally computed OpTop::optimal_topic() statistics can be merged back into NLPstudio selection summaries without making OpTop a package dependency.

DOCUMENTATION

Updated the README and topic-model API vignette with a reporting workflow that combines NLPstudio selection metrics, stability rows, and optional OpTop statistics for paper-ready model-selection tables.

NLPstudio 0.9.4 (2026-05-21)

NEW FEATURES

Added optional stm backend support to fit_topic_model() with engine = "stm" and model = "stm", including standardized DTW/TWW extraction, top terms, summaries, evaluation, K selection, and stability diagnostics.
Added as_nlp_topic_fit() support for raw stm STM objects that use a single topic-word distribution.

CHANGES

STM prevalence covariates can be supplied through control$fit$prevalence and control$fit$data. When a quanteda DFM has document variables, those docvars are used as STM metadata if explicit data are omitted.
STM content covariates are explicitly rejected in v0.9.4 because they imply covariate-specific topic-word distributions, while NLPstudio currently standardizes one TWW matrix per fit.

DOCUMENTATION

Updated manuals, README, and the topic-model API vignette to document STM prevalence support and the deferred content-covariate design.

NLPstudio 0.9.3 (2026-05-20)

NEW FEATURES

Added as_optop_weighted_dfm() and as_optop_input() so NLPstudio topicmodels LDA VEM fits can be passed to OpTop without refitting or reimplementing OpTop’s C++ routines.

DOCUMENTATION

Added OpTop interoperability examples to the topic-model API documentation and README.

NLPstudio 0.9.2 (2026-05-08)

CHANGES

Internalized the singularization backend used by singularize_tokens(). The function no longer depends on the archived pluralize package while preserving the existing public interface.

DOCUMENTATION

Updated the corpus workflow vignette so singularization is evaluated during vignette builds and described as package-owned functionality.

CI

Removed the GitHub-only pluralize installation from the coverage workflow.

NLPstudio 0.9.1 (2026-05-08)

NEW FEATURES

Expanded as_nlp_topic_fit() so existing topic-model fits from topicmodels, seededlda, raw text2vec WarpLDA/LDA objects, and saved legacy warp_lda() outputs can be adopted as current nlp_topic_fit objects without refitting models.

DOCUMENTATION

Expanded adoption and migration notes in the topic-model API documentation and README.

NLPstudio 0.9.0 (2026-05-04)

NEW FEATURES

Added assess_topic_stability(), a transparent repeated-fit wrapper around fit_topic_model() for scoring topic stability under the same model specification and different seeds.
Added summarize_topics(), a one-row-per-topic interpretation table with top terms, prevalence, available metrics, representative documents, and optional text or document metadata.
Extended select_k_topics() with optional stability_seeds, stability_resampling, and stability_ncores arguments. Existing defaults are unchanged; when seeds are supplied, aggregate stability rows are added and full stability details are attached as an attribute.

DOCUMENTATION

Added a comprehensive topic-model API vignette covering backend portability, standardized DTW/TWW extraction, prediction, evaluation, K selection, stability assessment, topic summarization, ETM extensions, and table export.
Added a corpus workflow vignette using bundled SEC-style 10-K JSON examples to demonstrate JSON ingestion, corpus construction, tokenization, dictionary lookup, readability, similarity/distance, and table export.
Added five public 10-K JSON example files under inst/extdata/json/ for vignette and example workflows.
Updated visible README release references and the topic-model workflow for v0.9.0.

CI

Removed the dedicated URL-check workflow because external link and DNS failures made it too fragile for merge gating.

NLPstudio 0.8.4 (2026-05-04)

CI

Expanded R CMD check coverage to include Ubuntu oldrel-1, release, and devel, while keeping Windows and macOS on release R.
Added a roxygen consistency workflow using roxygen2 >= 8.0.0 so stale NAMESPACE and man/*.Rd files fail CI.
Added a strict URL check workflow for README, package, and Rd links.

DOCUMENTATION

Refreshed the README release badge to v0.8.4 and linked it to the generic releases page until the release tag exists.
Added a reproducible in-memory topicmodels workflow to the README for the current v0.8.x topic-model API.
Normalized several documentation references to stable article URLs or plain DOI citations so automated URL checks can run consistently.

NLPstudio 0.8.3 (2026-05-04)

TESTS

Raised local line coverage above 95% with focused regression coverage for topic-model internals, JSON ingestion edge cases, tokenizer branching, selection summaries, and evaluation helpers.
Added lightweight synthetic ETM coverage for package-owned accessor and helper behavior without requiring the optional topicmodels.etm or torch backends in CI.

CI

Enforced a 95% Codecov project coverage target with a 1% tolerance while keeping patch coverage informational.

DOCUMENTATION

Added developer coverage instructions so local tests and coverage can be reproduced before pushing.

BUG FIXES

calculate_similarity() and calculate_distance() now preserve backend default method and margin metadata when callers rely on default quanteda.textstats arguments.

NLPstudio 0.8.2 (2026-04-29)

TESTS

Expanded topic API stability coverage for vocabulary alignment, ETM pruning, public topic selectors, legacy DTW/TWW coercion, and unsupported backend combinations.
Added regression coverage for user-facing topic-model warnings and errors while keeping this release behavior-preserving.

NLPstudio 0.8.1 (2026-04-29)

BREAKING CHANGES

Removed the deprecated set_ff_industries() API. Fama-French industry mapping is outside the current package scope and should be performed upstream before corpus analysis.

TESTS

Added focused coverage for under-tested public helpers and compact internal contracts supporting topic-model output handling.

NLPstudio 0.8.0 (2026-04-28)

NEW FEATURES

Added evaluate_topic_model(), a unified interface for evaluating fitted topic models across engines. It reports aggregate and optional topic-level diagnostics for corpus fit, topic structure, coherence, diversity, and training or held-out likelihood metrics in a standardized long-format table.
Added select_k_topics(), a K-grid search helper that fits, evaluates, prints, and plots candidate topic models, with optional document-level holdout splits and parallel execution.
Added get_topic_hyperparameters() and stored standardized topic-model hyperparameters on nlp_topic_fit objects. The accessor exposes topic count (k), alpha, and beta with backend-native source metadata, while sanitized backend controls remain available on the fit object.

NLPstudio 0.7.0 (2026-04-24)

NEW FEATURES

Added predict_topic_model(), a generic post-fit prediction interface that aligns new data to the fitted vocabulary and returns standardized DTW tables across text2vec, topicmodels, seededlda, and topicmodels.etm.
Extended the nlp_topic_fit object contract with a stored vocab field so prediction and downstream helpers no longer depend on cached TWW to recover fitted term order.
Added get_topic_embeddings() and get_term_embeddings() for topicmodels.etm, exposing ETM topic-center and term embeddings in a standardized data.table format.
Added plot_topic_embeddings(), an ETM-specific visualization that uses the backend UMAP summary path to display topic centers and their top associated words in two dimensions.
Post-fit document-level topic-model helpers now omit docvars by default. Use docvars = TRUE in get_dtw(), get_representative_candidates(), or predict_topic_model() when enriched outputs should include available document variables. Existing non-topic metadata columns in standardized DTW table inputs are also retained only when docvars = TRUE. get_representative_candidates() also omits columns matching stored docvar names when docvars = FALSE, even if those names arrive through doc_data.
Document-level topic-model outputs now use a stable column order: doc_id, document metadata, function output columns, and optional text as the final column.
Document-level topic-model outputs now include topic_max_int, the integer topic number corresponding to topic_max_id.

CHANGES

stringr has been removed from Imports. The three internal call sites in define_corpus() and singularize_tokens() have been rewritten in base R (sub(), paste(), grepl()), trimming a transitive dependency chain without changing behaviour. The startup message has been updated accordingly.
text2vec has been moved from Imports to Suggests, bringing it in line with the other topic-model backends (topicmodels, seededlda, topicmodels.etm). fit_topic_model(engine = "text2vec") now emits an informative error if text2vec is not installed. Users who rely on the text2vec engine should install it explicitly: install.packages("text2vec"). The startup message now lists text2vec under optional backends.

NLPstudio 0.6.1 (2026-04-23)

CHANGES

fit_topic_model() now uses a single control = list(model = ..., fit = ..., optimizer = ...) argument instead of separate model_control and fit_control inputs.
The returned nlp_topic_fit object now stores compact docvars, optional doc_data, fitted doc_ids, and matrix-backed DTW/TWW caches instead of retaining the raw modeling input.
get_dtw() and get_representative_candidates() now align post-fit outputs through fitted doc_id values, auto-join stored docvars, and use doc_data only for explicit metadata or text enrichment.
print.nlp_topic_fit() now prints a compact summary so large topic-model fits can be inspected at the console without expanding huge internals.
warp_lda() has been removed from the package surface. Text2vec support is now available only through fit_topic_model(engine = "text2vec", model = "lda").
fit_topic_model() now supports embedded topic models via engine = "topicmodels.etm", model = "etm", with ETM controls routed through control$model, control$fit, and control$optimizer.

NLPstudio 0.6.0 (2026-04-22)

BREAKING CHANGES

warpLDA() has been removed from the public API.

NEW FEATURES

Added fit_topic_model(), a unified topic-model fitting interface across text2vec, topicmodels, and seededlda.
Added get_dtw() and get_tww() to standardize document-topic weights (DTW) and topic-word weights (TWW) using the Topic### naming convention.
Added get_representative_candidates() to extract dominant-topic candidates and band them within topic using quantile or deterministic rank-based fallback rules.

CHANGES

get_top_terms() and plot_dtw() now route through the standardized DTW/TWW extractor layer instead of backend-specific logic.
Text2vec topic modeling is routed through fit_topic_model() using engine = "text2vec", model = "lda".
Package documentation now uses DTW/TWW terminology following Lewis and Grossetti (2022) and documents the returned nlp_topic_fit S3 wrapper.

NLPstudio 0.5.1 (2026-04-22)

NOTES

Added guarded copy-paste examples for the remaining exported, supported functions that previously lacked them. The new examples are written with @examplesIf interactive() so they document intended usage without being executed during package checks.
The examples release intentionally excludes deprecated set_ff_industries() and does not revisit APIs removed in v0.5.0.

NLPstudio 0.5.0 (2026-04-21)

BREAKING CHANGES

get_json_files() has been removed. Users should now discover JSON inputs directly with list.files(..., pattern = "\\.json$", recursive = TRUE, full.names = TRUE) and pass the resulting character vector to from_json_to_df().
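A self-contained illustration of the suggested discovery call, using throwaway files in a temporary directory (the directory and file names are made up for the example):

```r
# Build a small directory tree with two JSON files and one decoy.
root <- file.path(tempdir(), "json-demo")
dir.create(file.path(root, "2024"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(
  file.path(root, "a.json"),
  file.path(root, "2024", "b.json"),
  file.path(root, "notes.json.bak")   # not matched: pattern is anchored with $
))

json_files <- list.files(
  root, pattern = "\\.json$", recursive = TRUE, full.names = TRUE
)
```

json_files is a character vector of full paths, which is the input form this entry describes for from_json_to_df().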
get_sec_master_files() has been removed. SEC master-file ingestion is now considered outside the current NLPstudio scope and should be handled upstream before the data enters the package workflow.

DEPRECATIONS

set_ff_industries() is now soft-deprecated. The function remains exported and functional in v0.5.0, but it emits a deprecation warning and is planned for removal in a future release. Fama-French industry mapping is now treated as an upstream preprocessing step rather than part of the core package API.

NOTES

This release intentionally does not include the examples expansion planned for a follow-up documentation-focused release.

NLPstudio 0.4.1 (2026-04-21)

BREAKING CHANGES

library(NLPstudio) no longer attaches quanteda, quanteda.textstats, data.table, text2vec, or stringr to the search path. Those packages remain in Imports and are fully available inside the package, but users who relied on the implicit attachment for their own code will need to add explicit library() calls. A startup message now states the version and lists the required packages.

Why this changed. The previous behaviour followed the meta-package pattern popularised by the tidyverse: loading one package silently attaches several others. This is convenient at the console but has meaningful costs when NLPstudio is used as a library dependency rather than an interactive toolkit:
- Search-path pollution. Every attached package adds a frame to the search path. Name collisions become more likely as the path grows — for instance, data.table::between() and dplyr::between() resolve differently depending on attachment order, producing bugs that are hard to trace.
- Opacity for downstream packages. A package that Imports NLPstudio unintentionally acquires five additional namespaces on the search path, which can mask functions in its own dependencies without any explicit declaration in its DESCRIPTION.
- Redundancy. Since v0.3.3 every call inside NLPstudio uses fully qualified pkg::function() notation. The package does not need any of these namespaces attached in order to work; it only needs them loaded, which Imports already guarantees.
Users who want the packages attached for interactive work can add library(quanteda); library(data.table) etc. to their own scripts or .Rprofile. Nothing changes for code that already calls those packages explicitly.

NEW FEATURES

define_corpus() gains a default S3 method that produces an informative error when the input is not a data.table, replacing the opaque “no applicable method” dispatch failure.
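The mechanism is ordinary S3 dispatch. The toy generic below is hypothetical and uses data.frame in place of data.table to stay dependency-free; it only shows how a default method turns a dispatch failure into a readable message.

```r
toy_define <- function(x, ...) UseMethod("toy_define")

# Supported input: handled by its own method.
toy_define.data.frame <- function(x, ...) x

# Everything else lands here instead of failing with "no applicable method".
toy_define.default <- function(x, ...) {
  stop(
    "`x` must be a data.frame, not an object of class <", class(x)[1], ">.",
    call. = FALSE
  )
}

msg <- tryCatch(toy_define("not a table"), error = conditionMessage)
```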

BUG FIXES

GitHub Actions R-CMD-check now passes reliably across the supported CI environments. Internal PSOCK execution now falls back to sequential processing when a worker socket cannot be created, which avoids environment-specific failures without changing the public API.
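The fallback can be sketched as a guard around cluster creation. run_chunks() is a hypothetical helper for illustration; the package's internal runner may differ in detail.

```r
# Hypothetical runner: try a PSOCK cluster, fall back to lapply() if the
# cluster cannot be created in this environment.
run_chunks <- function(chunks, fun, ncores = 2L) {
  cl <- tryCatch(parallel::makeCluster(ncores), error = function(e) NULL)
  if (is.null(cl)) {
    return(lapply(chunks, fun))   # sequential path, same return shape
  }
  on.exit(parallel::stopCluster(cl), add = TRUE)
  parallel::parLapply(cl, chunks, fun)
}

totals <- run_chunks(list(1:3, 4:6), sum)
```

Both paths return a list in chunk order, so callers do not need to know which one ran.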
Optional helper packages used only inside specific functions are no longer installed during CI. In particular, pluralize and farr have been removed from Suggests, while singularize_tokens() and set_ff_industries() continue to emit explicit runtime errors when those packages are not installed by the user.

NOTES

Golden tests added for all parallel functions: tokenize_corpus(), calculate_readability(), summarize_corpus(), and reshape_corpus() now each include a test asserting that ncores = 2 produces numerically identical output to ncores = 1. The class of silent parallelization bug that affected calculate_similarity() in v0.2.x would be caught immediately across any of these functions.
Contract tests added for define_corpus() (missing columns individually and in combination, non-data.table input, duplicate doc-ID warning, no temp-column leakage into the input data.table) and warp_lda() (argument routing via positive contracts: valid fit_control args, valid lda_control args, topic count k not overridable via lda_control, return_theta/return_phi flags).
Test count: 96 (up from 66 in v0.3.x).
The package now includes a standard GitHub Actions R-CMD-check workflow and matching README badge.
Roxygen comments were normalized toward Markdown-style notation and the generated documentation was refreshed. Mathematical notation remains in Rd form where appropriate (for example \eqn{}).

NLPstudio 0.3.3 (2026-04-18)

NOTES

Every external function call is now fully namespace-qualified (pkg::function()) throughout all source files. No bare unqualified calls remain for any imported package. This makes dependency resolution unambiguous and removes the need for @importFrom roxygen tags.
All @importFrom tags have been removed from every .R file. The only whole-package imports that remain are @import data.table (required for the := and .() special syntax) and @import ggplot2 (required for + operator dispatch on ggplot objects). The generated NAMESPACE is correspondingly minimal.
parallel has been removed from Imports in DESCRIPTION. parallel is a base R package that ships with every R installation; declaring it in Imports alongside R (>= 4.3) was redundant.

NLPstudio 0.3.2 (2026-04-17)

BUG FIXES

calculate_similarity() / calculate_distance(): quanteda_options("threads") returns a scalar, not a named list — accessing it as $threads raised an error on every call.
warp_lda(): constructor args and fitting args were both routed through ..., causing “unused argument” errors. Replaced with lda_control and fit_control named lists.
define_corpus(): item was used without being validated, causing cryptic downstream errors when the column was absent.
calculate_readability(): bare is.corpus() / corpus() calls relied on search-path attachment not guaranteed inside a package namespace.
from_json_to_df(): setcolorder(..., after = "filing_type") would error when filing_type was absent.
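
The first fix in this list comes down to a base R rule: `$` cannot be applied to an atomic vector. A minimal sketch, assuming a hard-coded integer as a stand-in for the value returned by quanteda_options("threads") so that it runs without quanteda installed:

```r
# Stand-in for quanteda::quanteda_options("threads"), which returns a scalar.
threads <- 4L

# The buggy access pattern: `$` on an atomic vector is an error in R.
msg <- tryCatch(threads$threads, error = function(e) conditionMessage(e))
print(msg)

# The working pattern: use the scalar directly.
n_threads <- as.integer(threads)
stopifnot(is.character(msg), n_threads == 4L)
```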

NEW FEATURES

from_json_to_df(): max_chunk_size promoted from a hidden ... argument to a proper named parameter.

NOTES

Dead code removed: %||% operator and is.textstat_simil_symm() from R/utils.R.
Stale globalVariables("doc_id") removed from R/tokenize_corpus.R.
cli_h2() and cli_alert_success() added to sequential paths of tokenize_corpus(), summarize_corpus(), lookup_tokens(), reshape_corpus().

NLPstudio 0.3.1 (2026-04-16)

BUG FIXES

summarize_corpus(): sequential path returned column document while parallel path returned doc_id. Both paths now rename consistently.

NLPstudio 0.3.0 (2026-04-15)

NEW FEATURES

Unified parallel backend. .run_parallel() and .validate_parallel_args() (in R/utils.R) encapsulate all PSOCK/FORK branching, eliminating ~120 lines of duplicated boilerplate across every parallel function.
calculate_similarity() / calculate_distance() rewritten. The previous row-split approach produced a block-diagonal result (cross-chunk pairs were never evaluated). Replaced with quanteda’s built-in TBB multithreading via quanteda_options(threads = ncores).
Sequential fast paths added to all parallel functions — cluster creation is bypassed entirely when ncores < 2.
Testing infrastructure added (tests/testthat/, 3rd edition, 66 tests).
warp_lda() (snake_case) introduced as canonical name; warpLDA() retained as a deprecated alias.
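
Why a row-split similarity computation yields only a block-diagonal result can be shown with a small base R sketch. The cosine() helper and the toy matrix are assumptions for illustration only; they are not the package's implementation, which works on quanteda objects.

```r
# Cosine similarity between all rows of a dense matrix (base R only).
cosine <- function(m) {
  unit <- m / sqrt(rowSums(m^2))   # scale each row to unit length
  unit %*% t(unit)
}

set.seed(1)
x <- matrix(runif(12), nrow = 4, dimnames = list(paste0("d", 1:4), NULL))

full <- cosine(x)                  # all 4 x 4 document pairs

# Row-split approach: each chunk only ever sees its own rows.
chunks <- split(seq_len(nrow(x)), rep(1:2, each = 2))
blocks <- lapply(chunks, function(i) cosine(x[i, , drop = FALSE]))

# The diagonal blocks agree with the full result ...
stopifnot(isTRUE(all.equal(full[1:2, 1:2], blocks[[1]])))
# ... but cross-chunk pairs such as (d1, d3) exist only in the full matrix.
n_pairs_full   <- length(full)
n_pairs_blocks <- sum(lengths(blocks))
```

With two chunks of two documents each, the blocks cover 8 of the 16 pairs; the share of missing pairs grows as the number of chunks increases.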

BUG FIXES

calculate_similarity() / calculate_distance(): temp_matrix was undefined when y was provided.
get_sec_master_files(): uniqueN() called on a list instead of the bound data.table.
parse_corpus(): on.exit(spacy_finalize) registered too late — moved to immediately after acquiring the function reference.
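
The ordering issue behind the parse_corpus() fix is generic: a cleanup handler only runs if it was registered before the failing step. A minimal sketch, in which acquire(), release(), and run_pipeline() are hypothetical stand-ins and no spaCy session is involved:

```r
events <- character()

# Hypothetical stand-ins for acquiring and releasing an external resource.
acquire <- function() events <<- c(events, "acquire")
release <- function() events <<- c(events, "release")

run_pipeline <- function(fail = FALSE) {
  acquire()
  on.exit(release(), add = TRUE)   # registered before any step that can fail
  if (fail) stop("parsing failed")
  "parsed"
}

result <- try(run_pipeline(fail = TRUE), silent = TRUE)
stopifnot(inherits(result, "try-error"),
          identical(events, c("acquire", "release")))
```

Had on.exit() been placed after the stop() call, the release step would never have been recorded.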

NOTES

Breaking: future and future.apply removed from Imports; parallel (base R) used instead.
glue removed from Imports.
warpLDA() deprecated; will be removed in a future release.

NLPstudio 0.2.0 (2025-10-01)

NOTES

Dependency overhaul: topicmodels moved to Suggests; Imports entries sorted alphabetically.
Package logo updated.

NLPstudio 0.1.5 (2025-09-30)

NEW FEATURES

from_json_to_df() refactored with internal helpers; JSON parsing switched from jsonlite to RcppSimdJson::fload().
Dynamic PSOCK scheduling via clusterApplyLB() across all four corpus-parallel functions.

NOTES

foreach removed from Imports.

NLPstudio 0.1.0 (2025-07-31)

NEW FEATURES

get_top_terms() — extracts top-n terms from φ in long or wide format.
plot_top_terms() — faceted bar chart of per-topic top terms.

NLPstudio 0.0.7 (2025-07-29)

NEW FEATURES

warpLDA() — WarpLDA topic model via text2vec; returns θ, φ, and the model object.
plot_dtw() — faceted histogram of document-topic weight distributions.

NLPstudio 0.0.6 (2025-02-23)

NEW FEATURES

get_sec_master_files() — reads and normalises SEC EDGAR master CSV files.

NOTES

Documentation switched to roxygen2 Markdown rendering.

NLPstudio 0.0.5 (2024-04-19)

NEW FEATURES

summarize_corpus() — parallel corpus summarisation via textstat_summary().

NLPstudio 0.0.4 (2024-04-18)

NEW FEATURES

singularize_tokens() — parallel plural-to-singular token conversion via pluralize.
Package hex logo added.

NLPstudio 0.0.3 (2024-04-13)

NOTES

Structured console output via cli added across all functions.

NLPstudio 0.0.2 (2024-04-09)

NOTES

Minimum quanteda version raised to >= 4.0.1.

NLPstudio 0.0.1 (2024-04-08)

NEW FEATURES

First public release. Core functions: from_json_to_df(), define_corpus(), tokenize_corpus(), reshape_corpus(), lookup_tokens(), parse_corpus(), calculate_readability(), calculate_similarity(), calculate_distance(), set_ff_industries(), get_json_files() (deprecated in v0.1.3). Bundled financial text dictionaries.