NLPstudio 1.0.1 (2026-06-05)
DOCUMENTATION
Audited and corrected inline code formatting across the documentation so that code-like tokens (function calls, argument names, argument values, class names, and column/field names) render as inline code, while narrative prose and conceptual abbreviations are left as regular text. This improves readability of the pkgdown site, particularly on mobile.
Applied the pass consistently to roxygen documentation, the corpus-workflow and topic-model-api vignettes, and the README.
NLPstudio 1.0.0 (2026-06-04)
PUBLIC RELEASE
Released NLPstudio as a stable public toolkit for social science text analysis, corpus management, and topic modeling.
Finalized the public API-stability commitment for the core topic-model output surfaces, including nlp_topic_fit, nlp_k_selection, nlp_k_selection_summary, nlp_topic_stability, topic summaries, STM topic summaries, and STM topic-effect tables.
Updated package metadata, README release language, and vignette stability wording from the v1.0.0-rc1 release candidate to the final v1.0.0 public release.
Confirmed the pkgdown site and citation metadata as part of the public documentation surface. DOI minting remains tied to the GitHub release and Zenodo archive workflow.
NLPstudio 0.99.0 (2026-06-04)
RELEASE CANDIDATE
Prepared the v1.0.0-rc1 pre-release branch. This is not the final public launch; DOI minting and broad release announcements remain reserved for v1.0.0.
Added pkgdown site configuration and a GitHub Pages workflow so the public documentation site can be built and deployed from main.
Rewrote the README as a shorter landing page that points readers to the pkgdown reference and workflow vignettes for function-level detail.
Updated GitHub installation instructions to use pak with pak::pkg_install("contefranz/NLPstudio").
Added package citation metadata for citation("NLPstudio").
NLPstudio 0.9.7 (2026-06-03)
CHANGES
Started the pre-v1.0.0 API-freeze pass with explicit output-contract tests for nlp_topic_fit, nlp_k_selection, nlp_k_selection_summary, nlp_topic_stability, summarize_topics(), summarize_stm_topics(), and estimate_stm_topic_effects().
Resolved issue #17 by retaining standard evaluation and selection columns after post-evaluation workflows. Long-format outputs keep metric, level, topic_id, value, and supported; aggregate rows retain topic_id with NA values instead of dropping the column.
Removed the deprecated metrics = "perplexity" alias from evaluate_topic_model() and select_k_topics() validation. Use the final metric name held_out_perplexity instead.
Triaged issue #14 as a sequential chunked corpus-construction enhancement and closed it as non-blocking for the public API freeze.
NLPstudio 0.9.6 (2026-06-03)
NEW FEATURES
Added get_stm_topic_labels() to expose STM-native probability, FREX, lift, score, and optional SAGE topic labels as standardized long tables with NLPstudio Topic### identifiers.
Added summarize_stm_topics() to extend summarize_topics() with collapsed STM-native label columns for interpretation and export workflows.
Added estimate_stm_topic_effects() to wrap stm::estimateEffect() and return tidy prevalence-effect coefficient tables while preserving the raw STM effect object as an attribute.
NLPstudio 0.9.5 (2026-05-22)
NEW FEATURES
Added summarize_k_selection() to convert select_k_topics() output into a wide, export-ready table with one row per candidate topic count.
Added optional OpTop result parsing so externally computed OpTop::optimal_topic() statistics can be merged back into NLPstudio selection summaries without making OpTop a package dependency.
NLPstudio 0.9.4 (2026-05-21)
NEW FEATURES
Added optional stm backend support to fit_topic_model() with engine = "stm" and model = "stm", including standardized DTW/TWW extraction, top terms, summaries, evaluation, K selection, and stability diagnostics.
Added as_nlp_topic_fit() support for raw stm STM objects that use a single topic-word distribution.
CHANGES
STM prevalence covariates can be supplied through control$fit$prevalence and control$fit$data. When a quanteda DFM has document variables, those docvars are used as STM metadata if explicit data are omitted.
STM content covariates are explicitly rejected in v0.9.4 because they imply covariate-specific topic-word distributions, while NLPstudio currently standardizes one TWW matrix per fit.
NLPstudio 0.9.3 (2026-05-20)
NEW FEATURES
- Added as_optop_weighted_dfm() and as_optop_input() so NLPstudio topicmodels LDA VEM fits can be passed to OpTop without refitting or reimplementing OpTop’s C++ routines.
NLPstudio 0.9.2 (2026-05-08)
CHANGES
- Internalized the singularization backend used by singularize_tokens(). The function no longer depends on the archived pluralize package while preserving the existing public interface.
NLPstudio 0.9.1 (2026-05-08)
NEW FEATURES
- Expanded as_nlp_topic_fit() so existing topic-model fits from topicmodels, seededlda, raw text2vec WarpLDA/LDA objects, and saved legacy warp_lda() outputs can be adopted as current nlp_topic_fit objects without refitting models.
NLPstudio 0.9.0 (2026-05-04)
NEW FEATURES
Added assess_topic_stability(), a transparent repeated-fit wrapper around fit_topic_model() for scoring topic stability under the same model specification and different seeds.
Added summarize_topics(), a one-row-per-topic interpretation table with top terms, prevalence, available metrics, representative documents, and optional text or document metadata.
Extended select_k_topics() with optional stability_seeds, stability_resampling, and stability_ncores arguments. Existing defaults are unchanged; when seeds are supplied, aggregate stability rows are added and full stability details are attached as an attribute.
DOCUMENTATION
Added a comprehensive topic-model API vignette covering backend portability, standardized DTW/TWW extraction, prediction, evaluation, K selection, stability assessment, topic summarization, ETM extensions, and table export.
Added a corpus workflow vignette using bundled SEC-style 10-K JSON examples to demonstrate JSON ingestion, corpus construction, tokenization, dictionary lookup, readability, similarity/distance, and table export.
Added five public 10-K JSON example files under inst/extdata/json/ for vignette and example workflows.
Updated visible README release references and the topic-model workflow for v0.9.0.
NLPstudio 0.8.4 (2026-05-04)
CI
Expanded R CMD check coverage to include Ubuntu oldrel-1, release, and devel, while keeping Windows and macOS on release R.
Added a roxygen consistency workflow using roxygen2 >= 8.0.0 so stale NAMESPACE and man/*.Rd files fail CI.
Added a strict URL check workflow for README, package, and Rd links.
DOCUMENTATION
Refreshed the README release badge to v0.8.4 and linked it to the generic releases page until the release tag exists.
Added a reproducible in-memory topicmodels workflow to the README for the current v0.8.x topic-model API.
Normalized several documentation references to stable article URLs or plain DOI citations so automated URL checks can run consistently.
NLPstudio 0.8.3 (2026-05-04)
TESTS
Raised local line coverage above 95% with focused regression coverage for topic-model internals, JSON ingestion edge cases, tokenizer branching, selection summaries, and evaluation helpers.
Added lightweight synthetic ETM coverage for package-owned accessor and helper behavior without requiring the optional topicmodels.etm or torch backends in CI.
CI
- Enforced a 95% Codecov project coverage target with a 1% tolerance while keeping patch coverage informational.
DOCUMENTATION
- Added developer coverage instructions so local tests and coverage can be reproduced before pushing.
BUG FIXES
- calculate_similarity() and calculate_distance() now preserve backend default method and margin metadata when callers rely on default quanteda.textstats arguments.
NLPstudio 0.8.1 (2026-04-29)
NLPstudio 0.8.0 (2026-04-28)
NEW FEATURES
Added evaluate_topic_model(), a unified interface for evaluating fitted topic models across engines. It reports aggregate and optional topic-level diagnostics for corpus fit, topic structure, coherence, diversity, and training or held-out likelihood metrics in a standardized long-format table.
Added select_k_topics(), a K-grid search helper that fits, evaluates, prints, and plots candidate topic models, with optional document-level holdout splits and parallel execution.
Added get_topic_hyperparameters() and stored standardized topic-model hyperparameters on nlp_topic_fit objects. The accessor exposes topic count (k), alpha, and beta with backend-native source metadata, while sanitized backend controls remain available on the fit object.
NLPstudio 0.7.0 (2026-04-24)
NEW FEATURES
Added predict_topic_model(), a generic post-fit prediction interface that aligns new data to the fitted vocabulary and returns standardized DTW tables across text2vec, topicmodels, seededlda, and topicmodels.etm.
Extended the nlp_topic_fit object contract with a stored vocab field so prediction and downstream helpers no longer depend on cached TWW to recover fitted term order.
Added get_topic_embeddings() and get_term_embeddings() for topicmodels.etm, exposing ETM topic-center and term embeddings in a standardized data.table format.
Added plot_topic_embeddings(), an ETM-specific visualization that uses the backend UMAP summary path to display topic centers and their top associated words in two dimensions.
Post-fit document-level topic-model helpers now omit docvars by default. Use docvars = TRUE in get_dtw(), get_representative_candidates(), or predict_topic_model() when enriched outputs should include available document variables. Existing non-topic metadata columns in standardized DTW table inputs are also retained only when docvars = TRUE. get_representative_candidates() also omits columns matching stored docvar names when docvars = FALSE, even if those names arrive through doc_data.
Document-level topic-model outputs now use a stable column order: doc_id, document metadata, function output columns, and optional text as the final column.
Document-level topic-model outputs now include topic_max_int, the integer topic number corresponding to topic_max_id.
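The relationship between topic_max_id and topic_max_int can be illustrated with a small base R sketch. It assumes identifiers follow the zero-padded Topic### convention (for example Topic007); the helper name topic_id_to_int() is hypothetical and is not an NLPstudio function.

```r
# Hypothetical helper (not part of NLPstudio): strip the "Topic" prefix
# and convert the zero-padded remainder to an integer.
topic_id_to_int <- function(topic_id) {
  as.integer(sub("^Topic", "", topic_id))
}

# Assumed example identifiers in the Topic### style.
topic_max_id  <- c("Topic001", "Topic012", "Topic103")
topic_max_int <- topic_id_to_int(topic_max_id)
```

An integer column of this kind is convenient for sorting and joining, where the string form would otherwise need parsing.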
CHANGES
stringr has been removed from Imports. The three internal call sites in define_corpus() and singularize_tokens() have been rewritten in base R (sub(), paste(), grepl()), trimming a transitive dependency chain without changing behaviour. The startup message has been updated accordingly.
text2vec has been moved from Imports to Suggests, bringing it in line with the other topic-model backends (topicmodels, seededlda, topicmodels.etm). fit_topic_model(engine = "text2vec") now emits an informative error if text2vec is not installed. Users who rely on the text2vec engine should install it explicitly: install.packages("text2vec"). The startup message now lists text2vec under optional backends.
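A backend that lives in Suggests usually needs a runtime guard of roughly the following shape. This is a minimal sketch, not the package's actual code: require_backend() is a hypothetical helper, and the wording of the real error message in NLPstudio may differ.

```r
# Hypothetical guard for an optional backend declared in Suggests.
require_backend <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(
      sprintf(
        "Package '%s' is required for this engine. Install it with install.packages(\"%s\").",
        pkg, pkg
      ),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# 'stats' ships with R, so the guard passes silently.
require_backend("stats")

# A package that is not installed triggers the informative error.
msg <- tryCatch(
  require_backend("notARealBackendPkg"),
  error = function(e) conditionMessage(e)
)
```

requireNamespace() loads the namespace without attaching it, so the check has no effect on the user's search path.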
NLPstudio 0.6.1 (2026-04-23)
CHANGES
fit_topic_model() now uses a single control = list(model = ..., fit = ..., optimizer = ...) argument instead of separate model_control and fit_control inputs.
The returned nlp_topic_fit object now stores compact docvars, optional doc_data, fitted doc_ids, and matrix-backed DTW/TWW caches instead of retaining the raw modeling input.
get_dtw() and get_representative_candidates() now align post-fit outputs through fitted doc_id values, auto-join stored docvars, and use doc_data only for explicit metadata or text enrichment.
print.nlp_topic_fit() now prints a compact summary so large topic-model fits can be inspected at the console without expanding huge internals.
warp_lda() has been removed from the package surface. Text2vec support is now available only through fit_topic_model(engine = "text2vec", model = "lda").
fit_topic_model() now supports embedded topic models via engine = "topicmodels.etm", model = "etm", with ETM controls routed through control$model, control$fit, and control$optimizer.
NLPstudio 0.6.0 (2026-04-22)
NEW FEATURES
Added fit_topic_model(), a unified topic-model fitting interface across text2vec, topicmodels, and seededlda.
Added get_dtw() and get_tww() to standardize document-topic weights (DTW) and topic-word weights (TWW) using the Topic### naming convention.
Added get_representative_candidates() to extract dominant-topic candidates and band them within topic using quantile or deterministic rank-based fallback rules.
CHANGES
get_top_terms() and plot_dtw() now route through the standardized DTW/TWW extractor layer instead of backend-specific logic.
Text2vec topic modeling is routed through fit_topic_model() using engine = "text2vec", model = "lda".
Package documentation now uses DTW/TWW terminology following Lewis and Grossetti (2022) and documents the returned nlp_topic_fit S3 wrapper.
NLPstudio 0.5.1 (2026-04-22)
NOTES
Added guarded copy-paste examples for the remaining exported, supported functions that previously lacked them. The new examples are written with @examplesIf interactive() so they document intended usage without being executed during package checks.
The examples release intentionally excludes deprecated set_ff_industries() and does not revisit APIs removed in v0.5.0.
NLPstudio 0.5.0 (2026-04-21)
BREAKING CHANGES
get_json_files() has been removed. Users should now discover JSON inputs directly with list.files(..., pattern = "\\.json$", recursive = TRUE, full.names = TRUE) and pass the resulting character vector to from_json_to_df().
get_sec_master_files() has been removed. SEC master-file ingestion is now considered outside the current NLPstudio scope and should be handled upstream before the data enters the package workflow.
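The list.files() replacement for get_json_files() can be tried on a throwaway directory. The directory layout and file names below are made up for illustration, and the commented-out from_json_to_df() call assumes the file vector is passed as the first argument, which should be checked against the function's documentation.

```r
# Build a temporary directory tree with JSON and non-JSON files.
root <- file.path(tempdir(), "json-demo")
dir.create(file.path(root, "nested"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "a.json"))
file.create(file.path(root, "nested", "b.json"))
file.create(file.path(root, "notes.txt"))

# Recursive discovery of JSON inputs, returning full paths.
json_files <- list.files(
  root,
  pattern = "\\.json$",
  recursive = TRUE,
  full.names = TRUE
)

# The character vector is then handed to the package (assumed call shape;
# requires NLPstudio and real JSON content):
# df <- from_json_to_df(json_files)
```

The pattern is anchored with $ so that files such as data.json.bak are not picked up.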
DEPRECATIONS
- set_ff_industries() is now soft-deprecated. The function remains exported and functional in v0.5.0, but it emits a deprecation warning and is planned for removal in a future release. Fama-French industry mapping is now treated as an upstream preprocessing step rather than part of the core package API.
NLPstudio 0.4.1 (2026-04-21)
BREAKING CHANGES
- library(NLPstudio) no longer attaches quanteda, quanteda.textstats, data.table, text2vec, or stringr to the search path. Those packages remain in Imports and are fully available inside the package, but users who relied on the implicit attachment for their own code will need to add explicit library() calls. A startup message now states the version and lists the required packages.

  Why this changed. The previous behaviour followed the meta-package pattern popularised by the tidyverse: loading one package silently attaches several others. This is convenient at the console but has meaningful costs when NLPstudio is used as a library dependency rather than an interactive toolkit:

  - Search-path pollution. Every attached package adds a frame to the search path. Name collisions become more likely as the path grows. For instance, data.table::between() and dplyr::between() resolve differently depending on attachment order, producing bugs that are hard to trace.
  - Opacity for downstream packages. A package that Imports NLPstudio unintentionally acquires five additional namespaces on the search path, which can mask functions in its own dependencies without any explicit declaration in its DESCRIPTION.
  - Redundancy. Since v0.3.3 every call inside NLPstudio uses fully qualified pkg::function() notation. The package does not need any of these namespaces attached in order to work; it only needs them loaded, which Imports already guarantees.

  Users who want the packages attached for interactive work can add library(quanteda); library(data.table) etc. to their own scripts or .Rprofile. Nothing changes for code that already calls those packages explicitly.
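The distinction between a loaded namespace and an attached package can be seen in base R without NLPstudio. The sketch below uses the tools package only as a stand-in because it ships with R; the variable names are illustrative.

```r
# Load a namespace without attaching it.
loadNamespace("tools")

# Loaded: the namespace is available to qualified pkg::fun() calls.
is_loaded <- isNamespaceLoaded("tools")

# Attached: the package appears on the search path used to resolve bare calls.
is_attached <- "package:tools" %in% search()

# A qualified call works whether or not the package is attached.
ext <- tools::file_ext("filing.json")
```

Code that relies on bare function names needs the attached state, which is why user scripts may now need their own library() calls.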
NEW FEATURES
- define_corpus() gains a default S3 method that produces an informative error when the input is not a data.table, replacing the opaque “no applicable method” dispatch failure.
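The pattern of a default S3 method that fails with a clear message can be sketched as follows. The generic build_corpus() is a hypothetical stand-in for define_corpus(), and a plain data.frame method is used instead of data.table so the example runs with base R alone.

```r
# Hypothetical generic standing in for define_corpus(); not the package's code.
build_corpus <- function(x, ...) UseMethod("build_corpus")

# Method for supported input (placeholder for the real work).
build_corpus.data.frame <- function(x, ...) {
  nrow(x)
}

# Default method: reached for any unsupported class, so the caller sees a
# targeted message instead of a generic dispatch failure.
build_corpus.default <- function(x, ...) {
  stop(
    "`x` must be a data.frame, not an object of class <", class(x)[1], ">.",
    call. = FALSE
  )
}

ok  <- build_corpus(data.frame(doc_id = c("a", "b")))
msg <- tryCatch(
  build_corpus("not a table"),
  error = function(e) conditionMessage(e)
)
```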
BUG FIXES
GitHub Actions R-CMD-check now passes reliably across the supported CI environments. Internal PSOCK execution now falls back to sequential processing when a worker socket cannot be created, which avoids environment-specific failures without changing the public API.
Optional helper packages used only inside specific functions are no longer installed during CI. In particular, pluralize and farr have been removed from Suggests, while singularize_tokens() and set_ff_industries() continue to emit explicit runtime errors when those packages are not installed by the user.
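A sequential fallback for PSOCK execution can be sketched with the parallel package as shown here. This is an illustration under stated assumptions, not NLPstudio's internal implementation: run_with_fallback() is a hypothetical helper, and the real code may detect socket failures differently.

```r
# Hypothetical helper: try to start a PSOCK cluster and fall back to
# lapply() if the cluster cannot be created.
run_with_fallback <- function(x, fun, ncores = 2L) {
  # Sequential fast path: no cluster is created at all.
  if (ncores < 2L) {
    return(lapply(x, fun))
  }
  cl <- tryCatch(
    parallel::makePSOCKcluster(ncores),
    error = function(e) NULL
  )
  # Fallback: worker sockets could not be opened in this environment.
  if (is.null(cl)) {
    return(lapply(x, fun))
  }
  on.exit(parallel::stopCluster(cl), add = TRUE)
  parallel::parLapply(cl, x, fun)
}

square  <- function(v) v^2
res_seq <- run_with_fallback(1:4, square, ncores = 1L)
res_par <- run_with_fallback(1:4, square, ncores = 2L)
```

Because both branches return a list in input order, callers get the same result whether or not the cluster started.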
NOTES
Golden tests added for all parallel functions: tokenize_corpus(), calculate_readability(), summarize_corpus(), and reshape_corpus() now each include a test asserting that ncores = 2 produces numerically identical output to ncores = 1. The class of silent parallelization bug that affected calculate_similarity() in v0.2.x would be caught immediately across any of these functions.
Contract tests added for define_corpus() (missing columns individually and in combination, non-data.table input, duplicate doc-ID warning, no temp-column leakage into the input data.table) and warp_lda() (argument routing via positive contracts: valid fit_control args, valid lda_control args, topic count k not overridable via lda_control, return_theta/return_phi flags).
Test count: 96 (up from 66 in v0.3.x).
The package now includes a standard GitHub Actions R-CMD-check workflow and matching README badge.
Roxygen comments were normalized toward Markdown-style notation and the generated documentation was refreshed. Mathematical notation remains in Rd form where appropriate (for example \eqn{}).
NLPstudio 0.3.3 (2026-04-18)
NOTES
Every external function call is now fully namespace-qualified (pkg::function()) throughout all source files. No bare unqualified calls remain for any imported package. This makes dependency resolution unambiguous and removes the need for @importFrom roxygen tags.
All @importFrom tags have been removed from every .R file. The only whole-package imports that remain are @import data.table (required for the := and .() special syntax) and @import ggplot2 (required for + operator dispatch on ggplot objects). The generated NAMESPACE is correspondingly minimal.
parallel has been removed from Imports in DESCRIPTION. parallel is a base R package that ships with every R installation; declaring it in Imports alongside R (>= 4.3) was redundant.
NLPstudio 0.3.2 (2026-04-17)
BUG FIXES
calculate_similarity() / calculate_distance(): quanteda_options("threads") returns a scalar, not a named list; accessing it as $threads raised an error on every call.
warp_lda(): constructor args and fitting args were both routed through ..., causing “unused argument” errors. Replaced with lda_control and fit_control named lists.
define_corpus(): item was used without being validated, causing cryptic downstream errors when the column was absent.
calculate_readability(): bare is.corpus() / corpus() calls relied on search-path attachment not guaranteed inside a package namespace.
from_json_to_df(): setcolorder(..., after = "filing_type") would error when filing_type was absent.
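The scalar-versus-list failure in the first item can be reproduced in base R without quanteda. The value below is a stand-in for a scalar option such as a thread count; the variable names are illustrative only.

```r
# Stand-in for a scalar option value (for example a thread count).
threads <- 4L

# `$` is not defined for atomic vectors, so this access raises an error.
msg <- tryCatch(
  threads$threads,
  error = function(e) conditionMessage(e)
)

# Using the scalar directly is the working form.
n_threads <- as.integer(threads)
```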
NEW FEATURES
- from_json_to_df(): max_chunk_size promoted from a hidden ... argument to a proper named parameter.
NOTES
- Dead code removed: %||% operator and is.textstat_simil_symm() from R/utils.R.
- Stale globalVariables("doc_id") removed from R/tokenize_corpus.R.
- cli_h2() and cli_alert_success() added to sequential paths of tokenize_corpus(), summarize_corpus(), lookup_tokens(), reshape_corpus().
NLPstudio 0.3.1 (2026-04-16)
BUG FIXES
- summarize_corpus(): sequential path returned column document while parallel path returned doc_id. Both paths now rename consistently.
NLPstudio 0.3.0 (2026-04-15)
NEW FEATURES
Unified parallel backend. .run_parallel() and .validate_parallel_args() (in R/utils.R) encapsulate all PSOCK/FORK branching, eliminating ~120 lines of duplicated boilerplate across every parallel function.
calculate_similarity() / calculate_distance() rewritten. The previous row-split approach produced a block-diagonal result (cross-chunk pairs were never evaluated). Replaced with quanteda’s built-in OpenMP threading via quanteda_options(threads = ncores).
Sequential fast paths added to all parallel functions; cluster creation is bypassed entirely when ncores < 2.
Testing infrastructure added (tests/testthat/, 3rd edition, 66 tests).
warp_lda() (snake_case) introduced as canonical name; warpLDA() retained as a deprecated alias.
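Why a row-split produces a block-diagonal similarity matrix can be shown with a small base R example. This is an illustration of the failure mode only, not the package's previous code; cosine_sim() and the toy matrix are assumptions made for the sketch.

```r
# Cosine similarity between all rows of a matrix.
cosine_sim <- function(m) {
  unit <- m / sqrt(rowSums(m^2))  # scale each row to unit length
  unit %*% t(unit)
}

set.seed(1)
m <- matrix(runif(4 * 3), nrow = 4)

# Correct: every pair of rows is compared.
full <- cosine_sim(m)

# Flawed row-split: each chunk is compared only with itself, so the
# cross-chunk entries are never computed and stay NA here.
chunks <- list(1:2, 3:4)
split_result <- matrix(NA_real_, nrow = 4, ncol = 4)
for (idx in chunks) {
  split_result[idx, idx] <- cosine_sim(m[idx, , drop = FALSE])
}
```

The diagonal blocks of split_result agree with full, which is why the defect is easy to miss without a test that compares against the single-core result.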
BUG FIXES
- calculate_similarity() / calculate_distance(): temp_matrix undefined when y provided.
- get_sec_master_files(): uniqueN() called on a list instead of the bound data.table.
- parse_corpus(): on.exit(spacy_finalize) registered too late; moved to immediately after acquiring the function reference.
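The on.exit() ordering issue in the last item can be sketched in base R. The functions below are hypothetical stand-ins rather than the parse_corpus() source: finalize() plays the role of the cleanup function, and the stop() call simulates a failure during parsing.

```r
cleaned_up <- FALSE

# Hypothetical cleanup function standing in for the real finalizer.
finalize <- function() cleaned_up <<- TRUE

parse_with_cleanup <- function() {
  # Register the cleanup as soon as the function reference is available,
  # so that any later failure still triggers it.
  on.exit(finalize(), add = TRUE)
  stop("parsing failed")
}

res <- tryCatch(parse_with_cleanup(), error = function(e) "error")
```

If on.exit() were registered after the failing step instead, the handler would never be installed and the cleanup would be skipped.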
NLPstudio 0.1.5 (2025-09-30)
NEW FEATURES
- from_json_to_df() refactored with internal helpers; JSON parsing switched from jsonlite to RcppSimdJson::fload().
- Dynamic PSOCK scheduling via clusterApplyLB() across all four corpus-parallel functions.
NLPstudio 0.1.0 (2025-07-31)
NEW FEATURES
- get_top_terms(): extracts top-n terms from φ in long or wide format.
- plot_top_terms(): faceted bar chart of per-topic top terms.
NLPstudio 0.0.7 (2025-07-29)
NEW FEATURES
- warpLDA(): WarpLDA topic model via text2vec; returns θ, φ, and the model object.
- plot_dtw(): faceted histogram of document-topic weight distributions.
NLPstudio 0.0.4 (2024-04-18)
NEW FEATURES
- singularize_tokens(): parallel plural-to-singular token conversion via pluralize.
- Package hex logo added.
NLPstudio 0.0.1 (2024-04-08)
NEW FEATURES
First public release. Core functions: from_json_to_df(), define_corpus(), tokenize_corpus(), reshape_corpus(), lookup_tokens(), parse_corpus(), calculate_readability(), calculate_similarity(), calculate_distance(), set_ff_industries(), get_json_files() (deprecated v0.1.3). Bundled financial text dictionaries.
