NLPstudio is designed for workflows where text starts outside R: scraped filings, structured JSON records, large corpora, and document metadata that needs to stay aligned with the text. The topic-model vignette starts after the modeling input is already prepared. This vignette covers the earlier part of the workflow: getting filing text into a reliable corpus and producing common document-level features.

The examples use five bundled 10-K JSON filings. They are small enough to ship with the package but realistic enough to show the intended data path. Each JSON file contains filing metadata and top-level item_* fields, such as item_1, item_1A, and item_7.

The JSON import chunks require the optional RcppSimdJson package. If it is not installed, the vignette still knits and shows the code, but the JSON-backed chunks are not evaluated.

Locate the Example JSON Files

The example files live in inst/extdata/json/ in the source package. Once the package is installed, they are available through system.file().

library(NLPstudio)
#> ── NLPstudio 1.0.1 ────────────────── https://github.com/contefranz/NLPstudio ──
#> Core imports: cli, data.table, ggplot2, Matrix, methods, quanteda,
#>               quanteda.textstats
#> Optional backends: text2vec, topicmodels, seededlda, stm,
#>                    topicmodels.etm, torch, spacyr, tidytext,
#>                    RcppSimdJson, uwot
#> Use library(<pkg>) to attach any of these to your session.
#> Optional packages are only needed for the functions that use them.
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following object is masked from 'package:base':
#> 
#>     %notin%
library(quanteda)
#> Package version: 4.4
#> Unicode version: 15.1
#> ICU version: 74.2
#> Parallel computing: disabled
#> See https://quanteda.io for tutorials and examples.

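# Installed package: resolve the bundled example directory.
json_dir <- system.file("extdata", "json", package = "NLPstudio")
# system.file() returns "" when nothing is found; fall back to the source tree.
if (!nzchar(json_dir)) {
  json_dir <- file.path("..", "inst", "extdata", "json")
}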

json_files <- list.files(
  json_dir,
  pattern = "_10K_.*\\.json$",
  full.names = TRUE
)

basename(json_files)
#> [1] "1750_10K_2025_0001410578-25-001475.json"
#> [2] "1800_10K_2024_0001628280-25-007110.json"
#> [3] "1961_10K_2024_0001264931-25-000008.json"
#> [4] "2098_10K_2024_0000950170-25-034843.json"
#> [5] "2186_10K_2024_0001437749-25-009464.json"
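
The fallback in the chunk above relies on system.file() returning an empty string when it finds nothing. The following base R sketch shows that behaviour and checks the filename pattern on a few sample names. The package name and the two non-matching filenames are invented for illustration.

```r
# "notARealPackage" is a hypothetical name, assumed not to be installed.
missing_dir <- system.file("extdata", "json", package = "notARealPackage")
nzchar(missing_dir)  # FALSE, so the if () branch would set the fallback path

# The same regular expression passed to list.files(), applied to sample names.
candidates <- c(
  "1750_10K_2025_0001410578-25-001475.json",  # taken from the listing above
  "1750_10Q_2025_example.json",               # hypothetical: different form type
  "1750_10K_2025_example.json.bak"            # hypothetical: different extension
)
keep <- grepl("_10K_.*\\.json$", candidates)
candidates[keep]
```

Only the first name passes, because the pattern requires both the _10K_ marker and a trailing .json extension.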

Flatten JSON to a Filing-Item Table

from_json_to_df() reads the JSON files, identifies the 10-K item fields, and returns one row per filing item. Metadata such as CIK, company name, filing type, filing date, period of report, SIC, and filename stay attached to each item row.

filings <- from_json_to_df(
  json_files,
  what = "10-K",
  ncores = 1,
  max_chunk_size = 1
)
#> 
#> ── Flattening JSON files ──
#> 
#>  Reading JSON files with RcppSimdJson
#>  Processing chunk 1/5 with 1 files
#>  Processing chunk 2/5 with 1 files
#>  Processing chunk 3/5 with 1 files
#>  Processing chunk 4/5 with 1 files
#>  Processing chunk 5/5 with 1 files
#>  Reshaping JSON data in parallel with 1 cores
#>  Conversion has been successful

nrow(filings)
#> [1] 96
filings[, .N, by = .(company, filing_type, fyear)]
#>                 company filing_type fyear     N
#>                  <char>      <char> <int> <int>
#> 1:             AAR CORP        10-K  2025    20
#> 2:  ABBOTT LABORATORIES        10-K  2024    20
#> 3:           WORLDS INC        10-K  2024    16
#> 4:     ACME UNITED CORP        10-K  2024    20
#> 5: BK Technologies Corp        10-K  2024    20
filings[, .N, by = item][order(item)]
#>        item     N
#>      <char> <int>
#>  1:  item_1     5
#>  2: item_10     5
#>  3: item_11     5
#>  4: item_12     5
#>  5: item_13     5
#>  6: item_14     5
#>  7: item_15     5
#>  8: item_1A     4
#>  9: item_1B     4
#> 10:  item_2     5
#> 11:  item_3     5
#> 12:  item_4     5
#> 13:  item_5     5
#> 14:  item_6     4
#> 15:  item_7     5
#> 16: item_7A     4
#> 17:  item_8     5
#> 18:  item_9     5
#> 19: item_9A     5
#> 20: item_9B     5
#>        item     N
#>      <char> <int>
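
To illustrate the shape of the result, the sketch below flattens one parsed filing into one row per non-empty item using base R only. It is not the implementation of from_json_to_df(); the filing content and field values are hypothetical, and only the item_* naming convention comes from the bundled files.

```r
# Hypothetical parsed filing: a named list, as a JSON parser might return it.
filing <- list(
  cik      = "0000",
  company  = "EXAMPLE CO",
  filename = "example.htm",
  item_1   = "We make widgets.",
  item_1A  = "",               # empty section, dropped below
  item_7   = "Sales rose."
)

# Split the fields into metadata and item text by name.
is_item <- grepl("^item_", names(filing))
meta    <- filing[!is_item]
items   <- unlist(filing[is_item])
items   <- items[nzchar(items)]  # keep non-empty sections only

# One row per item; the length-one metadata values are recycled onto each row.
long <- data.frame(
  meta,
  item = names(items),
  text = unname(items),
  row.names = NULL,
  stringsAsFactors = FALSE
)
long
```

Keeping the metadata on every row is what lets later steps filter or group by company, year, or item without a join.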

The full imported table contains all non-empty 10-K sections from the five filings, 96 rows in total. For the rest of the vignette, we focus on three sections that are easy to interpret: business description (item_1), risk factors (item_1A), and management discussion (item_7). The WORLDS INC filing has no item_1A row, so the subset has 14 rows rather than 15.

analysis_dt <- filings[
  item %in% c("item_1", "item_1A", "item_7"),
  .(cik, company, filing_type, filing_date, period_of_report,
    fyear, sic, filename, item, text)
]

analysis_dt[, text_chars := nchar(text)]
analysis_dt[, .(company, item, text_chars)][order(company, item)]
#>                  company    item text_chars
#>                   <char>  <char>      <int>
#>  1:             AAR CORP  item_1      35022
#>  2:             AAR CORP item_1A      69287
#>  3:             AAR CORP  item_7      46144
#>  4:  ABBOTT LABORATORIES  item_1      32937
#>  5:  ABBOTT LABORATORIES item_1A      33699
#>  6:  ABBOTT LABORATORIES  item_7      69225
#>  7:     ACME UNITED CORP  item_1      14889
#>  8:     ACME UNITED CORP item_1A      46771
#>  9:     ACME UNITED CORP  item_7      16462
#> 10: BK Technologies Corp  item_1      30150
#> 11: BK Technologies Corp item_1A      61711
#> 12: BK Technologies Corp  item_7      41450
#> 13:           WORLDS INC  item_1       8226
#> 14:           WORLDS INC  item_7      64158

Build a Quanteda Corpus

define_corpus() turns the filing-item table into a quanteda corpus. It uses the filing filename and item name to create stable document identifiers, while the remaining columns become document variables.

corp <- define_corpus(data.table::copy(analysis_dt))
#> 
#> ── Building corpus from data.table ──
#> 
#>  Corpus built with 14 documents

summary(corp, n = 6)
#> Corpus consisting of 14 documents, showing 6 documents:
#> 
#>                                        Text Types Tokens Sentences  cik
#>   1750_10K_2025_0001410578-25-001475_item_1  1536   5904       223 1750
#>  1750_10K_2025_0001410578-25-001475_item_1A  1871  11621       317 1750
#>   1750_10K_2025_0001410578-25-001475_item_7  1550   7815       242 1750
#>   1800_10K_2024_0001628280-25-007110_item_1  1511   5402       139 1800
#>  1800_10K_2024_0001628280-25-007110_item_1A  1145   5430       161 1800
#>   1800_10K_2024_0001628280-25-007110_item_7  2066  11774       422 1800
#>              company filing_type filing_date period_of_report fyear  sic
#>             AAR CORP        10-K  2025-07-22       2025-05-31  2025 3720
#>             AAR CORP        10-K  2025-07-22       2025-05-31  2025 3720
#>             AAR CORP        10-K  2025-07-22       2025-05-31  2025 3720
#>  ABBOTT LABORATORIES        10-K  2025-02-21       2024-12-31  2024 2834
#>  ABBOTT LABORATORIES        10-K  2025-02-21       2024-12-31  2024 2834
#>  ABBOTT LABORATORIES        10-K  2025-02-21       2024-12-31  2024 2834
#>                                filename    item text_chars
#>  1750_10K_2025_0001410578-25-001475.htm  item_1      35022
#>  1750_10K_2025_0001410578-25-001475.htm item_1A      69287
#>  1750_10K_2025_0001410578-25-001475.htm  item_7      46144
#>  1800_10K_2024_0001628280-25-007110.htm  item_1      32937
#>  1800_10K_2024_0001628280-25-007110.htm item_1A      33699
#>  1800_10K_2024_0001628280-25-007110.htm  item_7      69225
#>                           filename2
#>  1750_10K_2025_0001410578-25-001475
#>  1750_10K_2025_0001410578-25-001475
#>  1750_10K_2025_0001410578-25-001475
#>  1800_10K_2024_0001628280-25-007110
#>  1800_10K_2024_0001628280-25-007110
#>  1800_10K_2024_0001628280-25-007110
quanteda::docvars(corp)[1:6, c("company", "item", "fyear")]
#>               company    item fyear
#> 1            AAR CORP  item_1  2025
#> 2            AAR CORP item_1A  2025
#> 3            AAR CORP  item_7  2025
#> 4 ABBOTT LABORATORIES  item_1  2024
#> 5 ABBOTT LABORATORIES item_1A  2024
#> 6 ABBOTT LABORATORIES  item_7  2024

At this point, each filing item is a separate document. This is usually the most useful unit for downstream analysis because risk factors, business descriptions, and management discussion sections often answer different research questions.
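
The identifiers printed above look like the extension-free filename joined to the item name. The base R sketch below reproduces that pattern for two rows; it is an assumption drawn from the printed output, not the code inside define_corpus().

```r
# Values copied from the docvars shown above.
filename <- c(
  "1750_10K_2025_0001410578-25-001475.htm",
  "1800_10K_2024_0001628280-25-007110.htm"
)
item <- c("item_1A", "item_7")

# Assumption: identifier = filename without its extension + "_" + item name.
stem   <- tools::file_path_sans_ext(filename)
doc_id <- paste(stem, item, sep = "_")
doc_id

# 0 means no identifier is duplicated.
anyDuplicated(doc_id)
```

Because the identifier depends only on the filename and the item, re-importing the same filings yields the same document names.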

Tokenize and Build a DFM

tokenize_corpus() wraps quanteda::tokens() and keeps document order stable. For small examples, a single core is enough. On larger corpora, the same function can split the corpus into chunks and process them through the package’s parallel helper.

toks <- tokenize_corpus(
  corp,
  ncores = 1,
  remove_punct = TRUE,
  remove_numbers = TRUE,
  remove_symbols = TRUE
)
#> 
#> ── Tokenizing corpus ──
#> 
#>  quanteda::tokens() has been called with user parameters
#> → remove_punct = TRUE
#> → remove_numbers = TRUE
#> → remove_symbols = TRUE
#>  Tokenizing sequentially
#>  Corpus successfully tokenized

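# Normalize case and drop English stopwords before counting.
toks <- quanteda::tokens_tolower(toks)
toks <- quanteda::tokens_remove(toks, pattern = quanteda::stopwords("en"))

# Build the document-feature matrix, then keep features that occur
# at least 5 times across all documents.
dfmat <- quanteda::dfm(toks)
dfmat <- quanteda::dfm_trim(dfmat, min_termfreq = 5)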

dfmat
#> Document-feature matrix of: 14 documents, 1,847 features (58.41% sparse) and 11 docvars.
#>                                             features
#> docs                                         item business general aar corp
#>   1750_10K_2025_0001410578-25-001475_item_1     3       28       8  27    2
#>   1750_10K_2025_0001410578-25-001475_item_1A    1       94       5   0    0
#>   1750_10K_2025_0001410578-25-001475_item_7     3       30       8   0    0
#>   1800_10K_2024_0001628280-25-007110_item_1     1       23       6   0    0
#>   1800_10K_2024_0001628280-25-007110_item_1A    9       36       2   0    0
#>   1800_10K_2024_0001628280-25-007110_item_7     5       27       5   0    0
#>                                             features
#> docs                                         subsidiaries herein company us
#>   1750_10K_2025_0001410578-25-001475_item_1             1      1       7 13
#>   1750_10K_2025_0001410578-25-001475_item_1A            2      0       6 50
#>   1750_10K_2025_0001410578-25-001475_item_7             1      0       5  9
#>   1800_10K_2024_0001628280-25-007110_item_1             1      0       1  0
#>   1800_10K_2024_0001628280-25-007110_item_1A            0      0       1  0
#>   1800_10K_2024_0001628280-25-007110_item_7             1      0       6  0
#>                                             features
#> docs                                         unless
#>   1750_10K_2025_0001410578-25-001475_item_1       1
#>   1750_10K_2025_0001410578-25-001475_item_1A      1
#>   1750_10K_2025_0001410578-25-001475_item_7       1
#>   1800_10K_2024_0001628280-25-007110_item_1       0
#>   1800_10K_2024_0001628280-25-007110_item_1A      0
#>   1800_10K_2024_0001628280-25-007110_item_7       1
#> [ reached max_ndoc ... 8 more documents, reached max_nfeat ... 1,837 more features ]
quanteda::topfeatures(dfmat, 15)
#>   products        may   business  financial      sales    company    million 
#>        514        490        453        397        358        304        301 
#>  including operations     abbott    results        u.s  customers    product 
#>        284        276        237        225        221        203        198 
#>   services 
#>        192
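
To make the trimming step and the sparsity figure in the printed dfm concrete, here is a small base R sketch with made-up token vectors. It does not show how quanteda stores or trims a dfm; it only mirrors the arithmetic, with a threshold of 3 instead of 5 because the toy data are tiny.

```r
# Hypothetical tokenized documents.
docs <- list(
  d1 = c("risk", "risk", "sales", "product"),
  d2 = c("sales", "sales", "product", "growth"),
  d3 = c("risk", "sales")
)

# Dense document-by-feature count matrix.
vocab  <- sort(unique(unlist(docs)))
counts <- t(vapply(
  docs,
  function(x) as.integer(table(factor(x, levels = vocab))),
  integer(length(vocab))
))
colnames(counts) <- vocab

# Sparsity is the share of zero cells.
sparsity <- mean(counts == 0)

# Keep features whose total frequency across documents reaches the threshold.
min_termfreq <- 3
trimmed <- counts[, colSums(counts) >= min_termfreq, drop = FALSE]
colnames(trimmed)
```

Here "growth" (1 occurrence) and "product" (2 occurrences) fall below the threshold, so only "risk" and "sales" remain.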

Singularization is optional, but it is often helpful when the analysis should count plural and singular nouns together. singularize_tokens() applies NLPstudio’s internal English singularization rules to the token vocabulary and then replaces only the affected token types, so the operation remains fast even when the corpus is much larger than this example.

singular_toks <- singularize_tokens(toks, ncores = 1, min_char = 4)
#> 
#> ── Singularizing tokens ──
#> 
#>  Building DFM and extracting vocabulary
#>  Removing tokens containing any number
#>  Removing tokens shorter than 4 characters
#>  Processing sequentially
#>  Replacing plural tokens with singulars
#>  Singularization complete
singular_dfm <- quanteda::dfm(singular_toks)
quanteda::topfeatures(singular_dfm, 15)
#>   product  business      sale financial    result   company   million  customer 
#>       712       495       397       397       371       345       302       300 
#> including operation      cost   service      year    abbott condition 
#>       284       283       264       261       246       237       221
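
The reason a vocabulary-level approach stays cheap is that the rule runs once per token type, not once per token. The sketch below shows the idea in base R with a deliberately naive rule (strip a final "s" unless the word ends in "ss"). That rule is a stand-in and is not NLPstudio's singularization logic.

```r
# Hypothetical token stream with repeated types.
tokens <- c("products", "sales", "business", "product", "costs", "gas", "products")

# Work on the unique types only.
types    <- unique(tokens)
min_char <- 4

# Toy rule: long enough, ends in "s", but not in "ss".
candidate <- nchar(types) >= min_char & grepl("[^s]s$", types)
singular  <- types
singular[candidate] <- sub("s$", "", types[candidate])

# Map every token back through its type.
out <- singular[match(tokens, types)]
out
```

"business" is left alone by the "ss" guard and "gas" by the length threshold, which plays the same role as min_char = 4 in the call above.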

Apply a Dictionary

NLPstudio ships domain dictionaries as quanteda dictionary objects. Dictionary lookup is useful when the construct is known in advance and the goal is to measure document-level exposure to a curated vocabulary.

fls_toks <- lookup_tokens(
  toks,
  dictionary = data_dictionary_BozanicRoulstoneVanBuskirk_FLS,
  ncores = 1,
  exclusive = FALSE
)
#>  quanteda::tokens_lookup() has been called with user parameters
#>  exclusive = FALSE
#> 
#> ── Lookup tokens ──
#> 
#>  Lookup sequentially
#>  Lookup complete

fls_dfm <- quanteda::dfm(fls_toks)
quanteda::topfeatures(fls_dfm, 10)
#> fls_bozanicroulstonevanbuskirk                       products 
#>                            889                            514 
#>                       business                      financial 
#>                            453                            397 
#>                          sales                        company 
#>                            358                            304 
#>                        million                      including 
#>                            301                            284 
#>                     operations                         abbott 
#>                            276                            237
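
With exclusive = FALSE, matched tokens are replaced by the dictionary key while unmatched tokens stay in place, which is why the key appears next to ordinary words in the output above. The base R sketch below imitates that with exact matching only; the term list is a hypothetical stand-in, not the contents of the bundled dictionary.

```r
# Hypothetical forward-looking terms and a short tokenized sentence.
fls_terms <- c("expect", "anticipate", "will")
tokens    <- c("we", "expect", "sales", "will", "grow")
key       <- "fls"

# Replace matches with the key and keep everything else.
replaced <- ifelse(tokens %in% fls_terms, key, tokens)
counts   <- table(replaced)
counts

# Share of tokens that hit the dictionary: a simple exposure measure.
exposure <- mean(tokens %in% fls_terms)
exposure
```

Dividing hits by document length, as in the last step, keeps long sections from looking more forward-looking simply because they contain more words.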

The same pattern works with the other bundled dictionaries, including firm complexity, corporate social responsibility, sustainable development goals, and the Feng Li forward-looking statement dictionary.

Reshape and Summarize the Corpus

reshape_corpus() changes the document unit. For example, item-level documents can be split into sentence-level documents before computing readability or other document-level statistics.

sentence_corp <- reshape_corpus(corp, to = "sentences", ncores = 1)
#> 
#> ── Reshaping corpus ──
#> 
#>  Reshaping sequentially
#>  Corpus successfully reshaped

quanteda::ndoc(corp)
#> [1] 14
quanteda::ndoc(sentence_corp)
#> [1] 3101
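
As a rough picture of what changing the document unit means, the sketch below splits one made-up text into sentences with a simple regular expression in base R. Real sentence segmentation handles abbreviations and other edge cases that this pattern ignores, so treat it as an illustration only.

```r
# Hypothetical section text.
text <- "Sales rose in 2024. Costs fell! Is the outlook stable? Management thinks so."

# Split on whitespace that follows sentence-ending punctuation.
sentences <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
sentences
length(sentences)
```

One item-level document becomes four sentence-level documents here, in the same way that 14 items became 3101 sentences above.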

summarize_corpus() returns a compact table of document-level corpus statistics from quanteda.textstats.

corpus_summary <- summarize_corpus(corp, ncores = 1)
#> 
#> ── Summarizing corpus ──
#> 
#>  Summarizing sequentially
#>  Corpus summarization complete

corpus_summary[1:6]
#>                                        doc_id chars sents tokens types puncts
#>                                        <char> <int> <int>  <int> <int>  <int>
#> 1:  1750_10K_2025_0001410578-25-001475_item_1 35022   223   5904  1399    755
#> 2: 1750_10K_2025_0001410578-25-001475_item_1A 69287   317  11621  1749   1268
#> 3:  1750_10K_2025_0001410578-25-001475_item_7 46144   242   7815  1391    894
#> 4:  1800_10K_2024_0001628280-25-007110_item_1 32937   139   5402  1398    779
#> 5: 1800_10K_2024_0001628280-25-007110_item_1A 33699   161   5430  1066    698
#> 6:  1800_10K_2024_0001628280-25-007110_item_7 69225   422  11774  1859   1253
#>    numbers symbols  urls  tags emojis
#>      <int>   <int> <int> <int>  <int>
#> 1:     121      16     1     0      0
#> 2:      42      52     0     0      0
#> 3:     244      83     0     0      0
#> 4:      34     149     3     0      0
#> 5:      17       1     0     0      0
#> 6:     655     187     0     0      0

Readability

Readability scores help compare filing sections that differ substantially in length and complexity. calculate_readability() takes the same corpus as input and returns a data.table with one row per document.

readability <- calculate_readability(
  corp,
  ncores = 1,
  measure = c("Flesch", "Flesch.Kincaid")
)
#> 
#> ── Calculating readability ──
#> 
#>  quanteda.textstats::textstat_readability() has been called with the following parameters
#>  measure = Flesch, Flesch.Kincaid
#>  Computing readability sequentially
#>  Done

readability[1:6]
#>                                        doc_id    Flesch Flesch.Kincaid
#>                                        <char>     <num>          <num>
#> 1:  1750_10K_2025_0001410578-25-001475_item_1 21.686834       16.03184
#> 2: 1750_10K_2025_0001410578-25-001475_item_1A 10.371200       19.96307
#> 3:  1750_10K_2025_0001410578-25-001475_item_7 19.445644       17.74530
#> 4:  1800_10K_2024_0001628280-25-007110_item_1  2.088124       21.34135
#> 5: 1800_10K_2024_0001628280-25-007110_item_1A  6.552317       19.70566
#> 6:  1800_10K_2024_0001628280-25-007110_item_7 26.486946       15.84038
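
Both measures are functions of average sentence length and average syllables per word. The sketch below writes out the standard formulas in base R and applies them to hypothetical counts. It will not reproduce the values above, which depend on how words, sentences, and syllables are counted.

```r
# Flesch Reading Ease: higher means easier to read.
flesch <- function(words, sentences, syllables) {
  206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
}

# Flesch-Kincaid grade level: higher means harder to read.
flesch_kincaid <- function(words, sentences, syllables) {
  0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
}

# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
ease  <- flesch(words = 100, sentences = 5, syllables = 150)
grade <- flesch_kincaid(words = 100, sentences = 5, syllables = 150)
c(Flesch = ease, Flesch.Kincaid = grade)
```

The two scales run in opposite directions, which matches the table above: the risk-factor section of the first filing has the lowest Flesch score and the highest Flesch.Kincaid grade among its three items.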

Similarity and Distance

calculate_similarity() and calculate_distance() take a quanteda dfm. Both pass ncores to quanteda’s own threaded computation instead of splitting the documents into chunks. An all-pairs matrix cannot be assembled from per-chunk results, because pairs that span two chunks would never be compared.

similarity <- calculate_similarity(
  dfmat,
  ncores = 1,
  margin = "documents",
  method = "cosine"
)
#> 
#> ── Calculating similarity ──
#> 
#>  textstat_simil() called with the following parameters
#> → margin = documents
#> → method = cosine
#>  Using 1 thread(s) via quanteda
#>  Done

distance <- calculate_distance(
  dfmat,
  ncores = 1,
  margin = "documents",
  method = "euclidean"
)
#> 
#> ── Calculating distance ──
#> 
#>  textstat_dist() called with the following parameters
#> → margin = documents
#> → method = euclidean
#>  Using 1 thread(s) via quanteda
#>  Done

round(as.matrix(similarity)[1:5, 1:5], 3)
#>                                            1750_10K_2025_0001410578-25-001475_item_1
#> 1750_10K_2025_0001410578-25-001475_item_1                                      1.000
#> 1750_10K_2025_0001410578-25-001475_item_1A                                     0.533
#> 1750_10K_2025_0001410578-25-001475_item_7                                      0.611
#> 1800_10K_2024_0001628280-25-007110_item_1                                      0.366
#> 1800_10K_2024_0001628280-25-007110_item_1A                                     0.281
#>                                            1750_10K_2025_0001410578-25-001475_item_1A
#> 1750_10K_2025_0001410578-25-001475_item_1                                       0.533
#> 1750_10K_2025_0001410578-25-001475_item_1A                                      1.000
#> 1750_10K_2025_0001410578-25-001475_item_7                                       0.510
#> 1800_10K_2024_0001628280-25-007110_item_1                                       0.404
#> 1800_10K_2024_0001628280-25-007110_item_1A                                      0.615
#>                                            1750_10K_2025_0001410578-25-001475_item_7
#> 1750_10K_2025_0001410578-25-001475_item_1                                      0.611
#> 1750_10K_2025_0001410578-25-001475_item_1A                                     0.510
#> 1750_10K_2025_0001410578-25-001475_item_7                                      1.000
#> 1800_10K_2024_0001628280-25-007110_item_1                                      0.244
#> 1800_10K_2024_0001628280-25-007110_item_1A                                     0.321
#>                                            1800_10K_2024_0001628280-25-007110_item_1
#> 1750_10K_2025_0001410578-25-001475_item_1                                      0.366
#> 1750_10K_2025_0001410578-25-001475_item_1A                                     0.404
#> 1750_10K_2025_0001410578-25-001475_item_7                                      0.244
#> 1800_10K_2024_0001628280-25-007110_item_1                                      1.000
#> 1800_10K_2024_0001628280-25-007110_item_1A                                     0.673
#>                                            1800_10K_2024_0001628280-25-007110_item_1A
#> 1750_10K_2025_0001410578-25-001475_item_1                                       0.281
#> 1750_10K_2025_0001410578-25-001475_item_1A                                      0.615
#> 1750_10K_2025_0001410578-25-001475_item_7                                       0.321
#> 1800_10K_2024_0001628280-25-007110_item_1                                       0.673
#> 1800_10K_2024_0001628280-25-007110_item_1A                                      1.000
round(as.matrix(distance)[1:5, 1:5], 3)
#>                                            1750_10K_2025_0001410578-25-001475_item_1
#> 1750_10K_2025_0001410578-25-001475_item_1                                      0.000
#> 1750_10K_2025_0001410578-25-001475_item_1A                                   268.635
#> 1750_10K_2025_0001410578-25-001475_item_7                                    179.897
#> 1800_10K_2024_0001628280-25-007110_item_1                                    177.634
#> 1800_10K_2024_0001628280-25-007110_item_1A                                   215.815
#>                                            1750_10K_2025_0001410578-25-001475_item_1A
#> 1750_10K_2025_0001410578-25-001475_item_1                                     268.635
#> 1750_10K_2025_0001410578-25-001475_item_1A                                      0.000
#> 1750_10K_2025_0001410578-25-001475_item_7                                     280.432
#> 1800_10K_2024_0001628280-25-007110_item_1                                     291.930
#> 1800_10K_2024_0001628280-25-007110_item_1A                                    249.978
#>                                            1750_10K_2025_0001410578-25-001475_item_7
#> 1750_10K_2025_0001410578-25-001475_item_1                                    179.897
#> 1750_10K_2025_0001410578-25-001475_item_1A                                   280.432
#> 1750_10K_2025_0001410578-25-001475_item_7                                      0.000
#> 1800_10K_2024_0001628280-25-007110_item_1                                    243.752
#> 1800_10K_2024_0001628280-25-007110_item_1A                                   249.530
#>                                            1800_10K_2024_0001628280-25-007110_item_1
#> 1750_10K_2025_0001410578-25-001475_item_1                                    177.634
#> 1750_10K_2025_0001410578-25-001475_item_1A                                   291.930
#> 1750_10K_2025_0001410578-25-001475_item_7                                    243.752
#> 1800_10K_2024_0001628280-25-007110_item_1                                      0.000
#> 1800_10K_2024_0001628280-25-007110_item_1A                                   150.519
#>                                            1800_10K_2024_0001628280-25-007110_item_1A
#> 1750_10K_2025_0001410578-25-001475_item_1                                     215.815
#> 1750_10K_2025_0001410578-25-001475_item_1A                                    249.978
#> 1750_10K_2025_0001410578-25-001475_item_7                                     249.530
#> 1800_10K_2024_0001628280-25-007110_item_1                                     150.519
#> 1800_10K_2024_0001628280-25-007110_item_1A                                      0.000
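
The two measures respond differently to document length, which helps when reading the matrices above. The base R sketch below computes both on a made-up count matrix and then counts how many document pairs a naive two-chunk split would cover. It only illustrates the definitions and is not the code path used by quanteda.

```r
# Hypothetical document-by-feature counts.
m <- rbind(
  d1 = c(1, 2, 0),
  d2 = c(2, 4, 0),  # same direction as d1, twice as long
  d3 = c(0, 1, 3),
  d4 = c(0, 0, 2)
)

# Cosine similarity: scale each row to unit length, then take dot products.
unit   <- m / sqrt(rowSums(m^2))
cosine <- unit %*% t(unit)

# Euclidean distance on the raw counts.
euclid <- as.matrix(dist(m, method = "euclidean"))

cosine["d1", "d2"]  # identical direction
euclid["d1", "d2"]  # still far apart because d2 is longer

# Pairs covered by the full matrix versus two independent chunks of two documents.
all_pairs   <- choose(nrow(m), 2)
chunk_pairs <- 2 * choose(2, 2)
c(all_pairs = all_pairs, chunk_pairs = chunk_pairs)
```

Cosine ignores length, so d1 and d2 are maximally similar, while Euclidean distance grows with the difference in raw counts. The chunked count shows why splitting documents does not work for all-pairs statistics: four of the six pairs are never computed.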

Export Tables

Most outputs are ordinary tabular objects or can be converted to tabular form. That makes it straightforward to save intermediate results for review, appendix tables, or downstream modeling.

export_readability <- data.table::copy(readability)
export_readability[, source := "bundled_10k_json"]

out_file <- tempfile(fileext = ".csv")
data.table::fwrite(export_readability, out_file)
out_file
#> [1] "/tmp/Rtmp6SHmqZ/file1d28269d1750.csv"
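
Before relying on an exported file, it can be worth reading it back and comparing it with the in-memory table. The sketch below does this with base R's write.csv() and read.csv() so that it runs without extra packages; the table contents are hypothetical, and the same check could be written with data.table::fwrite() and data.table::fread().

```r
# Hypothetical readability table.
readability_demo <- data.frame(
  doc_id = c("doc_a", "doc_b"),
  Flesch = c(21.5, 10.25),
  stringsAsFactors = FALSE
)
readability_demo$source <- "bundled_10k_json"

# Write to a temporary file and read it back.
out_file <- tempfile(fileext = ".csv")
write.csv(readability_demo, out_file, row.names = FALSE)
restored <- read.csv(out_file, stringsAsFactors = FALSE)

# TRUE when column names, types, and values all survive the round trip.
all.equal(readability_demo, restored)

# Clean up the temporary file.
unlink(out_file)
```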

Optional spaCy Parsing

parse_corpus() wraps spacyr for part-of-speech parsing and related annotations. It requires a working Python spaCy installation, so it is shown but not evaluated in this vignette.

spacyr::spacy_initialize(model = "en_core_web_sm")

parsed <- parse_corpus(
  corp,
  ncores = 1,
  lemma = TRUE,
  pos = TRUE
)

head(parsed)

Workflow Map

The non-topic-model API is designed to be composable:

Task                       Primary function
Flatten SEC-style JSON     from_json_to_df()
Build a quanteda corpus    define_corpus()
Tokenize text              tokenize_corpus()
Singularize tokens         singularize_tokens()
Apply dictionaries         lookup_tokens()
Reshape documents          reshape_corpus()
Summarize corpus           summarize_corpus()
Compute readability        calculate_readability()
Compute similarity         calculate_similarity()
Compute distance           calculate_distance()
Optional spaCy parsing     parse_corpus()