
Corpus Preparation and Text Analysis
From SEC-Style JSON to Quanteda Workflows
by Francesco Grossetti
Source: vignettes/corpus-workflow.Rmd
NLPstudio is designed for workflows where text starts outside R: scraped filings, structured JSON records, large corpora, and document metadata that needs to stay aligned with the text. The topic-model vignette starts after the modeling input is already prepared. This vignette covers the earlier part of the workflow: getting filing text into a reliable corpus and producing common document-level features.
The examples use five bundled 10-K JSON filings. They are small
enough to ship with the package but realistic enough to show the
intended data path. Each JSON file contains filing metadata and
top-level item_* fields, such as item_1,
item_1A, and item_7.
The JSON import chunks require the optional RcppSimdJson package. If it is not installed, the vignette still knits and shows the code, but the JSON-backed chunks are not evaluated.
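A guard of this kind is commonly written with requireNamespace(). The sketch below is an assumption about how such a check could look, not necessarily how this vignette implements it; the flag name has_json is hypothetical.

```r
# Hypothetical flag: TRUE only when the optional JSON backend is installed.
has_json <- requireNamespace("RcppSimdJson", quietly = TRUE)

if (has_json) {
  message("RcppSimdJson found: JSON-backed chunks can be evaluated.")
} else {
  message("RcppSimdJson not installed: JSON-backed chunks are shown but skipped.")
}

# In an R Markdown chunk header, the flag could then control evaluation,
# for example: {r import-json, eval = has_json}
```

Because the check uses requireNamespace() rather than library(), the optional package is never attached and a missing installation does not raise an error.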
Locate the Example JSON Files
The example files live in inst/extdata/json/ in the
source package. Once the package is installed, they are available
through system.file().
library(NLPstudio)
#> ── NLPstudio 1.0.1 ────────────────── https://github.com/contefranz/NLPstudio ──
#> Core imports: cli, data.table, ggplot2, Matrix, methods, quanteda,
#> quanteda.textstats
#> Optional backends: text2vec, topicmodels, seededlda, stm,
#> topicmodels.etm, torch, spacyr, tidytext,
#> RcppSimdJson, uwot
#> Use library(<pkg>) to attach any of these to your session.
#> Optional packages are only needed for the functions that use them.
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following object is masked from 'package:base':
#>
#> %notin%
library(quanteda)
#> Package version: 4.4
#> Unicode version: 15.1
#> ICU version: 74.2
#> Parallel computing: disabled
#> See https://quanteda.io for tutorials and examples.
json_dir <- system.file("extdata", "json", package = "NLPstudio")
if (!nzchar(json_dir)) {
  json_dir <- file.path("..", "inst", "extdata", "json")
}
json_files <- list.files(
  json_dir,
  pattern = "_10K_.*\\.json$",
  full.names = TRUE
)
basename(json_files)
#> [1] "1750_10K_2025_0001410578-25-001475.json"
#> [2] "1800_10K_2024_0001628280-25-007110.json"
#> [3] "1961_10K_2024_0001264931-25-000008.json"
#> [4] "2098_10K_2024_0000950170-25-034843.json"
#> [5] "2186_10K_2024_0001437749-25-009464.json"
Flatten JSON to a Filing-Item Table
from_json_to_df() reads the JSON files, identifies the
10-K item fields, and returns one row per filing item. Metadata such as
CIK, company name, filing type, filing date, period of report, SIC, and
filename stay attached to each item row.
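The shape of that transformation can be sketched with data.table alone. The example below is a simplified illustration on invented records, not the internals of from_json_to_df(); the object names parsed, wide, and long are hypothetical, and the records are assumed to be already parsed into R lists.

```r
library(data.table)

# Two invented filings, as they might look after JSON parsing.
parsed <- list(
  list(cik = "1", company = "Alpha Corp",
       item_1 = "We make tools.", item_1A = "Risks exist.", item_7 = ""),
  list(cik = "2", company = "Beta Inc",
       item_1 = "We sell software.", item_1A = "", item_7 = "Sales rose.")
)

# One row per filing, one column per field.
wide <- rbindlist(parsed, fill = TRUE)

# Reshape to one row per filing item; metadata columns are repeated.
item_cols <- grep("^item_", names(wide), value = TRUE)
long <- melt(
  wide,
  id.vars = c("cik", "company"),
  measure.vars = item_cols,
  variable.name = "item",
  value.name = "text",
  variable.factor = FALSE
)

# Keep only non-empty sections.
long <- long[nzchar(text)]
long
```

Keeping the metadata as identifier columns during the reshape is what lets every item row carry its filing's CIK and company name.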
filings <- from_json_to_df(
  json_files,
  what = "10-K",
  ncores = 1,
  max_chunk_size = 1
)
#>
#> ── Flattening JSON files ──
#>
#> ℹ Reading JSON files with RcppSimdJson
#> ℹ Processing chunk 1/5 with 1 files
#> ℹ Processing chunk 2/5 with 1 files
#> ℹ Processing chunk 3/5 with 1 files
#> ℹ Processing chunk 4/5 with 1 files
#> ℹ Processing chunk 5/5 with 1 files
#> ℹ Reshaping JSON data in parallel with 1 cores
#> ✔ Conversion has been successful
nrow(filings)
#> [1] 96
filings[, .N, by = .(company, filing_type, fyear)]
#> company filing_type fyear N
#> <char> <char> <int> <int>
#> 1: AAR CORP 10-K 2025 20
#> 2: ABBOTT LABORATORIES 10-K 2024 20
#> 3: WORLDS INC 10-K 2024 16
#> 4: ACME UNITED CORP 10-K 2024 20
#> 5: BK Technologies Corp 10-K 2024 20
filings[, .N, by = item][order(item)]
#> item N
#> <char> <int>
#> 1: item_1 5
#> 2: item_10 5
#> 3: item_11 5
#> 4: item_12 5
#> 5: item_13 5
#> 6: item_14 5
#> 7: item_15 5
#> 8: item_1A 4
#> 9: item_1B 4
#> 10: item_2 5
#> 11: item_3 5
#> 12: item_4 5
#> 13: item_5 5
#> 14: item_6 4
#> 15: item_7 5
#> 16: item_7A 4
#> 17: item_8 5
#> 18: item_9 5
#> 19: item_9A 5
#> 20: item_9B 5
#> item N
#> <char> <int>
The full imported table contains all non-empty 10-K sections from the
five filings. For the rest of the vignette, we focus on three sections
that are easy to interpret: business description (item_1),
risk factors (item_1A), and management discussion
(item_7).
analysis_dt <- filings[
  item %in% c("item_1", "item_1A", "item_7"),
  .(cik, company, filing_type, filing_date, period_of_report,
    fyear, sic, filename, item, text)
]
analysis_dt[, text_chars := nchar(text)]
analysis_dt[, .(company, item, text_chars)][order(company, item)]
#> company item text_chars
#> <char> <char> <int>
#> 1: AAR CORP item_1 35022
#> 2: AAR CORP item_1A 69287
#> 3: AAR CORP item_7 46144
#> 4: ABBOTT LABORATORIES item_1 32937
#> 5: ABBOTT LABORATORIES item_1A 33699
#> 6: ABBOTT LABORATORIES item_7 69225
#> 7: ACME UNITED CORP item_1 14889
#> 8: ACME UNITED CORP item_1A 46771
#> 9: ACME UNITED CORP item_7 16462
#> 10: BK Technologies Corp item_1 30150
#> 11: BK Technologies Corp item_1A 61711
#> 12: BK Technologies Corp item_7 41450
#> 13: WORLDS INC item_1 8226
#> 14: WORLDS INC item_7 64158
Build a Quanteda Corpus
define_corpus() turns the filing-item table into a
quanteda corpus. It uses the filing filename and item name
to create stable document identifiers, while the remaining columns
become document variables.
corp <- define_corpus(data.table::copy(analysis_dt))
#>
#> ── Building corpus from data.table ──
#>
#> ✔ Corpus built with 14 documents
summary(corp, n = 6)
#> Corpus consisting of 14 documents, showing 6 documents:
#>
#> Text Types Tokens Sentences cik
#> 1750_10K_2025_0001410578-25-001475_item_1 1536 5904 223 1750
#> 1750_10K_2025_0001410578-25-001475_item_1A 1871 11621 317 1750
#> 1750_10K_2025_0001410578-25-001475_item_7 1550 7815 242 1750
#> 1800_10K_2024_0001628280-25-007110_item_1 1511 5402 139 1800
#> 1800_10K_2024_0001628280-25-007110_item_1A 1145 5430 161 1800
#> 1800_10K_2024_0001628280-25-007110_item_7 2066 11774 422 1800
#> company filing_type filing_date period_of_report fyear sic
#> AAR CORP 10-K 2025-07-22 2025-05-31 2025 3720
#> AAR CORP 10-K 2025-07-22 2025-05-31 2025 3720
#> AAR CORP 10-K 2025-07-22 2025-05-31 2025 3720
#> ABBOTT LABORATORIES 10-K 2025-02-21 2024-12-31 2024 2834
#> ABBOTT LABORATORIES 10-K 2025-02-21 2024-12-31 2024 2834
#> ABBOTT LABORATORIES 10-K 2025-02-21 2024-12-31 2024 2834
#> filename item text_chars
#> 1750_10K_2025_0001410578-25-001475.htm item_1 35022
#> 1750_10K_2025_0001410578-25-001475.htm item_1A 69287
#> 1750_10K_2025_0001410578-25-001475.htm item_7 46144
#> 1800_10K_2024_0001628280-25-007110.htm item_1 32937
#> 1800_10K_2024_0001628280-25-007110.htm item_1A 33699
#> 1800_10K_2024_0001628280-25-007110.htm item_7 69225
#> filename2
#> 1750_10K_2025_0001410578-25-001475
#> 1750_10K_2025_0001410578-25-001475
#> 1750_10K_2025_0001410578-25-001475
#> 1800_10K_2024_0001628280-25-007110
#> 1800_10K_2024_0001628280-25-007110
#> 1800_10K_2024_0001628280-25-007110
quanteda::docvars(corp)[1:6, c("company", "item", "fyear")]
#> company item fyear
#> 1 AAR CORP item_1 2025
#> 2 AAR CORP item_1A 2025
#> 3 AAR CORP item_7 2025
#> 4 ABBOTT LABORATORIES item_1 2024
#> 5 ABBOTT LABORATORIES item_1A 2024
#> 6 ABBOTT LABORATORIES item_7 2024
At this point, each filing item is a separate document. This is usually the most useful unit for downstream analysis because risk factors, business descriptions, and management discussion sections often answer different research questions.
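For readers who want to see the underlying quanteda step, the sketch below builds a comparable corpus by hand from an invented filing-item table. It is an approximation of the idea, not the implementation of define_corpus(); the table items and the identifier rule (file stem plus item name) are assumptions modeled on the document names shown above.

```r
library(quanteda)

# Toy filing-item table; all values are invented for illustration.
items <- data.frame(
  filename = c("A_10K_2024.htm", "A_10K_2024.htm", "B_10K_2024.htm"),
  item     = c("item_1", "item_7", "item_1"),
  company  = c("Alpha Corp", "Alpha Corp", "Beta Inc"),
  text     = c("We make tools.", "Sales rose this year.", "We sell software."),
  stringsAsFactors = FALSE
)

# Compose a document identifier from the file stem and the item name.
items$doc_id <- paste(
  tools::file_path_sans_ext(items$filename),
  items$item,
  sep = "_"
)

# Columns other than doc_id and text become document variables.
toy_corp <- corpus(items, docid_field = "doc_id", text_field = "text")

docnames(toy_corp)
docvars(toy_corp)
```

Deriving the identifier from stable fields, rather than from row order, means the same filing item keeps the same document name when the table is filtered or re-sorted.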
Tokenize and Build a DFM
tokenize_corpus() wraps quanteda::tokens()
and keeps document order stable. For small examples, a single core is
enough. On larger corpora, the same function can split the corpus into
chunks and process them through the package’s parallel helper.
toks <- tokenize_corpus(
  corp,
  ncores = 1,
  remove_punct = TRUE,
  remove_numbers = TRUE,
  remove_symbols = TRUE
)
#>
#> ── Tokenizing corpus ──
#>
#> ℹ quanteda::tokens() has been called with user parameters
#> → remove_punct = TRUE
#> → remove_numbers = TRUE
#> → remove_symbols = TRUE
#> ℹ Tokenizing sequentially
#> ✔ Corpus successfully tokenized
toks <- quanteda::tokens_tolower(toks)
toks <- quanteda::tokens_remove(toks, pattern = quanteda::stopwords("en"))
dfmat <- quanteda::dfm(toks)
dfmat <- quanteda::dfm_trim(dfmat, min_termfreq = 5)
dfmat
#> Document-feature matrix of: 14 documents, 1,847 features (58.41% sparse) and 11 docvars.
#> features
#> docs item business general aar corp
#> 1750_10K_2025_0001410578-25-001475_item_1 3 28 8 27 2
#> 1750_10K_2025_0001410578-25-001475_item_1A 1 94 5 0 0
#> 1750_10K_2025_0001410578-25-001475_item_7 3 30 8 0 0
#> 1800_10K_2024_0001628280-25-007110_item_1 1 23 6 0 0
#> 1800_10K_2024_0001628280-25-007110_item_1A 9 36 2 0 0
#> 1800_10K_2024_0001628280-25-007110_item_7 5 27 5 0 0
#> features
#> docs subsidiaries herein company us
#> 1750_10K_2025_0001410578-25-001475_item_1 1 1 7 13
#> 1750_10K_2025_0001410578-25-001475_item_1A 2 0 6 50
#> 1750_10K_2025_0001410578-25-001475_item_7 1 0 5 9
#> 1800_10K_2024_0001628280-25-007110_item_1 1 0 1 0
#> 1800_10K_2024_0001628280-25-007110_item_1A 0 0 1 0
#> 1800_10K_2024_0001628280-25-007110_item_7 1 0 6 0
#> features
#> docs unless
#> 1750_10K_2025_0001410578-25-001475_item_1 1
#> 1750_10K_2025_0001410578-25-001475_item_1A 1
#> 1750_10K_2025_0001410578-25-001475_item_7 1
#> 1800_10K_2024_0001628280-25-007110_item_1 0
#> 1800_10K_2024_0001628280-25-007110_item_1A 0
#> 1800_10K_2024_0001628280-25-007110_item_7 1
#> [ reached max_ndoc ... 8 more documents, reached max_nfeat ... 1,837 more features ]
quanteda::topfeatures(dfmat, 15)
#> products may business financial sales company million
#> 514 490 453 397 358 304 301
#> including operations abbott results u.s customers product
#> 284 276 237 225 221 203 198
#> services
#> 192
Singularization is optional, but it is often helpful when the
analysis should count plural and singular nouns together.
singularize_tokens() applies NLPstudio’s internal English
singularization rules to the token vocabulary and then replaces only the
affected token types, so the operation remains fast even when the corpus
is much larger than this example.
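The vocabulary-level idea can be sketched with plain quanteda calls. The rule used here (drop a trailing "s" from types of four or more characters) is a deliberately naive stand-in, not NLPstudio's singularization rules, and the toy texts are invented.

```r
library(quanteda)

toy_toks <- tokens(c(
  d1 = "products and product sales",
  d2 = "customers buy products"
))

# Work on the unique token types, not on every token occurrence.
vocab <- types(toy_toks)

# Naive stand-in rule: types of 4+ characters ending in "s" lose the "s".
plural   <- vocab[nchar(vocab) >= 4 & grepl("s$", vocab)]
singular <- sub("s$", "", plural)

# Replace only the affected types across all documents.
toy_singular <- tokens_replace(
  toy_toks,
  pattern = plural,
  replacement = singular,
  valuetype = "fixed"
)
as.list(toy_singular)
```

The rule is evaluated once per type, so its cost grows with vocabulary size rather than with the total number of tokens.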
singular_toks <- singularize_tokens(toks, ncores = 1, min_char = 4)
#>
#> ── Singularizing tokens ──
#>
#> ℹ Building DFM and extracting vocabulary
#> ℹ Removing tokens containing any number
#> ℹ Removing tokens shorter than 4 characters
#> ℹ Processing sequentially
#> ℹ Replacing plural tokens with singulars
#> ✔ Singularization complete
singular_dfm <- quanteda::dfm(singular_toks)
quanteda::topfeatures(singular_dfm, 15)
#> product business sale financial result company million customer
#> 712 495 397 397 371 345 302 300
#> including operation cost service year abbott condition
#> 284 283 264 261 246 237 221
Apply a Dictionary
NLPstudio ships domain dictionaries as quanteda
dictionary objects. Dictionary lookup is useful when the construct is
known in advance and the goal is to measure document-level exposure to a
curated vocabulary.
fls_toks <- lookup_tokens(
  toks,
  dictionary = data_dictionary_BozanicRoulstoneVanBuskirk_FLS,
  ncores = 1,
  exclusive = FALSE
)
#> ℹ quanteda::tokens_lookup() has been called with user parameters
#> ℹ exclusive = FALSE
#>
#> ── Lookup tokens ──
#>
#> ℹ Lookup sequentially
#> ✔ Lookup complete
fls_dfm <- quanteda::dfm(fls_toks)
quanteda::topfeatures(fls_dfm, 10)
#> fls_bozanicroulstonevanbuskirk products
#> 889 514
#> business financial
#> 453 397
#> sales company
#> 358 304
#> million including
#> 301 284
#> operations abbott
#> 276 237
The same pattern works with the other bundled dictionaries, including firm complexity, corporate social responsibility, sustainable development goals, and the Feng Li forward-looking statement dictionary.
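To turn dictionary hits into a document-level exposure measure, one option is to divide the matched-token count by the document length. The sketch below uses plain quanteda calls and a small invented dictionary (toy_dict), not one of the bundled objects. Unlike the lookup above, it sets exclusive = TRUE so that only dictionary keys remain in the result.

```r
library(quanteda)

# Hypothetical one-key dictionary; entries are illustrative only.
toy_dict <- dictionary(list(
  forward_looking = c("expect*", "anticipate*", "will")
))

toy_toks <- tokens(
  c(d1 = "We expect sales will grow.",
    d2 = "Sales grew last year."),
  remove_punct = TRUE
)
toy_toks <- tokens_tolower(toy_toks)

# exclusive = TRUE keeps only dictionary keys, so counts are easy to read.
hits <- dfm(tokens_lookup(toy_toks, toy_dict, exclusive = TRUE))

# Share of each document's tokens that match the dictionary.
exposure <- ntoken(hits) / ntoken(toy_toks)
exposure
```

Normalizing by document length matters for filings, because sections such as risk factors are often much longer than the business description.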
Reshape and Summarize the Corpus
reshape_corpus() changes the document unit. For example,
item-level documents can be split into sentence-level documents before
computing readability or other document-level statistics.
sentence_corp <- reshape_corpus(corp, to = "sentences", ncores = 1)
#>
#> ── Reshaping corpus ──
#>
#> ℹ Reshaping sequentially
#> ✔ Corpus successfully reshaped
quanteda::ndoc(corp)
#> [1] 14
quanteda::ndoc(sentence_corp)
#> [1] 3101
summarize_corpus() returns a compact table of
document-level corpus statistics from
quanteda.textstats.
corpus_summary <- summarize_corpus(corp, ncores = 1)
#>
#> ── Summarizing corpus ──
#>
#> ℹ Summarizing sequentially
#> ✔ Corpus summarization complete
corpus_summary[1:6]
#> doc_id chars sents tokens types puncts
#> <char> <int> <int> <int> <int> <int>
#> 1: 1750_10K_2025_0001410578-25-001475_item_1 35022 223 5904 1399 755
#> 2: 1750_10K_2025_0001410578-25-001475_item_1A 69287 317 11621 1749 1268
#> 3: 1750_10K_2025_0001410578-25-001475_item_7 46144 242 7815 1391 894
#> 4: 1800_10K_2024_0001628280-25-007110_item_1 32937 139 5402 1398 779
#> 5: 1800_10K_2024_0001628280-25-007110_item_1A 33699 161 5430 1066 698
#> 6: 1800_10K_2024_0001628280-25-007110_item_7 69225 422 11774 1859 1253
#> numbers symbols urls tags emojis
#> <int> <int> <int> <int> <int>
#> 1: 121 16 1 0 0
#> 2: 42 52 0 0 0
#> 3: 244 83 0 0 0
#> 4: 34 149 3 0 0
#> 5: 17 1 0 0 0
#> 6: 655 187 0 0 0
Readability
Readability scores are often useful when filing sections differ
substantially in length and complexity.
calculate_readability() takes the same corpus as input and
returns a data.table.
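For reference, the two measures used in this section are conventionally defined as follows. These are the standard textbook formulas; details such as syllable counting depend on the implementation in quanteda.textstats.

```latex
\begin{aligned}
\text{Flesch} &= 206.835
  - 1.015\,\frac{\text{words}}{\text{sentences}}
  - 84.6\,\frac{\text{syllables}}{\text{words}} \\
\text{Flesch.Kincaid} &= 0.39\,\frac{\text{words}}{\text{sentences}}
  + 11.8\,\frac{\text{syllables}}{\text{words}}
  - 15.59
\end{aligned}
```

As an illustration with made-up inputs, a text averaging 23 words per sentence and 1.9 syllables per word gives Flesch = 206.835 - 23.345 - 160.74 = 22.75 and Flesch.Kincaid = 8.97 + 22.42 - 15.59 = 15.80. Higher Flesch values indicate easier text, while Flesch.Kincaid approximates a US school grade level, so the two scores move in opposite directions.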
readability <- calculate_readability(
  corp,
  ncores = 1,
  measure = c("Flesch", "Flesch.Kincaid")
)
#>
#> ── Calculating readability ──
#>
#> ℹ quanteda.textstats::textstat_readability() has been called with the following parameters
#> ℹ measure = Flesch, Flesch.Kincaid
#> ℹ Computing readability sequentially
#> ✔ Done
readability[1:6]
#> doc_id Flesch Flesch.Kincaid
#> <char> <num> <num>
#> 1: 1750_10K_2025_0001410578-25-001475_item_1 21.686834 16.03184
#> 2: 1750_10K_2025_0001410578-25-001475_item_1A 10.371200 19.96307
#> 3: 1750_10K_2025_0001410578-25-001475_item_7 19.445644 17.74530
#> 4: 1800_10K_2024_0001628280-25-007110_item_1 2.088124 21.34135
#> 5: 1800_10K_2024_0001628280-25-007110_item_1A 6.552317 19.70566
#> 6: 1800_10K_2024_0001628280-25-007110_item_7 26.486946 15.84038
Similarity and Distance
calculate_similarity() and
calculate_distance() work on a quanteda dfm.
They expose quanteda’s internal threaded computation through
ncores, avoiding the incorrect pattern of splitting an
all-pairs matrix across document chunks.
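To see why chunking by document is the wrong pattern here, note that every cell of the result depends on two documents, so a chunk that holds only some rows cannot produce the cross-chunk pairs. The sketch below computes cosine similarity by hand on an invented three-document dfm. It illustrates the all-pairs structure and is not how calculate_similarity() is implemented.

```r
library(quanteda)

toy_dfm <- dfm(tokens(c(
  d1 = "risk sales growth",
  d2 = "risk risk sales",
  d3 = "software cloud growth"
)))

# Dense copy is fine for a toy example; real dfms should stay sparse.
m <- as.matrix(toy_dfm)

# Cosine similarity: dot product of each pair of rows,
# divided by the product of their Euclidean norms.
norms  <- sqrt(rowSums(m^2))
cosine <- (m %*% t(m)) / (norms %o% norms)
round(cosine, 3)
```

The cell for d1 and d3 needs both rows at once, which is why parallelism belongs inside the matrix computation rather than across document chunks.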
similarity <- calculate_similarity(
  dfmat,
  ncores = 1,
  margin = "documents",
  method = "cosine"
)
#>
#> ── Calculating similarity ──
#>
#> ℹ textstat_simil() called with the following parameters
#> → margin = documents
#> → method = cosine
#> ℹ Using 1 thread(s) via quanteda
#> ✔ Done
distance <- calculate_distance(
  dfmat,
  ncores = 1,
  margin = "documents",
  method = "euclidean"
)
#>
#> ── Calculating distance ──
#>
#> ℹ textstat_dist() called with the following parameters
#> → margin = documents
#> → method = euclidean
#> ℹ Using 1 thread(s) via quanteda
#> ✔ Done
round(as.matrix(similarity)[1:5, 1:5], 3)
#> 1750_10K_2025_0001410578-25-001475_item_1
#> 1750_10K_2025_0001410578-25-001475_item_1 1.000
#> 1750_10K_2025_0001410578-25-001475_item_1A 0.533
#> 1750_10K_2025_0001410578-25-001475_item_7 0.611
#> 1800_10K_2024_0001628280-25-007110_item_1 0.366
#> 1800_10K_2024_0001628280-25-007110_item_1A 0.281
#> 1750_10K_2025_0001410578-25-001475_item_1A
#> 1750_10K_2025_0001410578-25-001475_item_1 0.533
#> 1750_10K_2025_0001410578-25-001475_item_1A 1.000
#> 1750_10K_2025_0001410578-25-001475_item_7 0.510
#> 1800_10K_2024_0001628280-25-007110_item_1 0.404
#> 1800_10K_2024_0001628280-25-007110_item_1A 0.615
#> 1750_10K_2025_0001410578-25-001475_item_7
#> 1750_10K_2025_0001410578-25-001475_item_1 0.611
#> 1750_10K_2025_0001410578-25-001475_item_1A 0.510
#> 1750_10K_2025_0001410578-25-001475_item_7 1.000
#> 1800_10K_2024_0001628280-25-007110_item_1 0.244
#> 1800_10K_2024_0001628280-25-007110_item_1A 0.321
#> 1800_10K_2024_0001628280-25-007110_item_1
#> 1750_10K_2025_0001410578-25-001475_item_1 0.366
#> 1750_10K_2025_0001410578-25-001475_item_1A 0.404
#> 1750_10K_2025_0001410578-25-001475_item_7 0.244
#> 1800_10K_2024_0001628280-25-007110_item_1 1.000
#> 1800_10K_2024_0001628280-25-007110_item_1A 0.673
#> 1800_10K_2024_0001628280-25-007110_item_1A
#> 1750_10K_2025_0001410578-25-001475_item_1 0.281
#> 1750_10K_2025_0001410578-25-001475_item_1A 0.615
#> 1750_10K_2025_0001410578-25-001475_item_7 0.321
#> 1800_10K_2024_0001628280-25-007110_item_1 0.673
#> 1800_10K_2024_0001628280-25-007110_item_1A 1.000
round(as.matrix(distance)[1:5, 1:5], 3)
#> 1750_10K_2025_0001410578-25-001475_item_1
#> 1750_10K_2025_0001410578-25-001475_item_1 0.000
#> 1750_10K_2025_0001410578-25-001475_item_1A 268.635
#> 1750_10K_2025_0001410578-25-001475_item_7 179.897
#> 1800_10K_2024_0001628280-25-007110_item_1 177.634
#> 1800_10K_2024_0001628280-25-007110_item_1A 215.815
#> 1750_10K_2025_0001410578-25-001475_item_1A
#> 1750_10K_2025_0001410578-25-001475_item_1 268.635
#> 1750_10K_2025_0001410578-25-001475_item_1A 0.000
#> 1750_10K_2025_0001410578-25-001475_item_7 280.432
#> 1800_10K_2024_0001628280-25-007110_item_1 291.930
#> 1800_10K_2024_0001628280-25-007110_item_1A 249.978
#> 1750_10K_2025_0001410578-25-001475_item_7
#> 1750_10K_2025_0001410578-25-001475_item_1 179.897
#> 1750_10K_2025_0001410578-25-001475_item_1A 280.432
#> 1750_10K_2025_0001410578-25-001475_item_7 0.000
#> 1800_10K_2024_0001628280-25-007110_item_1 243.752
#> 1800_10K_2024_0001628280-25-007110_item_1A 249.530
#> 1800_10K_2024_0001628280-25-007110_item_1
#> 1750_10K_2025_0001410578-25-001475_item_1 177.634
#> 1750_10K_2025_0001410578-25-001475_item_1A 291.930
#> 1750_10K_2025_0001410578-25-001475_item_7 243.752
#> 1800_10K_2024_0001628280-25-007110_item_1 0.000
#> 1800_10K_2024_0001628280-25-007110_item_1A 150.519
#> 1800_10K_2024_0001628280-25-007110_item_1A
#> 1750_10K_2025_0001410578-25-001475_item_1 215.815
#> 1750_10K_2025_0001410578-25-001475_item_1A 249.978
#> 1750_10K_2025_0001410578-25-001475_item_7 249.530
#> 1800_10K_2024_0001628280-25-007110_item_1 150.519
#> 1800_10K_2024_0001628280-25-007110_item_1A 0.000
Export Tables
Most outputs are ordinary tabular objects or can be converted to tabular form. That makes it straightforward to save intermediate results for review, appendix tables, or downstream modeling.
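A minimal sketch of such an export, using data.table::fwrite() on invented stand-ins for the vignette's outputs. The object names, file names, and the use of a temporary directory are assumptions made for illustration.

```r
library(data.table)

# Toy stand-ins for a readability table and a similarity matrix.
toy_readability <- data.table(
  doc_id = c("A_item_1", "A_item_7"),
  Flesch = c(21.7, 19.4)
)
toy_similarity <- matrix(
  c(1, 0.6, 0.6, 1),
  nrow = 2,
  dimnames = list(toy_readability$doc_id, toy_readability$doc_id)
)

out_dir <- tempdir()

# data.table results can be written directly.
fwrite(toy_readability, file.path(out_dir, "readability.csv"))

# Matrix-like results are easier to store in long form:
# one row per document pair.
sim_long <- as.data.table(as.table(toy_similarity))
setnames(sim_long, c("doc_1", "doc_2", "similarity"))
fwrite(sim_long, file.path(out_dir, "similarity_long.csv"))
```

With the real objects, as.matrix(similarity) would take the place of toy_similarity. The long format keeps document identifiers as ordinary columns, which makes it easy to join the result back to filing metadata.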
Optional spaCy Parsing
parse_corpus() wraps spacyr for
part-of-speech parsing and related annotations. It requires a working
Python spaCy installation, so it is shown but not evaluated in this
vignette.
spacyr::spacy_initialize(model = "en_core_web_sm")
parsed <- parse_corpus(
  corp,
  ncores = 1,
  lemma = TRUE,
  pos = TRUE
)
head(parsed)
Workflow Map
The non-topic-model API is designed to be composable:
| Task | Primary function |
|---|---|
| Flatten SEC-style JSON | from_json_to_df() |
| Build a quanteda corpus | define_corpus() |
| Tokenize text | tokenize_corpus() |
| Singularize tokens | singularize_tokens() |
| Apply dictionaries | lookup_tokens() |
| Reshape documents | reshape_corpus() |
| Summarize corpus | summarize_corpus() |
| Compute readability | calculate_readability() |
| Compute similarity | calculate_similarity() |
| Compute distance | calculate_distance() |
| Optional spaCy parsing | parse_corpus() |