Converts a vector of JSON file paths into a unified data.table::data.table suitable for downstream analysis. The function is optimized for large-scale input (thousands of JSON files) and leverages both chunking and user-selected parallel backends to remain efficient and memory-safe.
Usage
from_json_to_df(
  files,
  ncores = 1,
  nchunks = ncores,
  socket = c("PSOCK", "FORK"),
  drop_late_filers = FALSE,
  what = NULL,
  drop_empty_text = TRUE,
  max_chunk_size = NULL,
  ...
)
Arguments
- files
Character vector of JSON file paths.
- ncores
Integer. Number of CPU cores to use for parallel processing. Defaults to 1 (sequential).
- nchunks
Integer. Number of chunks to split the input file vector into. Defaults to ncores. The chunk size is computed as ceiling(length(files) / nchunks). Ignored if max_chunk_size is explicitly provided.
- socket
Character. Parallel backend to use. One of "PSOCK" (default, recommended) or "FORK". On Windows, "FORK" is not supported and will trigger an error.
- drop_late_filers
Logical. If TRUE, removes filings considered "late" (filing year greater than fiscal year + 1). Default FALSE.
- what
Character or NULL. Recommended explicit selector for the JSON family being imported. Supported values are "10-K", "10-Q", "8-K", and "loan". When NULL (default), the function infers the family from the JSON keys for backward compatibility.
- drop_empty_text
Logical. If TRUE (default), drops rows where the extracted section text is empty or missing after melting.
- max_chunk_size
Integer. If provided, sets the exact number of files per chunk, overriding the value derived from nchunks. Use this for fine-grained memory control when file sizes vary significantly.
- ...
Additional arguments passed to internal processing steps.
Value
A data.table::data.table with one row per (document × item) and columns:
- cik: Central Index Key (integer)
- filing_date, period_of_report (IDate)
- fyear: fiscal year (integer)
- sic: industry code (integer)
- item: item identifier (character)
- text: filing text content (character)
- plus any additional metadata extracted upstream
Details
Internally the function proceeds in three phases:
1. Read & parse: The input file paths are divided into chunks. By default, chunk size is derived from nchunks as ceiling(length(files) / nchunks). If max_chunk_size is explicitly provided, it overrides this calculation. Each chunk is read in parallel with RcppSimdJson::fload() and converted to data tables.
2. Reshape: Each parsed table is normalized and reshaped from wide to long format in parallel. The selected text columns depend on what: "10-K" uses item_*/section_*, "10-Q" uses part_* and part_*_item_*, "8-K" uses item_*, and "loan" uses the canonical section names produced by sec-crawler. When what = NULL, the function infers the family from the JSON keys for backward compatibility.
3. Combine & clean: Melted tables are bound together, date columns are converted, the fiscal year (fyear) is derived, late filers are optionally dropped, and column order is standardized.
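The reshape and clean steps can be illustrated on a toy table. This is only a sketch under assumptions, not the package's internal code: it assumes the data.table package is installed, the column names and values are made up, and the filters mirror the documented meaning of drop_empty_text and drop_late_filers (filing year greater than fiscal year + 1).

```r
library(data.table)

# Toy wide table standing in for one parsed chunk (all values are made up).
wide <- data.table(
  cik = c(1001L, 1002L),
  filing_date = as.IDate(c("2021-03-01", "2023-02-15")),
  fyear = c(2020L, 2020L),
  item_1 = c("Business overview", "Another business"),
  item_1A = c("Risk factors", "")
)

# Reshape: one row per (document x item), treating item_* columns as text.
text_cols <- grep("^item_", names(wide), value = TRUE)
long <- melt(
  wide,
  id.vars = c("cik", "filing_date", "fyear"),
  measure.vars = text_cols,
  variable.name = "item",
  value.name = "text",
  variable.factor = FALSE
)

# Analogue of drop_empty_text = TRUE: drop missing or empty section text.
long <- long[!is.na(text) & nzchar(text)]

# Analogue of drop_late_filers = TRUE: keep filing year <= fiscal year + 1.
on_time <- long[year(filing_date) <= fyear + 1L]
```

Here the second toy document is filed in 2023 for fiscal year 2020, so it counts as late and is removed by the last filter.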
Parallel backends are controlled with the socket argument:
- When socket = "PSOCK", parallel::clusterApplyLB() is used, which dynamically balances work across workers and is portable across operating systems.
- When socket = "FORK", parallel::mclapply() is used, which can be faster on Linux/macOS because it avoids copying large objects to workers.
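A minimal sketch of how such a backend switch might look. The run_chunks() helper below is hypothetical and is not the package's actual implementation; it only shows the two parallel calls named above side by side, using the base parallel package.

```r
library(parallel)

# Hypothetical dispatcher (illustrative only): apply `fun` to every chunk
# with the backend selected by `socket`.
run_chunks <- function(chunks, fun, ncores = 1L, socket = c("PSOCK", "FORK")) {
  socket <- match.arg(socket)
  if (socket == "FORK") {
    if (.Platform$OS.type == "windows") {
      stop("socket = \"FORK\" is not supported on Windows.")
    }
    # Forked workers share the parent's memory, so nothing is exported.
    return(mclapply(chunks, fun, mc.cores = ncores))
  }
  # PSOCK: start fresh R sessions and hand out chunks as workers free up.
  cl <- makeCluster(ncores, type = "PSOCK")
  on.exit(stopCluster(cl), add = TRUE)
  clusterApplyLB(cl, chunks, fun)
}

# Pick a backend that is available on the current platform.
backend <- if (.Platform$OS.type == "windows") "PSOCK" else "FORK"
chunks <- list(1:3, 4:6, 7:9)
res <- run_chunks(chunks, sum, ncores = 2, socket = backend)
```

Both branches return a list with one element per chunk, in input order, so downstream code does not need to know which backend ran.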
Chunking strategy
The function offers two ways to control chunking:
- nchunks (default): Set the number of batches. With 7000 files and nchunks = 4, each batch contains 1750 files.
- max_chunk_size (explicit): Set the exact batch size. With 7000 files and max_chunk_size = 500, you get 14 batches of 500 files each.
If both are supplied, max_chunk_size takes precedence.
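The arithmetic behind both options can be checked with a small sketch. The chunk_files() helper is hypothetical (it is not exported by the package) and simply applies the documented rule: chunk size is ceiling(length(files) / nchunks) unless max_chunk_size is given.

```r
# Hypothetical helper (not part of the package): split a file vector into chunks.
chunk_files <- function(files, nchunks = 1L, max_chunk_size = NULL) {
  # max_chunk_size, when given, overrides the size derived from nchunks.
  chunk_size <- if (is.null(max_chunk_size)) {
    ceiling(length(files) / nchunks)
  } else {
    max_chunk_size
  }
  # Consecutive files share a chunk index; split() groups them by that index.
  chunk_id <- ceiling(seq_along(files) / chunk_size)
  unname(split(files, chunk_id))
}

# Placeholder file names, used only to reproduce the 7000-file example.
files <- sprintf("filing_%04d.json", 1:7000)

by_count <- chunk_files(files, nchunks = 4)
by_size <- chunk_files(files, max_chunk_size = 500)

lengths(by_count)  # four chunks of 1750 files
length(by_size)    # 14 chunks of 500 files
```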
Efficiency considerations
- RcppSimdJson is used for parsing, which is significantly faster than base or jsonlite parsers on large files.
- Chunking is controlled by nchunks for load balancing, while max_chunk_size provides an upper bound on files per chunk to prevent memory overload.
- socket = "FORK" is generally preferred on Linux/macOS for speed, while socket = "PSOCK" is more portable and provides dynamic load balancing.
Examples
if (FALSE) { # interactive()
  # Requires the optional RcppSimdJson package and a directory of SEC-style
  # JSON filings supplied by the user.
  files <- list.files(
    "data/json_filings",
    pattern = "\\.json$",
    recursive = TRUE,
    full.names = TRUE
  )
  dt <- from_json_to_df(
    files,
    what = "10-K",
    ncores = 2
  )
  head(dt)
}
