Convert JSON to data.table

Converts a vector of JSON file paths into a single data.table::data.table suitable for downstream analysis. The function is designed for large inputs (thousands of JSON files) and combines chunking with a user-selected parallel backend to stay fast and keep memory use bounded.

Usage

from_json_to_df(
  files,
  ncores = 1,
  nchunks = ncores,
  socket = c("PSOCK", "FORK"),
  drop_late_filers = FALSE,
  what = NULL,
  drop_empty_text = TRUE,
  max_chunk_size = NULL,
  ...
)

Arguments

files: Character vector of JSON file paths.
ncores: Integer. Number of worker processes for reading and reshaping the JSON files. Defaults to 1 (sequential). JSON ingestion is I/O- and parsing-bound and runs in external C++ code (RcppSimdJson) that has no shared thread pool, so process-level parallelism is retained here, unlike in the quanteda-backed verbs (see tokenize_corpus()).
nchunks: Integer. Number of chunks to split the input file vector into. Defaults to ncores. The chunk size is computed as ceiling(length(files) / nchunks). Ignored if max_chunk_size is provided.
socket: Character. Parallel backend to use. One of "PSOCK" (default, recommended) or "FORK". On Windows, "FORK" is not supported and will trigger an error.
drop_late_filers: Logical. If TRUE, removes filings considered "late" (filing year greater than fiscal year + 1). Default FALSE.
what: Character or NULL. Recommended explicit selector for the JSON family being imported. Supported values are "10-K", "10-Q", "8-K", and "loan". When NULL (default), the function infers the family from the JSON keys for backward compatibility.
drop_empty_text: Logical. If TRUE (default), drop rows where the extracted section text is empty or missing after melting.
max_chunk_size: Integer or NULL (default). If provided, sets the number of files per chunk, overriding the value derived from nchunks. Use this for fine-grained memory control when file sizes vary significantly.
...: Additional arguments passed to internal processing steps.

Value

A data.table::data.table with one row per (document × item) and columns:

cik: Central Index Key (integer)
filing_date, period_of_report: filing date and period of report (IDate)
fyear: fiscal year (integer)
sic: industry code (integer)
item: item identifier (character)
text: filing text content (character)
plus any additional metadata extracted upstream

Details

Internally the function proceeds in three phases:

Read & parse – The input file paths are divided into chunks. By default, chunk size is derived from nchunks as ceiling(length(files) / nchunks). If max_chunk_size is explicitly provided, it overrides this calculation. Each chunk is read in parallel with RcppSimdJson::fload() and converted to data tables.
Reshape – Each parsed table is normalized and reshaped from wide to long format in parallel. The selected text columns depend on what: "10-K" uses item_*/section_*, "10-Q" uses part_* and part_*_item_*, "8-K" uses item_*, and "loan" uses the canonical section names produced by sec-crawler. When what = NULL, the function infers the family from the JSON keys for backward compatibility.
Combine & clean – Melted tables are bound together, date columns converted, fiscal year (fyear) derived, late filers optionally dropped, and column order standardized.
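
The reshape and clean phases can be illustrated on a toy table. This is a minimal sketch and not the package's internal code: the column names item_1 and item_7, the sample values, and the rule that fyear equals the calendar year of period_of_report are assumptions made for illustration. Only the late-filer definition (filing year greater than fiscal year + 1) is taken from the argument description.

```r
library(data.table)

# Toy wide table standing in for parsed 10-K JSON (all values are made up).
wide <- data.table(
  cik = c(1001L, 1002L),
  filing_date = c("2021-03-01", "2023-06-15"),
  period_of_report = c("2020-12-31", "2020-12-31"),
  item_1 = c("Business overview", "Business overview"),
  item_7 = c("MD&A discussion", "")
)

# Reshape: wide to long, one row per (document x item).
text_cols <- grep("^item_", names(wide), value = TRUE)
long <- melt(
  wide,
  measure.vars = text_cols,
  variable.name = "item",
  value.name = "text",
  variable.factor = FALSE
)

# Clean: drop empty or missing text, then convert the date columns.
long <- long[!is.na(text) & nzchar(text)]
long[, filing_date := as.IDate(filing_date)]
long[, period_of_report := as.IDate(period_of_report)]

# Assumed rule for this sketch: fiscal year = year of period_of_report.
long[, fyear := year(period_of_report)]

# Late filers: filing year greater than fiscal year + 1.
on_time <- long[year(filing_date) <= fyear + 1L]
on_time
```

In this toy input the second document is filed in 2023 for fiscal year 2020, so all of its rows are removed by the late-filer filter.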

Parallel backends are controlled with the socket argument:

When socket = "PSOCK", parallel::clusterApplyLB() is used, which dynamically balances work across workers and is portable across operating systems.
When socket = "FORK", parallel::mclapply() is used, which can be faster on Linux/macOS because it avoids copying large objects to workers.
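
The dispatch between the two backends might look roughly like the sketch below. The helper name run_chunks and its arguments are hypothetical and the package's actual internals may differ. Only the parallel package functions named above, plus makeCluster() and stopCluster(), are used.

```r
library(parallel)

# Hypothetical dispatcher: apply `fun` to each chunk with the chosen backend.
run_chunks <- function(chunks, fun, ncores = 1L, socket = c("PSOCK", "FORK")) {
  socket <- match.arg(socket)

  # Sequential fallback when only one worker is requested.
  if (ncores <= 1L) {
    return(lapply(chunks, fun))
  }

  if (socket == "FORK") {
    if (.Platform$OS.type == "windows") {
      stop("socket = \"FORK\" is not supported on Windows.")
    }
    # Forked workers share the parent's memory copy-on-write.
    return(mclapply(chunks, fun, mc.cores = ncores))
  }

  # PSOCK: fresh R sessions; clusterApplyLB() hands the next chunk to
  # whichever worker becomes free first.
  cl <- makeCluster(ncores, type = "PSOCK")
  on.exit(stopCluster(cl), add = TRUE)
  clusterApplyLB(cl, chunks, fun)
}

# Toy workload standing in for "read and reshape one chunk of files".
chunks <- list(1:3, 4:6, 7:9)
res <- run_chunks(chunks, function(x) sum(x), ncores = 2, socket = "PSOCK")
unlist(res)
```

PSOCK workers start as fresh R sessions, so any packages or objects that the worker function needs have to be loaded or exported to the cluster explicitly, whereas forked workers inherit the parent session's state.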

Chunking strategy

The function offers two ways to control chunking:

nchunks (default): Set the number of batches. With 7000 files and nchunks = 4, each batch contains ceiling(7000 / 4) = 1750 files.
max_chunk_size (explicit): Set the batch size directly. With 7000 files and max_chunk_size = 500, you get 14 batches of 500 files each. If max_chunk_size is supplied, it takes precedence over nchunks.
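
The arithmetic behind both modes can be checked with a few lines of base R. This is a minimal sketch rather than the package's own code: the helper name split_into_chunks is hypothetical, and it assumes chunks are filled consecutively in input order.

```r
# Hypothetical helper: split a vector of file paths into consecutive chunks.
# max_chunk_size, when given, overrides the size derived from nchunks.
split_into_chunks <- function(files, nchunks = 1L, max_chunk_size = NULL) {
  chunk_size <- if (is.null(max_chunk_size)) {
    ceiling(length(files) / nchunks)
  } else {
    max_chunk_size
  }
  # Assign each file a chunk index: 1, 1, ..., 2, 2, ...
  chunk_id <- ceiling(seq_along(files) / chunk_size)
  unname(split(files, chunk_id))
}

# Placeholder file names; no files are read in this sketch.
files <- sprintf("filing_%04d.json", 1:7000)

by_count <- split_into_chunks(files, nchunks = 4)
length(by_count)            # 4 chunks
unique(lengths(by_count))   # 1750 files each

by_size <- split_into_chunks(files, nchunks = 4, max_chunk_size = 500)
length(by_size)             # 14 chunks
unique(lengths(by_size))    # 500 files each
```

When length(files) is not a multiple of the chunk size, this sketch places the remainder in a smaller final chunk.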

Efficiency considerations

RcppSimdJson is used for parsing, which is typically much faster than jsonlite on large files.
nchunks controls how the work is split for load balancing, while max_chunk_size caps the number of files per chunk to limit peak memory use.
socket = "FORK" is generally preferred on Linux/macOS for speed, while socket = "PSOCK" is more portable and provides dynamic load balancing.

Examples

if (FALSE) { # interactive()
# Requires the optional RcppSimdJson package and a directory of SEC-style
# JSON filings supplied by the user.
files <- list.files(
  "data/json_filings",
  pattern = "\\.json$",
  recursive = TRUE,
  full.names = TRUE
)

dt <- from_json_to_df(
  files,
  what = "10-K",
  ncores = 2
)

head(dt)
}