Add support for RFC 7464 JSON text sequences and comma-delimited documents #2664

Open
jaja360 wants to merge 3 commits into simdjson:master from jaja360:rfc7464-batch-support

@jaja360
Contributor

@jaja360 jaja360 commented Apr 6, 2026

Description

This PR adds support for two additional JSON streaming formats via the existing parse_many and iterate_many APIs:

  1. RFC 7464 JSON text sequences (application/json-seq): each document is prefixed with an RS (0x1E) character
  2. Comma-delimited documents: documents are separated by commas (e.g., {"a":1},{"b":2})
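
To make the RFC 7464 framing concrete, here is a minimal standalone sketch that does not use simdjson at all. It assumes the framing defined by RFC 7464 (RS, then the JSON text, then LF); the helper names `encode_json_seq` and `split_json_seq` are hypothetical and are not part of this PR.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// RFC 7464 framing: each JSON text is preceded by RS (0x1E) and followed by LF.
constexpr char RS = '\x1e';

// Hypothetical helper: wrap documents into an application/json-seq buffer.
std::string encode_json_seq(const std::vector<std::string>& docs) {
  std::string out;
  for (const auto& doc : docs) {
    out.push_back(RS);
    out += doc;
    out.push_back('\n');
  }
  return out;
}

// Hypothetical helper: split a buffer on RS and trim trailing whitespace.
// It performs no JSON validation. Splitting on RS is safe for valid JSON
// because raw control characters may not appear inside JSON strings.
std::vector<std::string_view> split_json_seq(std::string_view input) {
  std::vector<std::string_view> docs;
  std::size_t pos = input.find(RS);
  while (pos != std::string_view::npos) {
    const std::size_t next = input.find(RS, pos + 1);
    std::string_view rec = (next == std::string_view::npos)
                               ? input.substr(pos + 1)
                               : input.substr(pos + 1, next - pos - 1);
    while (!rec.empty() && (rec.back() == '\n' || rec.back() == '\r' ||
                            rec.back() == ' ' || rec.back() == '\t')) {
      rec.remove_suffix(1);
    }
    if (!rec.empty()) docs.push_back(rec);  // skip empty records
    pos = next;
  }
  return docs;
}
```

The returned views point into the input buffer, so the buffer must outlive them.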

RFC 7464 was previously entirely unsupported, whereas comma-delimited documents
were supported only in single-batch mode without threading (via the
now-deprecated allow_comma_separated parameter). This PR implements both
formats with proper multi-batch processing and threading support.
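
For readers unfamiliar with the comma-delimited format, the sketch below shows why a splitter cannot simply cut at every comma: only commas at nesting depth zero and outside strings separate documents. This is a scalar illustration with a hypothetical function name (`split_top_level_commas`), not the approach taken in this PR, which works on simdjson's stage 1 output.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: split on commas that are outside strings and outside
// any object or array. No validation and no whitespace trimming is done.
std::vector<std::string_view> split_top_level_commas(std::string_view input) {
  std::vector<std::string_view> docs;
  int depth = 0;           // current object/array nesting depth
  bool in_string = false;  // inside a JSON string
  bool escaped = false;    // previous character was a backslash in a string
  std::size_t start = 0;   // start offset of the current document
  for (std::size_t i = 0; i < input.size(); ++i) {
    const char c = input[i];
    if (in_string) {
      if (escaped) escaped = false;
      else if (c == '\\') escaped = true;
      else if (c == '"') in_string = false;
      continue;
    }
    switch (c) {
      case '"': in_string = true; break;
      case '{': case '[': ++depth; break;
      case '}': case ']': --depth; break;
      case ',':
        if (depth == 0) {  // a document boundary, not an inner separator
          docs.push_back(input.substr(start, i - start));
          start = i + 1;
        }
        break;
      default: break;
    }
  }
  if (start < input.size()) docs.push_back(input.substr(start));
  return docs;
}
```

For the input `{"a":[1,2]},{"b":"x,y"},3` this yields three documents; the comma inside the array and the comma inside the string are left alone.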

Related issues:

Type of change

  • Bug fix
  • Optimization
  • New feature
  • Refactor / cleanup
  • Documentation / tests
  • Other (please describe):

How to verify / test

New test coverage

  • tests/dom/document_stream_tests.cpp: Added rfc7464_tests() and comma_delimited_tests()
  • tests/ondemand/ondemand_document_stream_tests.cpp: Equivalent tests for On-Demand API

Running the new tests

cmake -B build -D SIMDJSON_DEVELOPER_MODE=ON
cmake --build build -j$(nproc)

# Run the document stream tests (includes both RFC 7464 and comma-delimited tests)
ctest --test-dir build --output-on-failure -R 'document_stream_tests'

Performance verification

cmake --build build --target bench_stream_formats
./build/benchmark/bench_stream_formats --benchmark_repetitions=10

Benchmark results (AMD Zen 5, Ryzen 9 9900X)

Performance overview:

  • NDJSON: No regression observed. Performance variations are within measurement noise.
  • RFC 7464: Slightly slower than NDJSON, due to the additional post-stage1
    filtering. Since this format is currently unsupported, I don't think that's
    a concern. We can optimize further by integrating the RS handling into
    stage1 (instead of the small post-processing in this PR) in the future if needed.
  • Comma-delimited: Like the RFC 7464 format, comma-delimited is slightly slower
    than NDJSON due to the post-stage1 filtering. However, this is outweighed by
    the speedup we get from the threaded support (which we didn't have in the old
    single-batch implementation).
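
As a rough picture of what "post-stage1 filtering" could mean, here is a standalone sketch. It assumes that stage 1 produces a sorted array of byte offsets for structural characters and that RS bytes show up in that array; whether the PR does it this way is not stated in this description. The type `filtered_index` and the function `drop_rs_entries` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical result type for the filtering pass.
struct filtered_index {
  std::vector<std::uint32_t> structurals;  // offsets with RS entries removed
  std::vector<std::size_t> doc_starts;     // position in `structurals` where each document begins
};

// Hypothetical pass: one linear scan over the index that drops RS offsets
// and records where each document starts. Assumes every offset is in range.
filtered_index drop_rs_entries(std::string_view buf,
                               const std::vector<std::uint32_t>& index) {
  filtered_index out;
  out.structurals.reserve(index.size());
  bool pending_start = false;  // an RS was seen; the next entry starts a document
  for (const std::uint32_t offset : index) {
    if (buf[offset] == '\x1e') {
      pending_start = true;
      continue;  // RS itself is not a JSON structural character
    }
    if (pending_start) {
      out.doc_starts.push_back(out.structurals.size());
      pending_start = false;
    }
    out.structurals.push_back(offset);
  }
  return out;
}
```

A pass like this touches every index entry once, which would be consistent with a small, roughly constant per-structural cost on top of NDJSON.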

1. NDJSON regression test (baseline vs final)

| Benchmark | Baseline | Final | Change |
|---|---|---|---|
| On-Demand Small (1M docs) | 6.61 GiB/s | 6.77 GiB/s | +2.5% |
| On-Demand Large (30K docs) | 22.70 GiB/s | 22.96 GiB/s | +1.2% |
| DOM Small | 3.17 GiB/s | 3.10 GiB/s | -2.2% |
| DOM Large | 21.36 GiB/s | 22.09 GiB/s | +3.4% |

2. Stream format comparison

| Format | API | Small Docs (16B payload) | Large Docs (4KB payload) |
|---|---|---|---|
| NDJSON | On-Demand | 6.77 GiB/s | 22.96 GiB/s |
| RFC 7464 | On-Demand | 6.43 GiB/s | 22.10 GiB/s |
| Comma-delimited | On-Demand | 4.83 GiB/s | 21.57 GiB/s |
| NDJSON | DOM | 3.10 GiB/s | 22.09 GiB/s |
| RFC 7464 | DOM | 2.93 GiB/s | 21.34 GiB/s |
| Comma-delimited | DOM | 3.06 GiB/s | 20.86 GiB/s |

3. Comma-delimited: old API vs new API

| Benchmark | Old allow_comma_separated=true | New stream_format::comma_delimited | Speedup |
|---|---|---|---|
| Small docs (2M × 68B) | 3.08 GiB/s | 4.83 GiB/s | 1.57x |
| Large docs (31K × 4KB) | 23.26 GiB/s | 21.57 GiB/s | 0.93x |
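
The Speedup column is the new throughput divided by the old throughput. A small sketch of that arithmetic, with a hypothetical helper name and rounding to two decimals:

```cpp
#include <cmath>

// Hypothetical helper: throughput ratio (new / old), rounded to two decimals.
// A value above 1.0 means the new API is faster.
double speedup(double old_gib_s, double new_gib_s) {
  return std::round(new_gib_s / old_gib_s * 100.0) / 100.0;
}
```

With the figures above, `speedup(3.08, 4.83)` gives 1.57 and `speedup(23.26, 21.57)` gives 0.93, so the gain applies to small documents while large documents are about 7% slower with the new API.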

Implementation notes

API additions

// New enum in simdjson/base.h
enum class stream_format {
  whitespace_delimited,  // NDJSON, whitespace-separated (default)
  json_sequence,         // RFC 7464: RS-delimited
  comma_delimited        // Comma-separated documents
};

// DOM API
simdjson_result<document_stream> parser.parse_many(
    padded_string_view input,
    size_t batch_size,
    stream_format format);

// On-Demand API
simdjson_result<document_stream> parser.iterate_many(
    padded_string_view input,
    size_t batch_size,
    stream_format format);

Deprecation

The allow_comma_separated boolean parameter is deprecated in favor of stream_format::comma_delimited:

// Old (deprecated)
parser.iterate_many(input, batch_size, true);

// New (recommended)
parser.iterate_many(input, batch_size, stream_format::comma_delimited);
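
One common way to keep a boolean overload working while steering callers to an enum is a thin forwarding shim. The sketch below is standalone and only illustrates the pattern; `from_legacy_flag` and `open_stream` are hypothetical stand-ins, and the PR's actual overloads may be organized differently.

```cpp
#include <string_view>

// Mirrors the enum shown in the PR description.
enum class stream_format { whitespace_delimited, json_sequence, comma_delimited };

// Hypothetical helper: translate the legacy boolean into the new enum.
constexpr stream_format from_legacy_flag(bool allow_comma_separated) {
  return allow_comma_separated ? stream_format::comma_delimited
                               : stream_format::whitespace_delimited;
}

// Hypothetical stand-in for the enum-based entry point. It only reports the
// format it was asked to use; a real parser would start a document stream.
constexpr stream_format open_stream(std::string_view /*input*/, stream_format format) {
  return format;
}

// Hypothetical stand-in for the deprecated boolean overload: it forwards to
// the enum version, so both paths share one implementation.
[[deprecated("use the stream_format overload")]]
constexpr stream_format open_stream(std::string_view input, bool allow_comma_separated) {
  return open_stream(input, from_legacy_flag(allow_comma_separated));
}
```

Because `stream_format` is a scoped enum, it never converts implicitly to `bool`, so the two overloads cannot be confused at a call site.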

Checklist before submitting

  • I added/updated tests covering my change (if applicable)
  • Code builds locally and passes all tests
  • Documentation / README updated if needed
  • Commits are atomic and messages are clear
  • I linked the related issue (if applicable)
  • Benchmark results show no regression for existing functionality

@lemire
Member

lemire commented Apr 7, 2026

@cjauvin might be interested.

@lemire
Member

lemire commented Apr 8, 2026

@jaja360 To be reviewed. Soon.
