Add support for RFC 7464 JSON text sequences and comma-delimited documents#2664
Open
jaja360 wants to merge 3 commits intosimdjson:masterfrom
Open
Add support for RFC 7464 JSON text sequences and comma-delimited documents#2664jaja360 wants to merge 3 commits intosimdjson:masterfrom
jaja360 wants to merge 3 commits intosimdjson:masterfrom
Conversation
Member
|
@cjauvin might be interested. |
Member
|
@jaja360 To be reviewed. Soon. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR adds support for two additional JSON streaming formats via the existing
parse_manyanditerate_manyAPIs:application/json-seq) - Documents prefixed with RS (0x1E) character{"a":1},{"b":2})RFC 7464 was previously entirely unsupported, whereas Comma-delimited documents
were supported but only in single-batch mode without threading (via the
now-deprecated
allow_comma_separatedparameter). This PR implements bothformats with proper multi-batch processing and threading support.
Related issues:
iterate_many#2252 (this PR doesn't solve this issue, but makes it simpler to solve in the future)Type of change
How to verify / test
New test coverage
tests/dom/document_stream_tests.cpp: Addedrfc7464_tests()andcomma_delimited_tests()tests/ondemand/ondemand_document_stream_tests.cpp: Equivalent tests for On-Demand APIRunning the new tests
Performance verification
Benchmark results (AMD Zen 5, Ryzen 9 9900X)
Performance overview:
filtering. Since this format is currently unsupported, I don't think that's
a concern. We can optimize further by integrating the RS handling into
stage1 (instead of the small post-processing in this PR) in the future if needed.
than NDJSON due to the post-stage1 filtering. However, this is outweighted by
the speedup we get from the threaded support (which we didn't have in the old
single-batch implementation).
1. NDJSON regression test (baseline vs final)
2. Stream format comparison
3. Comma-delimited: old API vs new API
allow_comma_separated=truestream_format::comma_delimitedImplementation notes
API additions
Deprecation
The
allow_comma_separatedboolean parameter is deprecated in favor ofstream_format::comma_delimited:Checklist before submitting