Commit e4703a3

Even safer.
1 parent 8a22812 commit e4703a3

10 files changed

Lines changed: 75 additions & 22 deletions

README.md

Lines changed: 2 additions & 2 deletions
@@ -70,15 +70,15 @@ To simplify the engineering, we make some assumptions.
 - We support UTF-8 (and thus ASCII), nothing else (no Latin, no UTF-16).
 - We assume AVX2 support which is available in all recent mainstream x86 processors produced by AMD and Intel. No support for non-x86 processors is included.
 - We only support GNU GCC and LLVM Clang at this time. There is no support for Microsoft Visual Studio, though it should not be difficult.
-- We expect the input memory pointer to be padded (e.g., with spaces) so that it can be read entirely in blocks of 512 bits (a cache line). In practice, this means that users may allocate the memory where the JSON bytes are located using the `allocate_padded_buffer` function or the equivalent. Of course, the data you may want to process could be on a buffer that does have this padding. However, copying the data is relatively cheap (much cheaper than parsing JSON), and we can eventually remove this constraint.
+- We expect the input memory to be readable up to 32 bytes beyond the end of the JSON document (to support fast vector loads). All bytes beyond the end of the JSON document are ignored (can be garbage) and the JSON document does not need to be NULL terminated. You can allocate a properly overallocated memory region with the provided `allocate_padded_buffer` function or simply by allocating your memory with extra capacity (`malloc(length + SIMDJSON_PADDING)`).
 
 ## Features
 
 - We parse integers and floating-point numbers as separate types which allows us to support large 64-bit integers.
 - We do full UTF-8 validation as part of the parsing. (Parsers like fastjson, gason and dropbox json11 do not do UTF-8 validation.)
 - We fully validate the numbers. (Parsers like gason and ultranjson will accept `[0e+]` as valid JSON.)
 - We validate string content for unescaped characters. (Parsers like fastjson and ultrajson accept unescaped line breaks and tags in strings.)
-- The input string is unmodified.
+- The input string is unmodified. (Parsers like sajson and RapidJSON overwrite the input string.)
 
 ## Architecture
 

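The padding requirement stated in the README change above can be met by copying the document into an over-allocated buffer. The following is a minimal sketch, not the library's own code: `kPadding`, `PaddedBuffer`, and `copy_with_padding` are hypothetical names, and `kPadding` is assumed to equal `SIMDJSON_PADDING` (`sizeof(__m256i)`, which is 32 bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <string_view>

// Assumption: same value as SIMDJSON_PADDING in the diff (sizeof(__m256i) == 32).
constexpr std::size_t kPadding = 32;

// Releases memory obtained from malloc.
struct FreeDeleter {
  void operator()(char *p) const { std::free(p); }
};
using PaddedBuffer = std::unique_ptr<char, FreeDeleter>;

// Hypothetical helper: copies `json` into a buffer that remains readable for
// kPadding bytes past the end of the document. Returns null on failure.
PaddedBuffer copy_with_padding(std::string_view json) {
  assert(json.size() <= SIZE_MAX - kPadding);  // guard against size overflow
  char *raw = static_cast<char *>(std::malloc(json.size() + kPadding));
  if (raw == nullptr) return PaddedBuffer(nullptr);
  if (!json.empty()) std::memcpy(raw, json.data(), json.size());
  // The README says the trailing bytes may be garbage; zeroing them here is
  // optional and only makes the buffer contents deterministic.
  std::memset(raw + json.size(), 0, kPadding);
  return PaddedBuffer(raw);
}
```

A caller would then pass `buffer.get()` and the original document length (not the padded length) to the parser.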
include/simdjson/common_defs.h

Lines changed: 4 additions & 0 deletions
@@ -2,6 +2,10 @@
 
 #include <cassert>
 
+// the input buf should be readable up to buf + SIMDJSON_PADDING
+#define SIMDJSON_PADDING sizeof(__m256i)
+
+
 typedef unsigned char u8;
 typedef unsigned short u16;
 typedef unsigned int u32;

include/simdjson/jsonparser.h

Lines changed: 15 additions & 0 deletions
@@ -7,18 +7,27 @@
 #include "simdjson/stage2_flatten.h"
 #include "simdjson/stage34_unified.h"
 
+
+
+
 // Parse a document found in buf, need to preallocate ParsedJson.
 // Return false in case of a failure. You can also check validity
 // by calling pj.isValid(). The same ParsedJson can be reused.
+// the input buf should be readable up to buf + len + SIMDJSON_PADDING
+// all bytes at and after buf + len are ignored (can be garbage)
 WARN_UNUSED
 bool json_parse(const u8 *buf, size_t len, ParsedJson &pj);
 
+// the input buf should be readable up to buf + len + SIMDJSON_PADDING
+// all bytes at and after buf + len are ignored (can be garbage)
 WARN_UNUSED
 static inline bool json_parse(const char * buf, size_t len, ParsedJson &pj) {
   return json_parse((const u8 *) buf, len, pj);
 }
 
 // convenience function
+// the input s should be readable up to s.data() + s.size() + SIMDJSON_PADDING
+// all bytes at and after s.data()+s.size() are ignored (can be garbage)
 WARN_UNUSED
 static inline bool json_parse(const std::string_view &s, ParsedJson &pj) {
   return json_parse(s.data(), s.size(), pj);
@@ -27,16 +36,22 @@ static inline bool json_parse(const std::string_view &s, ParsedJson &pj) {
 
 // Build a ParsedJson object. You can check validity
 // by calling pj.isValid(). This does memory allocation.
+// the input buf should be readable up to buf + len + SIMDJSON_PADDING
+// all bytes at and after buf + len are ignored (can be garbage)
 WARN_UNUSED
 ParsedJson build_parsed_json(const u8 *buf, size_t len);
 
 WARN_UNUSED
+// the input buf should be readable up to buf + len + SIMDJSON_PADDING
+// all bytes at and after buf + len are ignored (can be garbage)
 static inline ParsedJson build_parsed_json(const char * buf, size_t len) {
   return build_parsed_json((const u8 *) buf, len);
 }
 
 // convenience function
 WARN_UNUSED
+// the input s should be readable up to s.data() + s.size() + SIMDJSON_PADDING
+// all bytes at and after s.data()+s.size() are ignored (can be garbage)
 static inline ParsedJson build_parsed_json(const std::string_view &s) {
   return build_parsed_json(s.data(), s.size());
 }

jsonchecker/pass06.json

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-true
+true

jsonchecker/pass07.json

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-null
+null

jsonchecker/pass08.json

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-1
+1

jsonchecker/pass09.json

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-false
+false

src/jsonioutil.cpp

Lines changed: 8 additions & 7 deletions
@@ -3,14 +3,15 @@
 
 
 char * allocate_padded_buffer(size_t length) {
-    char *aligned_buffer;
-    size_t paddedlength = ROUNDUP_N(length, 64);
-    // allocate an extra sizeof(__m256i) just so we can always use AVX safely
-    size_t totalpaddedlength = paddedlength + 1 + sizeof(__m256i);
-    if (posix_memalign((void **)&aligned_buffer, 64, totalpaddedlength)) {
-      throw std::runtime_error("Could not allocate sufficient memory");
+    // we could do a simple malloc
+    //return (char *) malloc(length + SIMDJSON_PADDING);
+    // However, we might as well align to cache lines...
+    char *padded_buffer;
+    size_t totalpaddedlength = length + SIMDJSON_PADDING;
+    if (posix_memalign((void **)&padded_buffer, 64, totalpaddedlength)) {
+      return NULL;
     };
-    return aligned_buffer;
+    return padded_buffer;
 }
 
 std::string_view get_corpus(std::string filename) {

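With this change `allocate_padded_buffer` reports failure by returning `NULL` instead of throwing, so callers have to check the result. The sketch below illustrates that contract on a POSIX system under stated assumptions: `allocate_padded`, `kPadding`, and `kCacheLine` are hypothetical stand-ins, not the library's names, and it passes a `void *` to `posix_memalign` rather than casting a `char **`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <stdlib.h>  // posix_memalign (POSIX only)

// Assumption: same value as SIMDJSON_PADDING in the diff (sizeof(__m256i) == 32).
constexpr std::size_t kPadding = 32;
// Cache-line alignment, matching the literal 64 used in the diff.
constexpr std::size_t kCacheLine = 64;

// Hypothetical equivalent of the new allocate_padded_buffer: returns a
// 64-byte-aligned block of length + kPadding bytes, or nullptr on failure.
// The caller releases the block with free().
char *allocate_padded(std::size_t length) {
  assert(length <= SIZE_MAX - kPadding);  // guard against size overflow
  void *block = nullptr;
  if (posix_memalign(&block, kCacheLine, length + kPadding) != 0) {
    return nullptr;
  }
  return static_cast<char *>(block);
}
```

Because the function no longer throws, a caller that previously relied on the exception now needs an explicit null check before writing into the buffer.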
src/jsonparser.cpp

Lines changed: 2 additions & 1 deletion
@@ -14,9 +14,10 @@ bool json_parse(const u8 *buf, size_t len, ParsedJson &pj) {
     isok = flatten_indexes(len, pj);
   } else {
     return false;
-  }
+  }//printf("ok\n");
   if (isok) {
     isok = unified_machine(buf, len, pj);
+    //printf("ok %d \n",isok);
   } else {
     return false;
   }

src/stage34_unified.cpp

Lines changed: 40 additions & 8 deletions
@@ -125,24 +125,54 @@ bool unified_machine(const u8 *buf, size_t len, ParsedJson &pj) {
       }
       break;
     }
-    case 't':
-      if (!is_valid_true_atom(buf + idx)) {
+    case 't': {
+      // we need to make a copy to make sure that the string is NULL terminated.
+      // this only applies to the JSON document made solely of the true value.
+      // this will almost never be called in practice
+      char * copy = (char *) malloc(len + SIMDJSON_PADDING);
+      if(copy == NULL) goto fail;
+      memcpy(copy, buf, len);
+      copy[len] = '\0';
+      if (!is_valid_true_atom((const u8 *)copy + idx)) {
+        free(copy);
         goto fail;
       }
+      free(copy);
       pj.write_tape(0, c);
       break;
-    case 'f':
-      if (!is_valid_false_atom(buf + idx)) {
+    }
+    case 'f': {
+      // we need to make a copy to make sure that the string is NULL terminated.
+      // this only applies to the JSON document made solely of the false value.
+      // this will almost never be called in practice
+      char * copy = (char *) malloc(len + SIMDJSON_PADDING);
+      if(copy == NULL) goto fail;
+      memcpy(copy, buf, len);
+      copy[len] = '\0';
+      if (!is_valid_false_atom((const u8 *)copy + idx)) {
+        free(copy);
         goto fail;
       }
+      free(copy);
       pj.write_tape(0, c);
       break;
-    case 'n':
-      if (!is_valid_null_atom(buf + idx)) {
+    }
+    case 'n': {
+      // we need to make a copy to make sure that the string is NULL terminated.
+      // this only applies to the JSON document made solely of the null value.
+      // this will almost never be called in practice
+      char * copy = (char *) malloc(len + SIMDJSON_PADDING);
+      if(copy == NULL) goto fail;
+      memcpy(copy, buf, len);
+      copy[len] = '\0';
+      if (!is_valid_null_atom((const u8 *)copy + idx)) {
+        free(copy);
         goto fail;
       }
+      free(copy);
       pj.write_tape(0, c);
       break;
+    }
     case '0':
     case '1':
     case '2':
@@ -155,7 +185,8 @@ bool unified_machine(const u8 *buf, size_t len, ParsedJson &pj) {
     case '9': {
       // we need to make a copy to make sure that the string is NULL terminated.
       // this is done only for JSON documents made of a sole number
-      char * copy = (char *) malloc(len + 1 + 64);
+      // this will almost never be called in practice
+      char * copy = (char *) malloc(len + SIMDJSON_PADDING);
       if(copy == NULL) goto fail;
       memcpy(copy, buf, len);
       copy[len] = '\0';
@@ -169,7 +200,8 @@ bool unified_machine(const u8 *buf, size_t len, ParsedJson &pj) {
     case '-': {
       // we need to make a copy to make sure that the string is NULL terminated.
       // this is done only for JSON documents made of a sole number
-      char * copy = (char *) malloc(len + 1 + 64);
+      // this will almost never be called in practice
+      char * copy = (char *) malloc(len + SIMDJSON_PADDING);
       if(copy == NULL) goto fail;
       memcpy(copy, buf, len);
       copy[len] = '\0';

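The five top-level cases above repeat the same allocate, copy, terminate, and free sequence. One way the pattern could be factored out is sketched here; it is an illustration, not part of the commit. `terminated_copy` and `kPadding` are hypothetical names, and `looks_like_true_atom` is only a stand-in for `is_valid_true_atom`, whose body the diff does not show.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <new>

// Assumption: same value as SIMDJSON_PADDING in the diff (sizeof(__m256i) == 32).
constexpr std::size_t kPadding = 32;

// Hypothetical helper: returns a NUL-terminated copy of buf[0, len) with
// kPadding readable bytes starting at offset len, or null on allocation
// failure. The unique_ptr frees the copy on every exit path.
std::unique_ptr<char[]> terminated_copy(const unsigned char *buf,
                                        std::size_t len) {
  assert(len <= SIZE_MAX - kPadding);  // guard against size overflow
  std::unique_ptr<char[]> copy(new (std::nothrow) char[len + kPadding]());
  if (!copy) return nullptr;
  if (len > 0) std::memcpy(copy.get(), buf, len);
  copy[len] = '\0';  // already zero-initialized; kept to mirror the diff
  return copy;
}

// Stand-in for is_valid_true_atom: accepts "true" followed by a terminator,
// JSON whitespace, or a structural character. Requires a NUL-terminated input.
bool looks_like_true_atom(const char *p) {
  if (std::strncmp(p, "true", 4) != 0) return false;
  const char next = p[4];
  return next == '\0' || next == ' ' || next == '\t' || next == '\n' ||
         next == '\r' || next == ',' || next == ']' || next == '}';
}
```

With a helper of this shape, each case would need a single null check and no explicit `free` calls before `goto fail`.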