@@ -369,6 +369,213 @@ This helps as we redefine some new characters as pseudo-structural such as the c
369369
370370> { "foo" : 1.5, "bar" : 1.5 GEOFF_IS_A_DUMMY bla bla , "baz", null }
371371
372+
373+
374+ ### UTF-8 validation (lookup2)
375+
376+ The simdjson library relies on the lookup2 algorithm for UTF-8 validation on x64 platforms.
377+
378+ This algorithm validate the length of multibyte characters (that each multibyte character has the right number of continuation characters, and that all continuation characters are part of a multibyte character).
379+
380+ #### Algorithm
381+
382+ This algorithm compares * expected* continuation characters with * actual* continuation bytes, and emits an error anytime there is a mismatch.
383+
384+ For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
385+ characters, the file will look like this:
386+
387+ | Character | 𝄞 | | | | ₿ | | | ֏ | | a | b |
388+ | -----------------------| ----| ----| ----| ----| ----| ----| ----| ----| ----| ----| ----|
389+ | Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
390+ | Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
391+ | is_second_byte | | X | | | | X | | | X | | |
392+ | is_third_byte | | | X | | | | X | | | | |
393+ | is_fourth_byte | | | | X | | | | | | | |
394+ | expected_continuation | | X | X | X | | X | X | | X | | |
395+ | is_continuation | | X | X | X | | X | X | | X | | |
396+
397+ The errors here are basically (Second Byte OR Third Byte OR Fourth Byte == Continuation):
398+
399+ - ** Extra Continuations:** Any continuation that is not a second, third or fourth byte is not
400+ part of a valid 2-, 3- or 4-byte character and is thus an error. It could be that it's just
401+ floating around extra outside of any character, or that there is an illegal 5-byte character,
402+ or maybe it's at the beginning of the file before any characters have started; but it's an
403+ error in all these cases.
404+ - ** Missing Continuations:** Any second, third or fourth byte that * isn't* a continuation is an error, because that means
405+ we started a new character before we were finished with the current one.
406+
407+ #### Getting the Previous Bytes
408+
409+ Because we want to know if a byte is the * second* (or third, or fourth) byte of a multibyte
410+ character, we need to "shift the bytes" to find that out. This is what they mean:
411+
412+ - ` is_continuation ` : if the current byte is a continuation.
413+ - ` is_second_byte ` : if 1 byte back is the start of a 2-, 3- or 4-byte character.
414+ - ` is_third_byte ` : if 2 bytes back is the start of a 3- or 4-byte character.
415+ - ` is_fourth_byte ` : if 3 bytes back is the start of a 4-byte character.
416+
417+ We use shuffles to go n bytes back, selecting part of the current ` input ` and part of the
418+ ` prev_input ` (search for ` .prev<1> ` , ` .prev<2> ` , etc.). These are passed in by the caller
419+ function, because the 1-byte-back data is used by other checks as well.
420+
421+ #### Getting the Continuation Mask
422+
423+ Once we have the right bytes, we have to get the masks. To do this, we treat UTF-8 bytes as
424+ numbers, using signed ` < ` and ` > ` operations to check if they are continuations or leads.
425+ In fact, we treat the numbers as * signed* , partly because it helps us, and partly because
426+ Intel's SIMD presently only offers signed ` < ` and ` > ` operations (not unsigned ones).
427+
428+ In UTF-8, bytes that start with the bits 110, 1110 and 11110 are 2-, 3- and 4-byte "leads,"
429+ respectively, meaning they expect to have 1, 2 and 3 "continuation bytes" after them.
430+ Continuation bytes start with 10, and ASCII (1-byte characters) starts with 0.
431+
432+ When treated as signed numbers, they look like this:
433+
434+ | Type | High Bits | Binary Range | Signed |
435+ | --------------| ------------| --------------| --------|
436+ | ASCII | ` 0 ` | ` 01111111 ` | 127 |
437+ | | | ` 00000000 ` | 0 |
438+ | 4+-Byte Lead | ` 1111 ` | ` 11111111 ` | -1 |
439+ | | | `11110000 | -16 |
440+ | 3-Byte Lead | ` 1110 ` | ` 11101111 ` | -17 |
441+ | | | `11100000 | -32 |
442+ | 2-Byte Lead | ` 110 ` | ` 11011111 ` | -33 |
443+ | | | `11000000 | -64 |
444+ | Continuation | ` 10 ` | ` 10111111 ` | -65 |
445+ | | | `10000000 | -128 |
446+
447+ This makes it pretty easy to get the continuation mask! It's just a single comparison:
448+
449+ ```
450+ is_continuation = input < -64`
451+ ```
452+
453+ We can do something similar for the others, but it takes two comparisons instead of one: "is
454+ the start of a 4-byte character" is ` < -32 ` and ` > -65 ` , for example. And 2+ bytes is ` < 0 ` and
455+ ` > -64 ` . Surely we can do better, they're right next to each other!
456+
457+ #### Getting the is_xxx Masks: Shifting the Range
458+
459+ Notice * why* continuations were a single comparison. The actual * range* would require two
460+ comparisons--` < -64 ` and ` > -129 ` --but all characters are always greater than -128, so we get
461+ that for free. In fact, if we had * unsigned* comparisons, 2+, 3+ and 4+ comparisons would be
462+ just as easy: 4+ would be ` > 239 ` , 3+ would be ` > 223 ` , and 2+ would be ` > 191 ` .
463+
464+ Instead, we add 128 to each byte, shifting the range up to make comparison easy. This wraps
465+ ASCII down into the negative, and puts 4+-Byte Lead at the top:
466+
467+ | Type | High Bits | Binary Range | Signed |
468+ | ----------------------| ------------| --------------| -------|
469+ | 4+-Byte Lead (+ 127) | ` 0111 ` | ` 01111111 ` | 127 |
470+ | | | `01110000 | 112 |
471+ | ----------------------| ------------| --------------| -------|
472+ | 3-Byte Lead (+ 127) | ` 0110 ` | ` 01101111 ` | 111 |
473+ | | | `01100000 | 96 |
474+ | ----------------------| ------------| --------------| -------|
475+ | 2-Byte Lead (+ 127) | ` 010 ` | ` 01011111 ` | 95 |
476+ | | | `01000000 | 64 |
477+ | ----------------------| ------------| --------------| -------|
478+ | Continuation (+ 127) | ` 00 ` | ` 00111111 ` | 63 |
479+ | | | `00000000 | 0 |
480+ | ----------------------| ------------| --------------| -------|
481+ | ASCII (+ 127) | ` 1 ` | ` 11111111 ` | -1 |
482+ | | | ` 10000000 ` | -128 |
483+ | ----------------------| ------------| --------------| -------|
484+
485+ * Now* we can use signed ` > ` on all of them:
486+
487+ ```
488+ prev1 = input.prev<1>
489+ prev2 = input.prev<2>
490+ prev3 = input.prev<3>
491+ prev1_flipped = input.prev<1>(prev_input) ^ 0x80; // Same as `+ 128`
492+ prev2_flipped = input.prev<2>(prev_input) ^ 0x80; // Same as `+ 128`
493+ prev3_flipped = input.prev<3>(prev_input) ^ 0x80; // Same as `+ 128`
494+ is_second_byte = prev1_flipped > 63;2+-byte lead
495+ is_third_byte = prev2_flipped > 95;3+-byte lead
496+ is_fourth_byte = prev3_flipped > 111; // 4+-byte lead
497+ ```
498+
499+ NOTE: we use ` ^ 0x80 ` instead of ` + 128 ` in the code, which accomplishes the same thing, and even takes the same number
500+ of cycles as ` + ` , but on many Intel architectures can be parallelized better (you can do 3
501+ ` ^ ` 's at a time on Haswell, but only 2 ` + ` 's).
502+
503+ That doesn't look like it saved us any instructions, did it? Well, because we're adding the
504+ same number to all of them, we can save one of those ` + 128 ` operations by assembling
505+ ` prev2_flipped ` out of prev 1 and prev 3 instead of assembling it from input and adding 128
506+ to it. One more instruction saved!
507+
508+ ```
509+ prev1 = input.prev<1>
510+ prev3 = input.prev<3>
511+ prev1_flipped = prev1 ^ 0x80; // Same as `+ 128`
512+ prev3_flipped = prev3 ^ 0x80; // Same as `+ 128`
513+ prev2_flipped = prev1_flipped.concat<2>(prev3_flipped): // <shuffle: take the first 2 bytes from prev1 and the rest from prev3
514+ ```
515+
516+ #### Bringing It All Together: Detecting the Errors
517+
518+ At this point, we have ` is_continuation ` , ` is_first_byte ` , ` is_second_byte ` and ` is_third_byte ` .
519+ All we have left to do is check if they match!
520+
521+ ```
522+ return (is_second_byte | is_third_byte | is_fourth_byte) ^ is_continuation;
523+ ```
524+
525+ But wait--there's more. The above statement is only 3 operations, but they * cannot be done in
526+ parallel* . You have to do 2 ` | ` 's and then 1 ` & ` . Haswell, at least, has 3 ports that can do
527+ bitwise operations, and we're only using 1!
528+
529+ #### Epilogue: Addition For Booleans
530+
531+ There is one big case the above code doesn't explicitly talk about--what if is_second_byte
532+ and is_third_byte are BOTH true? That means there is a 3-byte and 2-byte character right next
533+ to each other (or any combination), and the continuation could be part of either of them!
534+ Our algorithm using ` & ` and ` | ` won't detect that the continuation byte is problematic.
535+
536+ Never fear, though. If that situation occurs, we'll already have detected that the second
537+ leading byte was an error, because it was supposed to be a part of the preceding multibyte
538+ character, but it * wasn't a continuation* .
539+
540+ We could stop here, but it turns out that we can fix it using ` + ` and ` - ` instead of ` | ` and
541+ ` & ` , which is both interesting and possibly useful (even though we're not using it here). It
542+ exploits the fact that in SIMD, a * true* value is -1, and a * false* value is 0. So those
543+ comparisons were giving us numbers!
544+
545+ Given that, if you do ` is_second_byte + is_third_byte + is_fourth_byte ` , under normal
546+ circumstances you will either get 0 (0 + 0 + 0) or -1 (-1 + 0 + 0, etc.). Thus,
547+ ` (is_second_byte + is_third_byte + is_fourth_byte) - is_continuation ` will yield 0 only if
548+ * both* or * neither* are 0 (0-0 or -1 - -1). You'll get 1 or -1 if they are different. Because
549+ * any* nonzero value is treated as an error (not just -1), we're just fine here :)
550+
551+ Further, if * more than one* multibyte character overlaps,
552+ ` is_second_byte + is_third_byte + is_fourth_byte ` will be -2 or -3! Subtracting ` is_continuation `
553+ from * that* is guaranteed to give you a nonzero value (-1, -2 or -3). So it'll always be
554+ considered an error.
555+
556+ One reason you might want to do this is parallelism. ^ and | are not associative, so
557+ (A | B | C) ^ D will always be three operations in a row: either you do A | B -> | C -> ^ D, or
558+ you do B | C -> | A -> ^ D. But addition and subtraction * are* associative: (A + B + C) - D can
559+ be written as ` (A + B) + (C - D) ` . This means you can do A + B and C - D at the same time, and
560+ then adds the result together. Same number of operations, but if the processor can run
561+ independent things in parallel (which most can), it runs faster.
562+
563+ This doesn't help us on Intel, but might help us elsewhere: on Haswell, at least, | and ^ have
564+ a super nice advantage in that more of them can be run at the same time (they can run on 3
565+ ports, while + and - can run on 2)! This means that we can do A | B while we're still doing C,
566+ saving us the cycle we would have earned by using +. Even more, using an instruction with a
567+ wider array of ports can help * other* code run ahead, too, since these instructions can "get
568+ out of the way," running on a port other instructions can't.
569+
570+ #### Epilogue II: One More Trick
571+
572+ There's one more relevant trick up our sleeve, it turns out: it turns out on Intel we can "pay
573+ for" the (prev<1> + 128) instruction, because it can be used to save an instruction in
574+ check_special_cases()--but we'll talk about that there :)
575+
576+
577+
578+
372579## About the Project
373580
374581### Bindings and Ports of simdjson
@@ -420,6 +627,8 @@ make allparsingcompetition
420627Both the ` parsingcompetition ` and ` allparsingcompetition ` tools take a ` -t ` flag which produces
421628a table-oriented output that can be conveniently parsed by other tools.
422629
630+
631+
423632### Various References
424633
425634- [ Google double-conv] ( https://github.com/google/double-conversion/ )
0 commit comments