diff --git a/CMakeLists.txt b/CMakeLists.txt
Lines changed: 15 additions & 1 deletion
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 1 addition & 0 deletions
diff --git a/HACKING.md b/HACKING.md
Lines changed: 209 additions & 0 deletions
diff --git a/singleheader/amalgamate_demo.cpp b/singleheader/amalgamate_demo.cpp
Lines changed: 1 addition & 1 deletion
diff --git a/singleheader/simdjson.cpp b/singleheader/simdjson.cpp
Lines changed: 15 additions & 15 deletions
@@ -38,13 +38,27 @@ add_subdirectory(singleheader)
 #
 # Compile tools / tests / benchmarks
 #
-
 add_subdirectory(dependencies)
 add_subdirectory(tests)
 add_subdirectory(examples)
 add_subdirectory(benchmark)
 add_subdirectory(fuzz)
 
+#
+# Source files should be just ASCII
+#
+find_program(FIND find)
+find_program(FILE file)
+find_program(GREP grep)
+if((FIND) AND (FILE) AND (GREP))
+    add_test(
+      NAME "just_ascii"
+      COMMAND sh -c "${FIND} include src windows tools singleheader tests examples benchmark -path benchmark/checkperf-reference -prune -o '(' -name '*.h' -o -name '*.cpp' ')' -type f -exec ${FILE} '{}' \; | ${GREP} -v ASCII || exit 0 && exit 1"
+      WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
+    )
+endif()
+
+
 #
 # CPack
 #
 
@@ -40,6 +40,7 @@ We have few hard rules, but we have some:
 
 - Printing to standard output or standard error (`stderr`, `stdout`, `std::cerr`, `std::cout`) in the core library is forbidden. This follows from the [Writing R Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) manual which states that "Compiled code should not write to stdout or stderr".
 - Calls to `abort()` are forbidden in the core library. This follows from the [Writing R Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) manual which states that "Under no circumstances should your compiled code ever call abort or exit".
+- All source code files (.h, .cpp) must be ASCII.
 
 Tools, tests and benchmarks are not held to these same strict rules.
 
 
@@ -369,6 +369,213 @@ This helps as we redefine some new characters as pseudo-structural such as the c
 
 > { "foo" : 1.5, "bar" : 1.5 GEOFF_IS_A_DUMMY bla bla , "baz", null }
 
+
+
+### UTF-8 validation (lookup2)
+
+The simdjson library relies on the lookup2 algorithm for UTF-8 validation on x64 platforms.
+
+This algorithm validates the length of multibyte characters: each multibyte character must have the right number of continuation bytes, and every continuation byte must be part of a multibyte character.
+
+####  Algorithm
+
+This algorithm compares *expected* continuation characters with *actual* continuation bytes, and emits an error anytime there is a mismatch.
+
+For example, in the string "𝄞₿֏ab", which has 4-, 3-, 2- and 1-byte
+characters, the bytes of the file will look like this:
+
+| Character             | 𝄞  |    |    |    | ₿  |    |    | ֏  |    | a  | b  |
+|-----------------------|----|----|----|----|----|----|----|----|----|----|----|
+| Character Length      |  4 |    |    |    |  3 |    |    |  2 |    |  1 |  1 |
+| Byte                  | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
+| is_second_byte        |    |  X |    |    |    |  X |    |    |  X |    |    |
+| is_third_byte         |    |    |  X |    |    |    |  X |    |    |    |    |
+| is_fourth_byte        |    |    |    |  X |    |    |    |    |    |    |    |
+| expected_continuation |    |  X |  X |  X |    |  X |  X |    |  X |    |    |
+| is_continuation       |    |  X |  X |  X |    |  X |  X |    |  X |    |    |
+
+An error is any position where (Second Byte OR Third Byte OR Fourth Byte) != Continuation. There are two kinds:
+
+- **Extra Continuations:** Any continuation that is not a second, third or fourth byte is not
+  part of a valid 2-, 3- or 4-byte character and is thus an error. It could be that it's just
+  floating around extra outside of any character, or that there is an illegal 5-byte character,
+  or maybe it's at the beginning of the file before any characters have started; but it's an
+  error in all these cases.
+- **Missing Continuations:** Any second, third or fourth byte that *isn't* a continuation is an error, because that means
+  we started a new character before we were finished with the current one.
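
The table and the two error kinds above can be modelled one byte at a time. The following is a minimal scalar sketch, not the SIMD code in simdjson: the names `continuation_errors` and `has_continuation_error` are hypothetical, and bytes before the start of the input are assumed to behave like ASCII. Like the algorithm described here, it only checks lengths; it does not reject overlong encodings or a character truncated at the very end of the input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical scalar reference (not the simdjson SIMD code): one flag per byte,
// set to 1 wherever expected_continuation and is_continuation disagree.
std::vector<uint8_t> continuation_errors(const std::string &s) {
  // Byte `back` positions before index i; anything before the input counts as ASCII.
  auto byte_at = [&s](std::size_t i, std::size_t back) -> uint8_t {
    return i >= back ? static_cast<uint8_t>(s[i - back]) : uint8_t{0};
  };
  std::vector<uint8_t> errors(s.size(), 0);
  for (std::size_t i = 0; i < s.size(); i++) {
    const bool is_second_byte = byte_at(i, 1) >= 0xC0; // 2-, 3- or 4-byte lead one back
    const bool is_third_byte = byte_at(i, 2) >= 0xE0;  // 3- or 4-byte lead two back
    const bool is_fourth_byte = byte_at(i, 3) >= 0xF0; // 4-byte lead three back
    const bool expected_continuation = is_second_byte || is_third_byte || is_fourth_byte;
    const bool is_continuation = (byte_at(i, 0) & 0xC0) == 0x80; // bits 10xxxxxx
    errors[i] = expected_continuation != is_continuation ? 1 : 0;
  }
  return errors;
}

// True if any byte position is flagged.
bool has_continuation_error(const std::string &s) {
  for (uint8_t e : continuation_errors(s)) {
    if (e) return true;
  }
  return false;
}
```

Calling `has_continuation_error` on the example string from the table reports no error, while a stray `0x80` after an ASCII byte (extra continuation) or an ASCII byte right after `E2 82` (missing continuation) is flagged.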
+
+####  Getting the Previous Bytes
+
+Because we want to know if a byte is the *second* (or third, or fourth) byte of a multibyte
+character, we need to "shift the bytes" to find that out. This is what the masks mean:
+
+- `is_continuation`: if the current byte is a continuation.
+- `is_second_byte`: if 1 byte back is the start of a 2-, 3- or 4-byte character.
+- `is_third_byte`: if 2 bytes back is the start of a 3- or 4-byte character.
+- `is_fourth_byte`: if 3 bytes back is the start of a 4-byte character.
+
+We use shuffles to go n bytes back, selecting part of the current `input` and part of the
+`prev_input` (search for `.prev<1>`, `.prev<2>`, etc.). These are passed in by the caller
+function, because the 1-byte-back data is used by other checks as well.
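
As a rough scalar model of what such a shuffle computes, the sketch below treats a 16-byte `std::array` as a stand-in for one SIMD register. The `block` alias and the free function `prev<N>` are assumptions made for illustration; they are not the simdjson API, which exposes this as a member (`input.prev<1>(prev_input)`).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using block = std::array<uint8_t, 16>; // hypothetical stand-in for one SIMD register

// Scalar model of "go N bytes back": lane i of the result holds the byte that
// sits N positions before lane i of `input`, reaching into the tail of
// `prev_input` for the first N lanes.
template <std::size_t N>
block prev(const block &input, const block &prev_input) {
  static_assert(N >= 1 && N <= 16, "N must fit inside one block");
  block out{};
  for (std::size_t i = 0; i < out.size(); i++) {
    out[i] = i >= N ? input[i - N] : prev_input[out.size() - N + i];
  }
  return out;
}
```

With `prev_input` holding bytes 0..15 and `input` holding bytes 16..31, `prev<1>` starts with 15 and `prev<3>` starts with 13, so every lane sees the stream shifted by exactly N bytes across the block boundary.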
+
+####   Getting the Continuation Mask
+
+Once we have the right bytes, we have to get the masks. To do this, we treat UTF-8 bytes as
+numbers, using `<` and `>` operations to check if they are continuations or leads.
+We treat the numbers as *signed*, partly because it helps us, and partly because
+Intel's SIMD presently only offers signed `<` and `>` operations (not unsigned ones).
+
+In UTF-8, bytes that start with the bits 110, 1110 and 11110 are 2-, 3- and 4-byte "leads,"
+respectively, meaning they expect to have 1, 2 and 3 "continuation bytes" after them.
+Continuation bytes start with 10, and ASCII (1-byte characters) starts with 0.
+
+When treated as signed numbers, they look like this:
+
+| Type         | High Bits  | Binary Range | Signed |
+|--------------|------------|--------------|--------|
+| ASCII        | `0`        | `01111111`   |   127  |
+|              |            | `00000000`   |     0  |
+| 4+-Byte Lead | `1111`     | `11111111`   |    -1  |
+|              |            | `11110000`   |   -16  |
+| 3-Byte Lead  | `1110`     | `11101111`   |   -17  |
+|              |            | `11100000`   |   -32  |
+| 2-Byte Lead  | `110`      | `11011111`   |   -33  |
+|              |            | `11000000`   |   -64  |
+| Continuation | `10`       | `10111111`   |   -65  |
+|              |            | `10000000`   |  -128  |
+
+This makes it pretty easy to get the continuation mask! It's just a single comparison:
+
+```
+is_continuation = input < -64
+```
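
A scalar sketch of the same comparison, under the assumption that converting a byte above 127 to `int8_t` wraps in two's complement (guaranteed since C++20, and the behavior of mainstream compilers before that). The helper names `as_signed` and `is_continuation` are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Reinterpret a raw byte as a signed 8-bit number, the way a signed SIMD
// comparison sees it (assumes two's complement wraparound on conversion).
inline int8_t as_signed(uint8_t byte) { return static_cast<int8_t>(byte); }

// Continuation bytes are 10xxxxxx, i.e. -128..-65 when read as signed.
inline bool is_continuation(uint8_t byte) { return as_signed(byte) < -64; }
```

Checking all 256 byte values against the bit pattern test `(byte & 0xC0) == 0x80` is a quick way to convince yourself that the single signed comparison is exact.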
+
+We can do something similar for the others, but it takes two comparisons instead of one: "is
+the start of a 2-byte character" is `< -32` and `> -65`, for example. And 2+ bytes is `< 0` and
+`> -65`. Surely we can do better, they're right next to each other!
+
+####  Getting the is_xxx Masks: Shifting the Range
+
+Notice *why* continuations were a single comparison. The actual *range* would require two
+comparisons--`< -64` and `> -129`--but a signed byte is never less than -128, so we get
+that for free. In fact, if we had *unsigned* comparisons, 2+, 3+ and 4+ comparisons would be
+just as easy: 4+ would be `> 239`, 3+ would be `> 223`, and 2+ would be `> 191`.
+
+Instead, we add 128 to each byte, shifting the range up to make comparison easy. This wraps
+ASCII down into the negative, and puts 4+-Byte Lead at the top:
+
+| Type                 | High Bits  | Binary Range | Signed |
+|----------------------|------------|--------------|--------|
+| 4+-Byte Lead (+ 128) | `0111`     | `01111111`   |   127  |
+|                      |            | `01110000`   |   112  |
+| 3-Byte Lead (+ 128)  | `0110`     | `01101111`   |   111  |
+|                      |            | `01100000`   |    96  |
+| 2-Byte Lead (+ 128)  | `010`      | `01011111`   |    95  |
+|                      |            | `01000000`   |    64  |
+| Continuation (+ 128) | `00`       | `00111111`   |    63  |
+|                      |            | `00000000`   |     0  |
+| ASCII (+ 128)        | `1`        | `11111111`   |    -1  |
+|                      |            | `10000000`   |  -128  |
+
+*Now* we can use signed `>` on all of them:
+
+```
+prev1 = input.prev<1>(prev_input);
+prev2 = input.prev<2>(prev_input);
+prev3 = input.prev<3>(prev_input);
+prev1_flipped = prev1 ^ 0x80; // Same as `+ 128`
+prev2_flipped = prev2 ^ 0x80; // Same as `+ 128`
+prev3_flipped = prev3 ^ 0x80; // Same as `+ 128`
+is_second_byte = prev1_flipped > 63;  // 2+-byte lead
+is_third_byte  = prev2_flipped > 95;  // 3+-byte lead
+is_fourth_byte = prev3_flipped > 111; // 4+-byte lead
+```
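
The thresholds 63, 95 and 111 can be checked exhaustively in scalar code. This is a small sketch with hypothetical helper names (`flip`, `add_128`, `is_lead_2_plus` and so on); it assumes two's complement wraparound when a value above 127 is converted to `int8_t`.

```cpp
#include <cassert>
#include <cstdint>

// Flip the top bit and reinterpret as signed; this shifts the range by 128.
inline int8_t flip(uint8_t byte) { return static_cast<int8_t>(byte ^ 0x80); }

// The same shift written as an 8-bit wrapping addition, for comparison.
inline int8_t add_128(uint8_t byte) {
  return static_cast<int8_t>(static_cast<uint8_t>(byte + 128));
}

// Signed comparisons on the flipped byte, matching the table thresholds.
inline bool is_lead_2_plus(uint8_t byte) { return flip(byte) > 63; }  // unsigned > 191
inline bool is_lead_3_plus(uint8_t byte) { return flip(byte) > 95; }  // unsigned > 223
inline bool is_lead_4_plus(uint8_t byte) { return flip(byte) > 111; } // unsigned > 239
```

Looping over all 256 byte values shows that `flip` and `add_128` agree everywhere, and that each signed comparison matches the unsigned comparison it replaces.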
+
+NOTE: we use `^ 0x80` instead of `+ 128` in the code, which accomplishes the same thing, and even takes the same number
+of cycles as `+`, but on many Intel architectures can be parallelized better (you can do 3
+`^`'s at a time on Haswell, but only 2 `+`'s).
+
+That doesn't look like it saved us any instructions, did it? Well, because we're adding the
+same number to all of them, we can save one of those `+ 128` operations by assembling
+`prev2_flipped` out of `prev1_flipped` and `prev3_flipped` instead of assembling it from
+input and adding 128 to it. One more instruction saved!
+
+```
+prev1 = input.prev<1>(prev_input);
+prev3 = input.prev<3>(prev_input);
+prev1_flipped = prev1 ^ 0x80; // Same as `+ 128`
+prev3_flipped = prev3 ^ 0x80; // Same as `+ 128`
+prev2_flipped = prev1_flipped.concat<2>(prev3_flipped); // shuffle: take the first 2 bytes from prev1 and the rest from prev3
+```
+
+####  Bringing It All Together: Detecting the Errors
+
+At this point, we have `is_continuation`, `is_second_byte`, `is_third_byte` and `is_fourth_byte`.
+All we have left to do is check if they match!
+
+```
+return (is_second_byte | is_third_byte | is_fourth_byte) ^ is_continuation;
+```
+
+But wait--there's more. The above statement is only 3 operations, but they *cannot be done in
+parallel*. You have to do 2 `|`'s and then 1 `^`. Haswell, at least, has 3 ports that can do
+bitwise operations, and we're only using 1!
+
+####  Epilogue: Addition For Booleans
+
+There is one big case the above code doesn't explicitly talk about--what if `is_second_byte`
+and `is_third_byte` are BOTH true? That means there is a 3-byte and 2-byte character right next
+to each other (or any combination), and the continuation could be part of either of them!
+Our algorithm using `|` and `^` won't detect that the continuation byte is problematic.
+
+Never fear, though. If that situation occurs, we'll already have detected that the second
+leading byte was an error, because it was supposed to be a part of the preceding multibyte
+character, but it *wasn't a continuation*.
+
+We could stop here, but it turns out that we can fix it using `+` and `-` instead of `|` and
+`^`, which is both interesting and possibly useful (even though we're not using it here). It
+exploits the fact that in SIMD, a *true* value is -1, and a *false* value is 0. So those
+comparisons were giving us numbers!
+
+Given that, if you do `is_second_byte + is_third_byte + is_fourth_byte`, under normal
+circumstances you will either get 0 (0 + 0 + 0) or -1 (-1 + 0 + 0, etc.). Thus,
+`(is_second_byte + is_third_byte + is_fourth_byte) - is_continuation` will yield 0 only if
+*both* or *neither* are 0 (0-0 or -1 - -1). You'll get 1 or -1 if they are different. Because
+*any* nonzero value is treated as an error (not just -1), we're just fine here :)
+
+Further, if *more than one* multibyte character overlaps,
+`is_second_byte + is_third_byte + is_fourth_byte` will be -2 or -3! Subtracting `is_continuation`
+from *that* is guaranteed to give you a nonzero value (-1, -2 or -3). So it'll always be
+considered an error.
+
+One reason you might want to do this is parallelism. `|` and `^` cannot be regrouped across
+each other, so (A | B | C) ^ D will always be three operations in a row: either you do
+A | B -> | C -> ^ D, or you do B | C -> | A -> ^ D. But addition and subtraction *can* be
+regrouped: (A + B + C) - D can be written as `(A + B) + (C - D)`. This means you can do
+A + B and C - D at the same time, and then add the results together. Same number of
+operations, but if the processor can run independent things in parallel (which most can), it runs faster.
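
To make the difference between the two formulations concrete, here is a scalar sketch that models one SIMD lane as an `int8_t` holding -1 for true and 0 for false. The function names `error_or_xor` and `error_add_sub` are hypothetical, and the second one uses the regrouped form `(A + B) + (C - D)` from the paragraph above.

```cpp
#include <cassert>
#include <cstdint>

// One SIMD lane modelled as int8_t: true is -1 (all bits set), false is 0.

// The `|` / `^` formulation: three dependent operations.
inline int8_t error_or_xor(int8_t second, int8_t third, int8_t fourth, int8_t cont) {
  return static_cast<int8_t>((second | third | fourth) ^ cont);
}

// The `+` / `-` formulation, regrouped so the two inner operations are independent.
inline int8_t error_add_sub(int8_t second, int8_t third, int8_t fourth, int8_t cont) {
  return static_cast<int8_t>((second + third) + (fourth - cont));
}
```

When at most one of the three "expected" masks is set, both functions agree on whether the lane is an error. When two or more are set and the byte is a continuation, `error_or_xor` returns 0 while `error_add_sub` returns a nonzero value, which is the overlap case described in this epilogue.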
+
+This doesn't help us on Intel, but might help us elsewhere: on Haswell, at least, | and ^ have
+a super nice advantage in that more of them can be run at the same time (they can run on 3
+ports, while + and - can run on 2)! This means that we can do A | B while we're still doing C,
+saving us the cycle we would have earned by using +. Even more, using an instruction with a
+wider array of ports can help *other* code run ahead, too, since these instructions can "get
+out of the way," running on a port other instructions can't.
+
+####  Epilogue II: One More Trick
+
+There's one more relevant trick up our sleeve: on Intel we can "pay
+for" the (prev<1> + 128) instruction, because it can be used to save an instruction in
+check_special_cases()--but we'll talk about that there :)
+
+
+
+
 ## About the Project
 
 ### Bindings and Ports of simdjson
@@ -420,6 +627,8 @@ make allparsingcompetition
 Both the `parsingcompetition` and `allparsingcompetition` tools take a `-t` flag which produces
 a table-oriented output that can be conveniently parsed by other tools.
 
+
+
 ### Various References
 
 - [Google double-conv](https://github.com/google/double-conversion/)
 
@@ -1,4 +1,4 @@
-/* auto-generated on Wed May 20 10:23:07 EDT 2020. Do not edit! */
+/* auto-generated on Thu 21 May 2020 14:01:15 EDT. Do not edit! */
 
 #include <iostream>
 #include "simdjson.h"
 
@@ -1,4 +1,4 @@
-/* auto-generated on Wed May 20 10:23:07 EDT 2020. Do not edit! */
+/* auto-generated on Thu 21 May 2020 14:01:15 EDT. Do not edit! */
 /* begin file src/simdjson.cpp */
 #include "simdjson.h"
 
@@ -3180,10 +3180,10 @@ namespace utf8_validation {
   // This algorithm compares *expected* continuation characters with *actual* continuation bytes,
   // and emits an error anytime there is a mismatch.
   //
-  // For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
+  // For example, in the string "ab", which has a 4-, 3-, 2- and 1-byte
   // characters, the file will look like this:
   //
-  // | Character             | 𝄞  |    |    |    | ₿  |    |    | ֏  |    | a  | b  |
+  // | Character             |   |    |    |    |   |    |    |   |    | a  | b  |
   // |-----------------------|----|----|----|----|----|----|----|----|----|----|----|
   // | Character Length      |  4 |    |    |    |  3 |    |    |  2 |    |  1 |  1 |
   // | Byte                  | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
@@ -4049,10 +4049,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
   // If you consume a large value and you map it to "infinity", you will no
   // longer be able to serialize back a standard-compliant JSON. And there is
   // no realistic application where you might need values so large than they
-  // can't fit in binary64. The maximal value is about  1.7976931348623157 ×
+  // can't fit in binary64. The maximal value is about  1.7976931348623157 x
   // 10^308 It is an unimaginable large number. There will never be any piece of
   // engineering involving as many as 10^308 parts. It is estimated that there
-  // are about 10^80 atoms in the universe.  The estimate for the total number
+  // are about 10^80 atoms in the universe.  The estimate for the total number
   // of electrons is similar. Using a double-precision floating-point value, we
   // can represent easily the number of atoms in the universe. We could  also
   // represent the number of ways you can pick any three individual atoms at
@@ -5872,10 +5872,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
   // If you consume a large value and you map it to "infinity", you will no
   // longer be able to serialize back a standard-compliant JSON. And there is
   // no realistic application where you might need values so large than they
-  // can't fit in binary64. The maximal value is about  1.7976931348623157 ×
+  // can't fit in binary64. The maximal value is about  1.7976931348623157 x
   // 10^308 It is an unimaginable large number. There will never be any piece of
   // engineering involving as many as 10^308 parts. It is estimated that there
-  // are about 10^80 atoms in the universe.  The estimate for the total number
+  // are about 10^80 atoms in the universe.  The estimate for the total number
   // of electrons is similar. Using a double-precision floating-point value, we
   // can represent easily the number of atoms in the universe. We could  also
   // represent the number of ways you can pick any three individual atoms at
@@ -8142,10 +8142,10 @@ namespace utf8_validation {
   // This algorithm compares *expected* continuation characters with *actual* continuation bytes,
   // and emits an error anytime there is a mismatch.
   //
-  // For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
+  // For example, in the string "ab", which has a 4-, 3-, 2- and 1-byte
   // characters, the file will look like this:
   //
-  // | Character             | 𝄞  |    |    |    | ₿  |    |    | ֏  |    | a  | b  |
+  // | Character             |   |    |    |    |   |    |    |   |    | a  | b  |
   // |-----------------------|----|----|----|----|----|----|----|----|----|----|----|
   // | Character Length      |  4 |    |    |    |  3 |    |    |  2 |    |  1 |  1 |
   // | Byte                  | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
@@ -9015,10 +9015,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
   // If you consume a large value and you map it to "infinity", you will no
   // longer be able to serialize back a standard-compliant JSON. And there is
   // no realistic application where you might need values so large than they
-  // can't fit in binary64. The maximal value is about  1.7976931348623157 ×
+  // can't fit in binary64. The maximal value is about  1.7976931348623157 x
   // 10^308 It is an unimaginable large number. There will never be any piece of
   // engineering involving as many as 10^308 parts. It is estimated that there
-  // are about 10^80 atoms in the universe.  The estimate for the total number
+  // are about 10^80 atoms in the universe.  The estimate for the total number
   // of electrons is similar. Using a double-precision floating-point value, we
   // can represent easily the number of atoms in the universe. We could  also
   // represent the number of ways you can pick any three individual atoms at
@@ -11254,10 +11254,10 @@ namespace utf8_validation {
   // This algorithm compares *expected* continuation characters with *actual* continuation bytes,
   // and emits an error anytime there is a mismatch.
   //
-  // For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
+  // For example, in the string "ab", which has a 4-, 3-, 2- and 1-byte
   // characters, the file will look like this:
   //
-  // | Character             | 𝄞  |    |    |    | ₿  |    |    | ֏  |    | a  | b  |
+  // | Character             |   |    |    |    |   |    |    |   |    | a  | b  |
   // |-----------------------|----|----|----|----|----|----|----|----|----|----|----|
   // | Character Length      |  4 |    |    |    |  3 |    |    |  2 |    |  1 |  1 |
   // | Byte                  | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
@@ -12130,10 +12130,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
   // If you consume a large value and you map it to "infinity", you will no
   // longer be able to serialize back a standard-compliant JSON. And there is
   // no realistic application where you might need values so large than they
-  // can't fit in binary64. The maximal value is about  1.7976931348623157 ×
+  // can't fit in binary64. The maximal value is about  1.7976931348623157 x
   // 10^308 It is an unimaginable large number. There will never be any piece of
   // engineering involving as many as 10^308 parts. It is estimated that there
-  // are about 10^80 atoms in the universe.  The estimate for the total number
+  // are about 10^80 atoms in the universe.  The estimate for the total number
   // of electrons is similar. Using a double-precision floating-point value, we
   // can represent easily the number of atoms in the universe. We could  also
   // represent the number of ways you can pick any three individual atoms at