Skip to content

Isolated jkeiser fix for issue 1632: make it so that INCORRECT_TYPE is a recoverable condition in On Demand#1663

Merged
lemire merged 18 commits intomasterfrom
dlemire/backport_jkeiser_fix_for_issue1632
Jul 23, 2021
Merged

Isolated jkeiser fix for issue 1632: make it so that INCORRECT_TYPE is a recoverable condition in On Demand#1663
lemire merged 18 commits intomasterfrom
dlemire/backport_jkeiser_fix_for_issue1632

Conversation

@lemire
Copy link
Copy Markdown
Member

@lemire lemire commented Jul 19, 2021

In issue 1632, @jkeiser referred to two commits in an unrelated PR which constitute a fix for issue 1632. I have taken these two commits and turned them into a specific pull request.

To recap, the problem is that if you have a bad JSON input, and if you somehow ask for a bad type (e.g., as for an object where there is no {, or a boolean where there is false or true), then the front-end gets confused. The workaround is to ask for the type before, as a precaution. It works because asking for the type does not advance the iterator... but asking for a value does advance the pointer even when the type is not right.

This PR changes this behavior.

I have had to extend the work done by @jkeiser in a non-trivial manner. My change is rather intrusive, more so that I'd like... but it felt right.

The original On Demand front-end always advance if you are asking for a value (array, object, bool, double) and you are just before it. If you ask for a Boolean value, it will look at the current character, decide whether it is a Boolean or not, and then move forward (whether it is a Boolean or not).

I found this quite confusing when dealing with errored document. So I have moved the code to a peek/advance model.

The general idea of the advance_... methods and the peek_* methods is that you first peek and check that you have desired type. If you do, and only if you do, then you advance.

Suppose you always advance. Look at the 'value' matching the key "shadowable" in the following example...
({"globals":{"a":{"shadowable":[}}}})
If the user thinks it is a Boolean and asks for it, then we check the '[', decide it is not a Boolean, but still move into the next character ('}'). Now we are left pointing at '}' right after a '['. And we have not yet reported an error, only that we do not have a Boolean.

If, instead, you just stand your ground until it is content that you know, then you will only even move beyond the '[' if the user tells you that you have an array. So you will be at the '}' character inside the array and, hopefully, you will then catch the error because an array cannot start with '}', but the code processing Boolean values does not know this.

So the contract is: first call 'peek_...' and then call 'advance_...' only if you have determined that it is a type you can handle.

Unfortunately, it makes the code more verbose, longer and maybe more error prone. And there might be (???) a downside performance-wise. It should be streamlined, if not now, then in the future.

This PR also integrates elements of @jkeiser 's code that are meant to do away with input padding but I think it is relatively minor and the offending parts were commented out with explanations.

Fixes #1664
Fixes #1632

Importantly, this is the basis for a new PR that would lift the padding requirement #1665

@lemire
Copy link
Copy Markdown
Member Author

lemire commented Jul 19, 2021

Turns out that @jkeiser fix only partially resolves the issue. I need to do more work. (Work done.)

@lemire lemire requested a review from jkeiser July 19, 2021 21:15
@lemire
Copy link
Copy Markdown
Member Author

lemire commented Jul 19, 2021

@strager It is possible that this PR could resolve your issues. If you could check it out, that would be great. I have included your tests as well as extensions of your tests.

@lemire
Copy link
Copy Markdown
Member Author

lemire commented Jul 19, 2021

@jkeiser For obvious reasons, I would really like your input on this since it is yet another rather large change deep into the front-end. I do think, however, that it is an easy to understand change.

@lemire lemire added this to the 1.0 milestone Jul 19, 2021
@lemire lemire added the on demand Related to simdjson::ondemand functionality label Jul 19, 2021
@lemire lemire changed the title Isolated jkeiser fix for issue 1632 Isolated jkeiser fix for issue 1632: make it so that One way to understand the current simdjson (prior to our current attempts) is that INCORRECT_TYPE is a recoverable condition in On Demand Jul 19, 2021
@lemire lemire changed the title Isolated jkeiser fix for issue 1632: make it so that One way to understand the current simdjson (prior to our current attempts) is that INCORRECT_TYPE is a recoverable condition in On Demand Isolated jkeiser fix for issue 1632: make it so that INCORRECT_TYPE is a recoverable condition in On Demand Jul 19, 2021
@strager
Copy link
Copy Markdown
Contributor

strager commented Jul 20, 2021

This patch works for my application. 👍

@lemire
Copy link
Copy Markdown
Member Author

lemire commented Jul 20, 2021

@strager The feedback is much appreciated.

Please keep finding bugs!!! :-)

@lemire
Copy link
Copy Markdown
Member Author

lemire commented Jul 23, 2021

I am going to merge this eagerly. We can revert some of the changes following feedback by @jkeiser later.

@lemire lemire merged commit 47a62db into master Jul 23, 2021
@lemire lemire deleted the dlemire/backport_jkeiser_fix_for_issue1632 branch July 23, 2021 15:32
@jkeiser
Copy link
Copy Markdown
Member

jkeiser commented Aug 4, 2021

I never verified whether peek/next or advance made any difference performance wise, so either model is fine (I switched back and forth several times, in fact). Did you happen to see a performance change after this?

@lemire
Copy link
Copy Markdown
Member Author

lemire commented Aug 4, 2021

@jkeiser

There is potentially a large performance drop since v0.9.

v0.9.7 current main branch loss
partial_tweets 3.13 GB/s 2.95 GB/s 6%
 large_random 0.80 GB/s 0.73 GB/s 10%
kostya 2.23 GB/s 2.11 GB/s 6%
distinct_user_id 3.64 GB/s 2.95 GB/s 25%
find_tweet 4.84 GB/s 4.79 GB/s 1%
 top_tweet 3.44 GB/s 3.43 GB/s 0%

I will open an issue.

@lemire
Copy link
Copy Markdown
Member Author

lemire commented Aug 4, 2021

@jkeiser Note that we need a solution. We can't just advance always.

But we certainly should study the performance issue.

@lemire
Copy link
Copy Markdown
Member Author

lemire commented Aug 4, 2021

@jkeiser This PR ( 47a62db ) is marked in this plot I just generated... It does not look like it caused a regression on its own.

Screen Shot 2021-08-04 at 4 43 58 PM

@lemire
Copy link
Copy Markdown
Member Author

lemire commented Aug 4, 2021

Interactive plot at https://simdjson.org/plots.html

@lemire
Copy link
Copy Markdown
Member Author

lemire commented Aug 4, 2021

Did you happen to see a performance change after this?

The answer is no, I think. Our performance has evolved, but this particular PR did not lead to a performance drop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

on demand Related to simdjson::ondemand functionality

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Typo in an internal comment INCORRECT_TYPE can mask actual structure problems

3 participants