bpo-46258: Streamline isqrt fast path #30333
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR makes some minor simplifications to the implementation of
math.isqrt. Those simplifications improve the speed ofmath.isqrtfor inputs smaller than2**64by around 20% on my machine.In detail:
_approximate_sqrt, use a lookup table based on the topmost 8 bits to replace the first two Newton iterations. This reduces the total number of divisions needed from 4 to 2._approximate_sqrtfromuint64_ttouint32_t.u * u - 1U >= mtest with the simpler but equivalent testu * u > m._approximate_sqrt(though I'd expect most compilers not to need this)._approximate_sqrtbe inlined.On my Intel x64 machine, the assembly produced by Clang involves one
divlinstruction and onedivqinstruction.There's one subtle but important change here: in the previous implementation of
_approximate_sqrt, it was possible for the returned value to be exactly2**32(but no larger), so we couldn't assume that it would fit in auint32_t. With the new code, the returned value is always< 2**32. It's this fact that makes it possible to change the return type, and to replace theu*u - 1 >= mtest withu*u > m. To prove this bound: if we ignore overflow for the moment, since the result of_approximate_sqrtis always within1of the true square root (following the proof already outlined in the comments), the only way that the result of the final addition can be2**32(overflowing to0) is if the inputnexceeds(2**32 - 1)**2. But in that case we know most of the top bits ofn: the top 32 bits are either0xfffffffeor0xffffffff, and so we can trace through the first steps of_approximate_sqrt- the value ofuretrieved from the lookup table is255, then the value assigned touin the following line is65536. Then it's easy to see that ifu = 65536,(n >> 17 / u) < 2**31andu << 15 = 2**31, so the result of the addition fits in auint32_t.https://bugs.python.org/issue46258