Improve 32-bit log standard math functions #2937
Conversation
Please note I used the following code to test the implementation. It may be useful for another use case (e.g., other math functions) or possibly for tests. It is quite convenient to check all possible float values (since tests based on random numbers are not really reliable).
Thank you for your contribution! I suggest splitting this PR into two because it addresses two different problems. To make it look neat, I suggest formatting the code (see the output of this job) and formatting the commit message as follows: the first line is a short description, e.g.,
Could you wrap your test code into a trivial microbenchmark, similar to what we have here: ISPC Benchmarks? This will allow us to retain and integrate your testing code effectively.
Does your change utilize Sleef algorithms in any way?
We already have it in ispc/benchmarks/01_trivial/06_math.ispc (line 88 in ff58297).
dbabokin left a comment:
Please let us know if you need help fixing formatting issues - we use clang-format version 12 for formatting. I can push the fix to your branch, if you'd like.
For "copyright check" failure - it's not related to your changes, just rebase to main to fix that.
```diff
  float scaled = x_full * one_over_ln2;
  float k_real = floor(scaled);
- int k = (int)k_real;
+ int k = (int)k_real; // Be careful: float to int conversions are UB if the float value is too huge
```
I understand the comment in general, but I don't understand what useful information it brings. It has no call to action; i.e., if the overflow happens, then it happens. It seems there's nothing we can fix here, right? And I don't think we are going to tune this algorithm.
Please correct me if I get this wrong.
I added this comment because it was the source of the initial bug: AFAIR we used the k variable in a case where it was UB because of the overflow. So I fixed that by using k_real directly instead, and I found it useful to mention this UB to warn future readers about the problem (so they do not repeat the same mistake). Maybe it is not needed; I can remove the comment if you think so.
I don't think that we are going to hack on this algorithm in the future. So this comment belongs more in the commit description than in the code itself. Let's remove it.
```cpp
int exponent;
// ...
const int NaN_bits = 0x7fc00000;
const int Neg_Inf_bits = 0xFF800000;
```
Where is the log algorithm coming from? Possibly we need to put a reference to the origin of the algorithm to make sure we are not breaking a license.
I know that algorithms are not copyrightable, only a specific implementation is. But we'd like to understand the origin and give credit.
This is my own modification of the initial ISPC algorithm. I fixed the special values from the initial algorithm and then generated a new lower-order polynomial (with a Python optimization script I wrote myself), since it is a bit faster and the precision is enough thanks to the correction close to log(1). Finally, I added a correction based on the Taylor series at log(1). I checked the functions/errors with the great GeoGebra tool.
Thus, the origin is me and no license is being broken :).
That's impressive to say the least!
Two things then:
- Let's put an implementation note before the algorithm to explain how you came up with it and what its precision characteristics are. This information definitely belongs in the code, as we might find bugs in it, consider improvements, etc., and understanding the background is very important in this case.
- Have you "shopped around" for alternatives? I.e., have you compared its precision and performance with the SVML and Sleef alternatives? It would be good to understand where it stands compared to other proven implementations. If needed we can borrow algorithms from both libraries; it just needs to be done properly (license, copyright, etc.).
Thank you.
- There was already a comment in the code about the precision (in ULPs). I added information about how the method works.
- I have not used Sleef yet, but it seems interesting/promising. I'm going to give it a try, but I think this can be done later, independently of this PR.
I finally performed a quick check/benchmark with Sleef (the one available in the Ubuntu 24.04 LTS system packages), since it is relatively simple to use in practice (as long as we use the standard C interface and not the LLVM IR one, which is apparently not documented).
The C interface might be slightly slower since there is a function call. However, the Sleef functions are not masked, so adding masking introduces a slight overhead. In the end, the two might cancel out, so the performance may be representative of what we could get with a direct IR interface (if any).
Sleef provides two groups of functions: ones with a 1 ULP error (typically due to rounding), and ones with a 3.5 ULP error (let's say 4). The former can be significantly slower (up to 10 times slower) than the latter, and I think 4 ULP is a good reference for a project like ISPC where people focus on performance (and also because the current ISPC math log/exp functions have a 4 to 10 ULP error with the new PRs). Not to mention the performance of the 4 ULP functions is more stable. See the benchmarks for more information: https://sleef.org/nontrigsp.png and https://sleef.org/trigsp.png.
I wrote a "corrected" implementation (not pushed) so as to get a <5 ULP error (instead of <10 ULP), in order not to mix apples and oranges when comparing with Sleef. It simply uses a 9-degree polynomial instead of the current 8-degree one.
Regarding performance, on my machine (Zen2), I get (in clocks, for the entire 32-bit float range):
- This PR implementation (default math): 2.11e9
- Corrected PR implementation: 2.26e9
- Sleef avx2-i32x8 with C interface and no masking: 2.73e9

Thus, in this case, Sleef is about 30% slower than the less-precise PR implementation and about 20% slower than the similarly-precise corrected PR implementation :)!
Part of the overhead might come from the C call, but probably not all of it.
It shows that this PR implementation is pretty good, since it is competitive with Sleef (which is itself competitive with SVML based on the provided Sleef benchmarks).
Overall, aside from this specific log implementation, IMHO Sleef is pretty interesting for ISPC since it provides:
- (pretty good) bounded precision and support for special numbers (e.g., NaN, Inf, subnormals, etc.)
- competitive performance
- certainly more thoroughly tested functions than the ISPC ones
- (fast) double-precision functions
- a flexible license, certainly compatible with ISPC's
- an alternative implementation (useful when an implementation has issues)
Also, I measured the impact of the changes on an Apple M3 chip.
No. This is my own modification.
OK, so in the end there is nothing to add to the microbenchmark, is there?
Yep, no changes.
I improved the speed of the implementation.
I tried to understand more precisely what the numerical error was, so as to use a more efficient approach for the correction close to x=1, and it worked well. The new approach is not only faster; in the end, the new code is only 8% slower than the initial one. I also fixed the formatting issues.
Force-pushed from f60d900 to 57ab827.
It should be fine now and ready to be merged :)!
It looks like if we move the line as in this diff:

```diff
@@ -3477,6 +3477,7 @@ __declspec(safe) static inline float log(float x_full) {
     const bool use_nan = !(x_full >= 0.);
     const bool use_inf = x_full == 0. || x_full == inf;
     const bool exceptional = use_nan || use_inf;
+    const float exceptional_result = select(use_nan, NaN, select(x_full == inf, x_full, -inf));
     const float one = 1.0;
     const float patched = select(exceptional, one, x_full);
@@ -3509,7 +3510,6 @@ __declspec(safe) static inline float log(float x_full) {
     result = log_from_exponent - x * result;
-    const float exceptional_result = select(use_nan, NaN, select(x_full == inf, x_full, -inf));
     return select(exceptional, exceptional_result, result);
 }
```

then no spills are generated. Could you try this?
In practice, there are 6 spills in the current code and 7 spills if I move the line. The x16 target code is also 1% slower (not really significant). The x8 version seems unaffected (as expected). Thus, it unfortunately does not help on my machine. By the way, note that I saw different results between Godbolt and my local ISPC (with no other change than this PR), and I do not know why yet. Did you use that to check spilling?

I expected LLVM to track the dependency and perform a good register allocation so as to reduce spilling as much as possible. In practice, such problems are not rare in ISPC with double pumping. That being said, here it is difficult for LLVM to produce good code (except on AVX-512, thanks to the cheap broadcasting embedded in instructions), given the high number of constants to load and the number of variables. Still, there is room for improvement, since the x16 target is significantly slower than the x8 target, and I think LLVM could theoretically just reorder most instructions so as to operate on the first 8 items and then on the second group. This is mostly the case, but not totally. However, in practice, the boolean operations result in fused instructions, partially preventing such an optimization. This is a bit complicated: I think merging can be better when people do a lot of boolean operations on SSE/AVX/AVX2/Neon targets, while it is not a good idea otherwise, except maybe in pathological cases (maybe it can reduce register pressure). I am not sure ISPC can make a good decision here, while LLVM certainly can (thanks to a global optimization step). I am clearly not an expert on this, so I may completely misunderstand how such things work or how they should be improved. I think this problem is related to this issue. Still, I am not sure this is the only issue here.

For example, the generated instructions are sub-optimal when spilling happens: here we can see a useless reload that could be replaced with a simpler instruction. I think spilling can be responsible for up to a 30% slowdown on my machine, so it may not fully explain the performance drop of the double-pumped version...
I used my local build, but the generated code looks similar to Compiler Explorer's code. I suggest checking which LLVM version you use. Release ISPC and trunk ISPC on Compiler Explorer are built with the patched LLVM (see the llvm_patches dir in the repo). But I don't see a difference on my machine between patched LLVM 17 and non-patched LLVM 18 for this example.
TL;DR: the overhead of the double-pumped target mostly comes from the approach/fix itself. I profiled the double-pumped implementation more intensively to better understand what is happening on Zen2 with this PR. The code is bound by the number of instructions/uops, which is about 50% higher with the PR. Since the IPC is good both before and after the modification (~2 FP uops/cycle), a 50% slowdown makes sense. Here are a few differences between the two versions for this target.
While the generated assembly code can certainly be slightly improved, most of the overhead directly comes from the approach/fix, so I think there is not much to do (besides possibly using another, better approach, if any).
OK. I use commit 9e74945 @ 20240806 and LLVM 18.1.3. I think I use a non-patched LLVM (from the Ubuntu system packages, unless the Ubuntu developers patched it).
nurmukhametov left a comment:
I measured performance on both the GameDev tests and the microbenchmarks. In both cases, avx2-i32x8 performance is 20-25% worse (for tests using log). So, it looks like we sacrifice performance for precision on avx2-i32x8 as well.
```cpp
    return z + 0.693359375 * fe;
} else if (__math_lib == __math_lib_ispc) {
    // ...
    // Precision: <10 ULP (for all input values).
```
What about ULP for x=1.120036 (0x3f8f5d57)?
The relative-error formula gives 7.88694e-7, which looks like <10 ULP at first glance, but as we discussed by mail, the formula is not very precise and the number of ULPs can be a bit bigger than that. In practice, the result is 0x3de829c5 instead of 0x3de829b9 for the truncated value. This means an error of 12 ULP, so the error bound is not correct (because of the imprecise formula used to check results).
If I modify the checking code to compute the error directly in ULPs with the formula abs((uniform int64)intbits(res) - (uniform int64)intbits(dres)), then I get a maximum error of 12 ULP for normal numbers. This maximum error is reached for values close to the one you mention. Note that the error is maximal for such values because of the problem mentioned above in this PR, which is present in the current ISPC code too (though there the error is much bigger).
This means the actual error bound should not be "<10 ULP" but 12 ULP. Let's say "<15 ULP" to be safe.
For subnormal numbers (that is, input values close to zero and outputs close to -inf), the error is significantly bigger. I expected the mantissa to be >0.5, but I think this is not true for subnormal numbers; I need to check that. If this is the case, then fixing subnormal numbers requires the polynomial approximation to be defined on a wider range, which will reduce the precision for other numbers (unless we use a higher-order polynomial, which is more expensive).
For subnormal numbers, it turns out the problem was already present in the current ISPC implementation, so this is not a problem introduced by this PR. It can be fixed/mitigated, but again at the expense of a slower implementation. Note the error can be pretty significant for small subnormal numbers. For example, for 1.20206185e-39 (0xd16dc), the result should be -89.616783 while it is -87.932335 (and -87.932327 for the current ISPC implementation, which is not really better). I think this is certainly due to the __range_reduce_log function.
Tell me if you want to improve the precision for subnormal numbers (certainly at the expense of an even slower implementation).
A simple fix makes the code another 50% slower (which is pretty expensive just for subnormal numbers). As a result, it becomes a bit slower than the Sleef implementation, so I think there is certainly a more efficient approach for the overall function.
> This means the actual error bound should not be "<10 ULP" but 12 ULP. Let's say "<15 ULP" to be safe.

This is aligned with my estimation.
Yes. Note that the results are dependent on the target architecture. Out of curiosity, on which architecture/CPU did you run this benchmark?
… its precision close to x=1

Fix invalid results for 32-bit log:
- Fix missing support for NaN and Inf in log.
- Fix a numerical instability for input values close to x=1 in log (more specifically, values in the range 1-1.25). The precision is now about 10_000 times better in this case. It should improve exp too.
- Small performance drop on AVX2, except for double pumping (~8% for avx2-i32x8 and 50% for avx2-i32x16 on Zen2).
Force-pushed from 57ab827 to 1026cc1.
I tested on an Intel i9-12900 (GameDev tests) and an AMD 7840HS (microbenchmarks).
I am closing this PR since it is superseded by the recent new one.
Hello,

This PR fixes a few issues in the 32-bit implementations of exp and log (for the default ISPC math mode). Most of the issues have been mentioned here. More specifically, this PR:

- fixes the exp implementation, which previously produced bad results for -Inf, NaN and big numbers;
- adds support for NaN and Inf in log;
- fixes a numerical instability for input values close to x=1 in log (more specifically, values in the range 1-1.25). The results were initially very inaccurate and are now about 1000 times more precise in this case. This was especially a serious issue when pow was called with values like pow(1.001, 2000) or pow(1.01, 287.5).

Regarding exp, I did not notice a significant performance impact.

Regarding log, supporting NaN had a minor impact on performance. Surprisingly, checking infinity made the code nearly 50% slower for no apparent reason (just a || x_full == inf) on the avx2-i32x16 target, but had no significant impact on the avx2-i32x8 one. I guess this is certainly because of register spilling (due to the many constants), especially since the double-pumped target is now slower. The fix of the numerical instability also has a significant impact on performance. In the end, all the modifications resulted in about a 45% performance loss with the avx2-i32x8 target and about 75% with the avx2-i32x16 target on my (Zen2) laptop. This should only impact the default ISPC math version, not the fast one. Thus, overall, the log function is significantly more reliable/accurate, but this improvement is not free.

The precision can be improved further, but I do not think it is worth the additional performance overhead: 5-10% to get about 30 ULPs of precision on input numbers in the range 1.0-1.25, instead of about 120 ULPs in this PR (and even 10_000-100_000 ULPs currently). Let me know if you want to increase the precision further.