
Improve 32-bit log standard math functions#2937

Closed
zephyr111 wants to merge 1 commit intoispc:mainfrom
zephyr111:better_math_funcs

Conversation

@zephyr111
Collaborator

@zephyr111 zephyr111 commented Aug 7, 2024

Hello,

This PR fixes a few issues in the 32-bit implementations of exp and log (for the default ISPC math mode). Most of the issues have been mentioned here. More specifically, this PR:

  • Fixes an overflow in the exp implementation that previously produced bad results for -Inf, NaN and big numbers.
  • Fixes the missing support for NaN and Inf in log.
  • Fixes a numerical instability in log for input values close to 1 (more specifically, values in the range 1-1.25). The results were initially very inaccurate and are now about 1000 times more precise in this case. This was an especially serious issue when pow was called with values like pow(1.001, 2000) or pow(1.01, 287.5).
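The pow amplification mentioned above follows from pow(x, y) = exp(y * log(x)): an absolute error e in log(x) becomes a relative error of roughly y*e in the result. A small Python sketch of the effect (the 1e-7 error value is purely illustrative, not the actual ISPC log error):

```python
import math

# Hypothetical: inject a 1e-7 absolute error into log(x), as a stand-in
# for an inaccurate 32-bit log implementation near x = 1.
x, y = 1.001, 2000.0
exact = math.exp(y * math.log(x))
approx = math.exp(y * (math.log(x) + 1e-7))  # pow via a slightly wrong log
rel_err = abs(approx - exact) / exact
# y * 1e-7 = 2e-4: the error in log is magnified by a factor of ~2000
```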

Regarding exp, I did not notice a significant performance impact.

Regarding log, supporting NaN had a minor impact on performance. Surprisingly, checking for infinity made the code nearly 50% slower for no apparent reason (just a || x_full == inf) on the avx2-i32x16 target, but had no significant impact on avx2-i32x8. I suspect this is due to register spilling (caused by the many constants), especially since the double-pumped target is now the slower one. The fix for the numerical instability also has a significant performance impact. In the end, all the modifications result in about a 45% performance loss with the avx2-i32x8 target and about 75% with avx2-i32x16 on my (Zen2) laptop. This only impacts the default ISPC math version, not the fast one. Thus, overall, the log function is significantly more reliable/accurate, but this improvement is not free.

The precision could be improved further, but I do not think it is worth the additional performance overhead: another 5-10% to get about 30 ULPs of precision on input numbers in the range 1.0-1.25, instead of about 120 ULPs in this PR (and 10_000-100_000 ULPs currently). Let me know if you want to increase the precision further.

@zephyr111
Collaborator Author

Please note I used the following code to test the exp and log functions:

void test_ext(uniform float x, uniform bool verbose)
{
    const uniform float res = log(x);
    const uniform float dres = (uniform float)log((uniform double)x);
    const uniform float inf = floatbits(0x7F800000);

    if(verbose)
        print("%: %\n", x, res);

    if(isnan(res) != isnan(dres))
        print("ERROR: NaN mismatch: x=%\n", x);
    else if(res != dres && (abs(res) == inf || abs(dres) == inf))
        print("ERROR: Inf mismatch: x=% res=% dres=%\n", x, res, dres);
    else if(res != dres && abs(res-dres)/dres >= 130e-7) // Usual error check: >= 4~5 ULP
        print("ERROR: value mismatch: x=% res=% dres=% error_x1000=%\n", x, res, dres, abs(res-dres)*1000/dres);

    if(verbose)
        print("\n");
}

void test(uniform float x)
{
    test_ext(x, true);
}

extern "C" uniform int main()
{
    uniform bool full_check = true;
    uniform bool measure_perf = true;

    uniform float pinf  = floatbits(0x7F800000);
    uniform float ninf  = floatbits(0xFF800000);
    uniform float pnan  = floatbits(0x7FC00000);
    uniform float nnan  = floatbits(0xFFC00000);
    uniform float pnan2 = floatbits(0x7FC00123);

    test(0.0f);
    test(-0.0f);
    test(1.0f);
    test(1.0000005f); // very imprecise for initial log
    test(1.0000010f);
    test(1.0000011f); // very imprecise for initial log
    test(1.0000012f);
    test(-1.0f);
    test(-10.0f);
    test(10.0f);
    test(1.15e+5);
    test(1.15e+36);
    test(1.15e+42);
    test(1.15e-36);
    test(pinf);
    test(ninf);
    test(pnan);
    test(nnan);
    test(pnan2);

    if(full_check)
    {
        uniform int64 maxi = ((uniform int64)1) << 32;
        for(uniform int64 i=0; i<maxi; ++i)
        {
            test_ext(floatbits((uniform int32)i), false);

            if(i % 100000000 == 0)
                print("%/100\n", i*100/maxi);
        }
    }

    if(measure_perf)
    {
    uniform int32 maxi = (((uniform int64)1) << 31) - 1; // skip negative numbers and a few NaNs to avoid overflow
        float checksum = 0.0;
        const uniform int64 startTime = clock();

        foreach(i=0...maxi)
            checksum += log(floatbits(i));

        const uniform int64 endTime = clock();
        print("checksum: %, clock: %e9\n", reduce_add(checksum), (uniform float)(endTime-startTime)*1e-9);
    }
    return 0;
}

It may be useful for other use-cases (e.g. other math functions) or possibly for tests. It is quite convenient for checking all possible float values (since tests based on random numbers are not really reliable).
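For reference, the bit-pattern reinterpretation used by floatbits above can be reproduced outside ISPC; a minimal Python equivalent (the helper name floatbits simply mirrors the ISPC intrinsic):

```python
import math
import struct

def floatbits(i):
    """Reinterpret a 32-bit integer bit pattern as an IEEE-754 binary32."""
    return struct.unpack('<f', struct.pack('<I', i & 0xFFFFFFFF))[0]

# The same special bit patterns as in the ISPC test above
pinf = floatbits(0x7F800000)   # +Inf
ninf = floatbits(0xFF800000)   # -Inf
pnan = floatbits(0x7FC00000)   # quiet NaN
```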

@nurmukhametov
Collaborator

Thank you for your contribution!

I suggest splitting this PR into two because it addresses two different problems.

To make it look neat, I suggest formatting the code (see the output of this job) and formatting the git message as follows: the first line is a short description, e.g., stdlib: fix overflow in exp(float), then an empty line, followed by a detailed description of any length, formatted to a width of 80 characters.

@aneshlya
Collaborator

aneshlya commented Aug 7, 2024

Could you wrap your test code into a trivial microbenchmark, similar to what we have here: ISPC Benchmarks? This will allow us to retain and integrate your testing code effectively.

@aneshlya
Collaborator

aneshlya commented Aug 7, 2024

Does your change utilize Sleef algorithms in any way?

@dbabokin
Collaborator

dbabokin commented Aug 8, 2024

Could you wrap your test code into a trivial microbenchmark, similar to what we have here: ISPC Benchmarks? This will allow us to retain and integrate your testing code effectively.

We already have it for pow, log, and exp:

export void exp_##T(uniform T *uniform src, uniform T *uniform dst, uniform int count) { \

Collaborator

@dbabokin dbabokin left a comment


Please let us know if you need help fixing formatting issues - we use clang-format version 12 for formatting. I can push the fix to your branch, if you'd like.

For "copyright check" failure - it's not related to your changes, just rebase to main to fix that.

Comment thread stdlib.ispc Outdated
float scaled = x_full * one_over_ln2;
float k_real = floor(scaled);
int k = (int)k_real;
int k = (int)k_real; // Be careful: float-to-int conversions are UB if the float value is too huge
Collaborator


I understand the comment in general, but I don't understand what useful information it brings. It has no call to action - i.e. if the overflow happens, then it happens. It seems there's nothing we can fix here, right? And I don't think we are going to tune this algorithm.

Please correct me if I get this wrong.

Collaborator Author


I added this comment because it was the source of the initial bug: AFAIR, we used the k variable in a case where it was UB because of the overflow. So I fixed that by using k_real directly instead, and found it useful to mention this UB to warn future readers about the problem (so they do not repeat the same mistake). Maybe it is not needed. I can remove the comment if you think so.

Collaborator


I don't think that we are going to hack on this algorithm in the future. So this comment belongs more in the commit description than in the code itself. Let's remove it.
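For context on the UB being discussed: in C/ISPC, converting a float to int32 is undefined when the value does not fit (on x86, cvttss2si returns the 0x80000000 sentinel). For exp, the scaled value leaves the int32 range long before x itself becomes infinite, which is easy to check numerically (Python used here only for the arithmetic):

```python
import math

INT32_MAX = 2**31 - 1
# A large but finite input to exp()...
x = 3.0e9
# ...whose scaled value k_real = floor(x / ln 2) no longer fits in int32,
# so `(int)k_real` in C/ISPC would be undefined behavior.
k_real = math.floor(x / math.log(2.0))
```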

Comment thread stdlib.ispc
int exponent;

const int NaN_bits = 0x7fc00000;
const int Neg_Inf_bits = 0xFF800000;
Collaborator


Where is the log algorithm coming from? Possibly we need to put a reference to the origin of the algorithm to make sure we are not breaking a license.

I know that algorithms are not copyrightable, only a specific implementation is. But we'd like to understand the origin and give credit.

Collaborator Author

@zephyr111 zephyr111 Aug 8, 2024


This is my own modification of the initial ISPC algorithm. I fixed the special values in the initial algorithm and then generated a new lower-order polynomial (with an optimization algorithm in a Python script I wrote myself), since it is a bit faster and the precision is sufficient thanks to the correction close to log(1). Finally, I added a correction based on the Taylor series at log(1). I just checked the functions/errors with the great GeoGebra tool.
Thus, the origin is me and it does not break any license :) .
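A sketch of the Taylor-based correction idea described here (generic code, not the actual ISPC implementation): near x = 1, writing x = 1 + t and summing log(1+t) = t - t^2/2 + t^3/3 - ... converges very quickly when t is tiny, exactly where a range-reduced polynomial is weakest.

```python
import math

def log_near_one(t, terms=4):
    """Truncated Taylor series of log(1 + t) around t = 0."""
    return sum((-1) ** (k + 1) * t ** k / k for k in range(1, terms + 1))

# For t = 1e-3 the truncation error is ~t^5/5 = 2e-16, i.e. sub-ULP
t = 1e-3
err = abs(log_near_one(t) - math.log1p(t))
```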

Collaborator


That's impressive to say the least!

Two things then:

  1. Let's put an implementation note before the algorithm explaining how you came up with it and what its precision characteristics are. This information definitely belongs in the code, as we might find bugs in it, consider improvements, etc. Understanding the background is very important in this case.
  2. Have you "shopped around" for alternatives? I.e., have you compared its precision and performance with the SVML and Sleef alternatives? It would be good to understand where it stands compared to other proven implementations. If needed, we can borrow algorithms from both libraries; it just needs to be done properly (license, copyright, etc.).

Collaborator Author


Thank you.

  1. There was already a comment in the code about the precision (in ULPs). I added information about how the method works.
  2. I have never used Sleef, but it seems interesting/promising. I am going to give it a try, but I think this can be done later, independently of this PR.

Collaborator Author


I finally performed a quick check/benchmark against Sleef (the one available in the Ubuntu 24.04 LTS system packages), since this is relatively simple in practice (as long as we use the standard C interface and not the LLVM IR one, which is apparently not documented).
The C interface might be slightly slower since there is a function call. However, the Sleef functions are not masked, so adding masking introduces a slight overhead. In the end, the two might cancel out, so the performance may be representative of what we could get with a direct IR interface (if any).
Sleef provides 2 groups of functions: ones with a 1 ULP error (typically due to rounding), and ones with a 3.5 ULP error (let's say 4). The former can be significantly slower (i.e. up to 10 times slower) than the latter, and I think 4 ULPs is a good reference for a project like ISPC where people focus on performance (and also because the current ISPC math log/exp functions have a 4 to 10 ULP error with the new PRs). Not to mention the performance of the 4 ULP functions is more stable. See the benchmarks for more information: https://sleef.org/nontrigsp.png and https://sleef.org/trigsp.png .
I wrote a "corrected" implementation (not pushed) to get a <5 ULP error (instead of <10 ULP), in order not to mix apples and oranges when comparing with Sleef. It simply uses a 9-degree polynomial instead of the (current) 8-degree one.
Regarding performance, on my machine (Zen2), I get (in clocks, for the entire 32-bit float range):

This PR implementation (default math):             2.11e9
Corrected PR implementation:                       2.26e9
Sleef avx2-i32x8 with C interface and no masking:  2.73e9

Thus, in this case, Sleef is about 30% slower than the less precise PR implementation and about 20% slower than the similarly precise corrected PR implementation :) !
Part of the overhead might come from the C call, but probably not all of it.
It shows that this PR implementation is pretty good, since it is competitive with Sleef (which is itself competitive with SVML, based on the provided Sleef benchmarks).

Overall, aside from this specific log implementation, IMHO Sleef is pretty interesting for ISPC since it provides:

  • (pretty good) bounded precision and support for special numbers (e.g. NaN, Inf, subnormals, etc.)
  • competitive performance
  • functions that are certainly better tested than the ISPC ones
  • (fast) double-precision functions
  • a flexible license, certainly compatible with the ISPC one
  • an alternative implementation (useful when an implementation has issues)

@dbabokin
Collaborator

dbabokin commented Aug 8, 2024

Also, I measured the impact of the changes on an Apple M3 chip with the neon-i32x4 target. exp is not affected. log is approximately 50% slower.

@zephyr111
Collaborator Author

Does your change utilize Sleef algorithms in any way?

No. This is my own modification.

Could you wrap your test code into a trivial microbenchmark, similar to what we have here: ISPC Benchmarks? This will allow us to retain and integrate your testing code effectively.

We already have it for pow, log, and exp

OK, so in the end there is nothing to add to the microbenchmark, right?

@aneshlya
Collaborator

aneshlya commented Aug 8, 2024

OK, so in the end there is nothing to add to the microbenchmark, right?

Yep, no changes.

@zephyr111
Collaborator Author

I improved the speed of the log function. It is now "only" 25% slower than the initial code for the avx2-i32x8 target and 55% slower for avx2-i32x16. This mitigates the strange issue related to the Inf check (the assembly change seems minor but the performance impact is surprisingly huge). I am going to investigate whether the avx2-i32x16 target can be optimized further. I am not sure there is much more to do now regarding the performance of the avx2-i32x8 target (or Neon). I think 25% is the price to pay for the fixes.

@zephyr111
Collaborator Author

I tried to understand more precisely what the numerical error was, so as to use a more efficient approach for the correction close to x=1, and it worked well. The new approach is not only faster for the avx2-i32x8 target but also significantly more accurate. I described the approach in more detail in the comments of the submitted ISPC file.

In the end, the new code is only 8% slower than the initial one for the avx2-i32x8 target and still 50% slower with avx2-i32x16 (due to register spilling and apparently LLVM optimization issues too). The new code is slightly less precise (by only a few ULPs) for most values, but much more precise for values close to x=1: about 10-15 times better! Overall, the new code has a relative precision of <10 ULP in practice (so 1_000-10_000 times better than the initial code).

I also fixed the formatting issues.
I will certainly remove the exp fix from this PR and open a new one, as proposed by @nurmukhametov.

@zephyr111 zephyr111 force-pushed the better_math_funcs branch 2 times, most recently from f60d900 to 57ab827 Compare August 9, 2024 07:01
@zephyr111
Collaborator Author

It should be fine now and ready to be merged :) !

@nurmukhametov nurmukhametov changed the title Improve 32-bit log/exp standard math functions Improve 32-bit log standard math functions Aug 9, 2024
@nurmukhametov
Collaborator

In the end, the new code is only 8% slower than the initial one for the avx2-i32x8 target and still 50% slower with avx2-i32x16 (due to register spilling and apparently LLVM optimization issues too).

It looks like if we move the exceptional_result definition up like this:

@@ -3477,6 +3477,7 @@ __declspec(safe) static inline float log(float x_full) {
         const bool use_nan = !(x_full >= 0.);
         const bool use_inf = x_full == 0. || x_full == inf;
         const bool exceptional = use_nan || use_inf;
+        const float exceptional_result = select(use_nan, NaN, select(x_full == inf, x_full, -inf));
         const float one = 1.0;
 
         const float patched = select(exceptional, one, x_full);
@@ -3509,7 +3510,6 @@ __declspec(safe) static inline float log(float x_full) {
 
         result = log_from_exponent - x * result;
 
-        const float exceptional_result = select(use_nan, NaN, select(x_full == inf, x_full, -inf));
         return select(exceptional, exceptional_result, result);
     }
 }

then no spills are generated. Could you try this?

@zephyr111
Collaborator Author

In practice, there are 6 spills in the current code and 7 spills if I move the line. The x16 target code is also 1% slower (not really significant). The x8 version seems unaffected (as expected). Thus, it unfortunately does not help on my machine.

By the way, note that I see different results between Godbolt and my local ISPC (with no change other than this PR) and I do not know why yet. Did you use that to check spilling?

I expected LLVM to track the dependencies and perform a good register allocation so as to reduce spilling as much as possible. In practice, such problems are not rare in ISPC with double-pumping. That being said, here it is difficult for LLVM to produce good code (except on AVX-512, thanks to cheap broadcasting embedded in instructions), given the high number of registers to load and the number of variables. Still, there is room for improvement, since the x16 target is significantly slower than the x8 target, and I think LLVM could theoretically just reorder most instructions so as to generate operations on the first 8 items and then on the second group. This is mostly the case, but not totally. However, in practice, the boolean operations result in fused instructions partially preventing such an optimization. This is a bit complicated, since I think merging can be better when people do a lot of boolean operations on SSE/AVX-1/AVX-2/Neon targets, while it is not a good idea otherwise, except maybe in pathological cases (maybe it can reduce register pressure). I am not sure ISPC can make a good decision here, while LLVM certainly can (thanks to a global optimization step). I am clearly not an expert in this, so I may completely misunderstand how such things work or how they should be improved. I think this problem is related to this issue. Still, I am not sure this is the only issue here.

For example, the generated instructions are sub-optimal when spilling happens:

vblendvps	ymm2, ymm2, ymm6, ymm5
vblendvps	ymm2, ymm14, ymm2, ymm4
vmovups	ymm3, ymmword ptr [rsp - 112]   # 32-byte Reload
vaddps	ymm3, ymm3, ymm1
vmovups	ymmword ptr [rsp - 112], ymm3   # 32-byte Spill
vmovups	ymm1, ymmword ptr [rsp - 112]   # 32-byte Reload    <--------------------
vaddps	ymm0, ymm0, ymm2
add	eax, 16
cmp	eax, 2147483632
jne	.LBB109_1

Here we can see a useless reload which could be replaced by a simple vmovaps ymm1, ymm3.

I think spilling can be responsible for up to a 30% slowdown on my machine, so it may not fully explain the performance drop of the double-pumped version...

@nurmukhametov
Collaborator

In practice, there are 6 spills in the current code and 7 spills if I move the line. The x16 target code is also 1% slower (not really significant). The x8 version seems unaffected (as expected). Thus, it unfortunately does not help on my machine.

By the way, note that I see different results between Godbolt and my local ISPC (with no change other than this PR) and I do not know why yet. Did you use that to check spilling?

I used my local build, but the generated code looks similar to compiler-explorer's code (link). I suggest checking which LLVM version you use. Release ISPC and trunk ISPC on compiler-explorer are built with the patched LLVM (see the llvm_patches dir in the repo). But I don't see a difference on my machine between patched LLVM 17 and non-patched LLVM 18 for this example.

@zephyr111
Collaborator Author

zephyr111 commented Aug 11, 2024

TL;DR: the overhead of the avx2-i32x16 target (on Zen2) appears to be mainly the price to pay for the much better precision and correct support of special values. I do not think we can reduce it much more (certainly not by more than 10%) unless a better approach is found (and I do not think there is a much better approach that both reaches a <5 ULP precision and correctly supports special values).


I profiled the double-pumped implementation more intensively to better understand what is happening on Zen2 with this PR. The code is bound by the number of instructions/uops, which is about 50% higher with the PR. Since the IPC is good both before and after the modification (~2 FP uops/cycle), a 50% slowdown makes sense.

Here are a few differences between the two versions for the avx2-i32x16 target:

  • the number of vblendvps instructions increases from 6 to 12: this is expected because of the patch close to 0 added to increase the precision. Indeed, 3 new selects are done (2 for the patch close to zero and 1 for the support of Inf).
  • 2 FMA instructions seem to be split into separate sub+mul instructions (I do not know why, nor whether they could theoretically be fused).
  • the number of comparison instructions increases from 4 to 8 (new vcmpnleps/vcmpnltps): this is expected due to the better support of Inf (first additional check) and the check for values close to 0 (second additional check).
  • the number of vxorps+vpor instructions increases from 4 to 8 overall: this is certainly due to the checks too, though the vxorps could certainly be optimized out.
  • the number of mov-like instructions increases from 14 to 19: this is due to register spilling (3 spills vs 6 spills), which comes from higher register pressure, itself certainly due to more constants to load and the additional operations. There may be a way to reduce it further, but certainly not to reach the same count as before.
  • 2 additional vandps instructions are generated: this is due to the check close to 0 (i.e. the call to abs).
  • a few (7) additional instructions like vpshufd, vpackssdw, vpacksswb and vextracti128 are generated, and I think this is a missed LLVM/ISPC optimization related to boolean operations. Indeed, we pack booleans from YMM registers so as to perform the logical-or, and then unpack the XMM registers to get YMM registers back, while we could just perform YMM operations when the number of boolean operations is not big (hard to do in practice without a non-trivial global optimization pass, though).

Note that the loop calling log is unrolled twice (hence you should divide the numbers of instructions in the above list by two to get the counts per call to log).

The avx2-i32x8 target is less impacted by the overheads than the x16 one, simply because the avx2-i32x8 code is more latency-bound. Thus, the additional instructions can be executed in parallel without adding significant overhead. Meanwhile, the avx2-i32x16 code already hides latency thanks to double-pumping, so additional instructions are not free anymore.

While the generated assembly code can certainly be slightly improved, most of the overhead comes directly from the approach/fix, so I think there is not much left to do (besides possibly using another, better approach, if any).

I used my local build, but the generated code looks similar to compiler-explorer's code (link). I suggest checking which LLVM version you use. Release ISPC and trunk ISPC on compiler-explorer are built with the patched LLVM (see the llvm_patches dir in the repo). But I don't see a difference on my machine between patched LLVM 17 and non-patched LLVM 18 for this example.

OK. I use commit 9e74945 @ 20240806 and LLVM 18.1.3. I think I use a non-patched LLVM (from the Ubuntu system packages, unless the Ubuntu developers patched it).

Collaborator

@nurmukhametov nurmukhametov left a comment


I measured performance both on the GameDev tests and the microbenchmarks. In both cases, avx2-i32x8 performance is 20~25% worse (for tests using log). So it looks like we sacrifice performance for precision on avx2-i32x8 as well.

Comment thread stdlib.ispc Outdated
return z + 0.693359375 * fe;
} else if (__math_lib == __math_lib_ispc) {

// Precision: <10 ULP (for all input values).
Collaborator


What about the ULP error for x=1.120036 (0x3f8f5d57)?

Collaborator Author


The relative error formula gives 7.88694e-7, which looks like <10 ULP at first glance, but as we discussed by mail, the formula is not very precise and the number of ULPs can be a bit bigger than that. In practice, the result is 0x3de829c5 instead of 0x3de829b9 for the truncated value. This means an error of 12 ULPs, so the error bound is not correct (because of the imprecise formula used to check results).
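The 12 ULP figure can be checked directly from the two bit patterns; a Python sketch mirroring the intbits-difference idea (valid here because both values are finite floats of the same sign):

```python
import struct

def intbits(x):
    """Bit pattern of a binary32 value, as an unsigned 32-bit integer."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def ulp_distance(a, b):
    # Plain integer difference of the bit patterns; correct for two
    # finite floats of the same sign.
    return abs(intbits(a) - intbits(b))

res  = struct.unpack('<f', struct.pack('<I', 0x3DE829C5))[0]
dres = struct.unpack('<f', struct.pack('<I', 0x3DE829B9))[0]
```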

Collaborator Author

@zephyr111 zephyr111 Aug 16, 2024


If I modify the checking code so as to compute the error directly in ULPs with the formula abs((uniform int64)intbits(res) - (uniform int64)intbits(dres)), then I get a maximum error of 12 ULPs for normal numbers. This maximum error is reached for values close to the one you mention. Note that this error is maximal for such values because of the problem mentioned above in this PR, which is present in the current ISPC code too (though the error there is much bigger).

This means the actual error bound should not be "<10 ULP" but 12 ULPs. Let's say "<15 ULP" to be safe.

For subnormal numbers (that is, input values close to zero and outputs close to -inf), the error is significantly bigger. I expected the mantissa to be >0.5, but I think this is not true for subnormal numbers. I need to check that. If this is the case, then fixing subnormal numbers requires the polynomial approximation to be defined on a wider range, which will reduce the precision for other numbers (unless we use a higher-order polynomial, which is more expensive).

Collaborator Author

@zephyr111 zephyr111 Aug 17, 2024


For subnormal numbers, it turns out that the problem was already present in the current ISPC implementation, so it is not a problem introduced by this PR. It can be fixed/mitigated, but again at the expense of a slower implementation. Note that the error can be pretty significant for small subnormal numbers. For example, for 1.20206185e-39 (0xd16dc), the result should be -89.616783 while it is -87.932335 (and -87.932327 for the current ISPC implementation, which is not really better). I think this is certainly due to the __range_reduce_log function?
Tell me if you want to improve the precision for subnormal numbers (certainly at the expense of an even slower implementation).
A simple fix makes the code 50% slower again (which is pretty expensive just for subnormal numbers). As a result, it becomes a bit slower than the Sleef implementation, so I think there is certainly a more efficient approach for the overall function.
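The reference value quoted for the subnormal input can be verified in double precision; a quick Python check (the hex constant is the same 0xd16dc bit pattern mentioned above):

```python
import math
import struct

x = struct.unpack('<f', struct.pack('<I', 0x000D16DC))[0]  # ~1.202e-39
# Below the smallest normal binary32 (2^-126), hence subnormal
subnormal = 0.0 < x < 2.0 ** -126
ref = math.log(x)   # double-precision reference, ~ -89.616783
```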

Collaborator


This means the actual error bound should not be "<10 ULP" but 12 ULPs. Let's say "<15 ULP" to be safe.

This is aligned with my estimation.

@zephyr111
Collaborator Author

I measured performance both on the GameDev tests and the microbenchmarks. In both cases, avx2-i32x8 performance is 20~25% worse (for tests using log). So it looks like we sacrifice performance for precision on avx2-i32x8 as well.

Yes. Note that the results depend on the target architecture. Out of curiosity, on which architecture/CPU did you run this benchmark?

… its precision close to x=1

Fix invalid results for 32-bit log:
- Fix the missing support for NaN and Inf in log.
- Fix a numerical instability for input values close to x=1 in log (more specifically, values in the range 1-1.25). The result is now about 10_000 times more precise in this case. It should improve exp too.
- Small performance drop on AVX2, except for double pumping (~8% for avx2-i32x8 & 50% for avx2-i32x16 on Zen2).
@nurmukhametov
Collaborator

I measured performance both on the GameDev tests and the microbenchmarks. In both cases, avx2-i32x8 performance is 20~25% worse (for tests using log). So it looks like we sacrifice performance for precision on avx2-i32x8 as well.

Yes. Note that the results depend on the target architecture. Out of curiosity, on which architecture/CPU did you run this benchmark?

I tested on Intel i9-12900 (GameDev tests) and AMD 7840HS (microbenchmarks).

@zephyr111
Collaborator Author

I am closing this PR since it is superseded by the recent new one.

@zephyr111 zephyr111 closed this Apr 17, 2026
