rp2: Build with -fno-math-errno to use the hardware sqrt instruction. #19356

Open
Gadgetoid wants to merge 1 commit into micropython:master from pimoroni:rp2-fno-math-errno

Gadgetoid commented Jun 18, 2026

Contributor

Without -fno-math-errno the compiler must keep sqrt()/sqrtf() as library calls so they can set errno on a domain error, even though the Cortex-M33 FPU has a single VSQRT.F32 instruction.

MicroPython's math module detects domain errors via isnan/isinf checks on the result rather than errno, so disabling errno here is safe and lets sqrt leverage hardware.

Summary

Make RP2 go brr. Seriously, when you start to write apps and games with vector graphics and all sorts of shiny stuff, you want to eke out every little iota of performance you can!

Testing

Benchmarked on a Pico 2 (RP2350, perfbench N=150 M=100, avg of 3): misc_mandel (complex abs() in its inner loop) improves by ~12% or more, with no measurable change to non-sqrt benchmarks and a slightly smaller binary.

| Benchmark | Before (score) | After (score) | Change |
| --- | --- | --- | --- |
| bm_chaos | 293.69 | 298.50 | +1.6% |
| bm_fannkuch | 82.42 | 81.32 | −1.3% |
| bm_fft | 3189.25 | 3223.54 | +1.1% |
| bm_float | 4720.96 | 4735.30 | +0.3% |
| bm_hexiom | 39.33 | 39.59 | +0.7% |
| bm_nqueens | 3154.12 | 3076.48 | −2.5% |
| bm_pidigits | 612.63 | 608.32 | −0.7% |
| bm_wordcount | 71.13 | 65.64 | −7.7% |
| misc_aes | 530.36 | 510.17 | −3.8% |
| misc_mandel | 2936.27 | 3415.50 | +16.3% |
| misc_pystone | 2197.17 | 2151.49 | −2.1% |
| misc_raytrace | 317.59 | 319.02 | +0.5% |
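For reference, the Change column appears to be the relative difference between the two scores. A minimal C sketch of that arithmetic follows; `percent_change` is a hypothetical name used only for this illustration, and the formula is an assumption inferred from the numbers rather than taken from the perfbench scripts.

```c
// Hypothetical helper: relative change in percent between two benchmark
// scores, assuming Change = (after - before) / before * 100.
static double percent_change(double before, double after) {
    return (after - before) / before * 100.0;
}
```

Under this assumption, `percent_change(2936.27, 3415.50)` comes out at about 16.3, which matches the misc_mandel row.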

Ran the test suite:

937 tests performed (26410 individual testcases)
937 tests passed
122 tests skipped:

Trade-offs and Alternatives

None, as far as I'm aware. This is a win/win.

Generative AI

I used generative AI tools when creating this PR, but a human has checked the
code and is responsible for the code and the description above.

Without -fno-math-errno the compiler must keep sqrt()/sqrtf() as library
calls so they can set errno on a domain error, even though the Cortex-M33
FPU has a single VSQRT.F32 instruction. MicroPython's math module
detects domain errors via isnan/isinf checks on the result rather than
errno, so disabling errno here is safe and lets sqrt leverage hardware.

Benchmarked on a Pico 2 (RP2350, perfbench N=150 M=100, avg of 3):
misc_mandel (complex abs() in its inner loop) improves by ~12%, with no
measurable change to non-sqrt benchmarks and a slightly smaller binary.

Signed-off-by: Phil Howard <github@gadgetoid.com>
@github-actions


Code size report:

Reference:  unix/README: Update the supported targets list. [d901e98]
Comparison: rp2: Build with -fno-math-errno to use the hardware sqrt instruction. [merge of 5176568]
  mpy-cross:    +0 +0.000% 
   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
      esp32:    +0 +0.000% ESP32_GENERIC
     mimxrt:    +0 +0.000% TEENSY40
        rp2:   -24 -0.003% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:    +0 +0.000% VIRT_RV32

@dpgeorge

Member

Is this only relevant for RP2350, and possibly only ARM mode? If so is it worth wrapping this in if(PICO_RP2350 AND PICO_ARM) to make that clear?

Or, does it generate faster code for all archs?

Gadgetoid commented Jun 19, 2026

Contributor Author

I would not expect any change to RP2040 or RP2350 RISC-V, but I'll run the tests just in case. (god knows I need to do more of that.)

Edit: No performance change on RISC-V, but it did net an 8 byte size saving. I would suggest it's correct to include it in RISC-V builds, but more to let the compiler make informed choices than for any specific size/perf improvement. The 8 bytes was the net result of both a gain and a loss, so it's much of a muchness really:

| function | before | after | delta |
| --- | --- | --- | --- |
| __ieee754_lgammaf_r | 2836 | 2814 | -22 |
| __kernel_rem_pio2f | 1706 | 1716 | +10 |
| mp_decimal_exp | 144 | 146 | +2 |

The RP2040 shows similar results: the layout in RAM changes, and some things perform better while others perform worse. There is no clear-cut performance advantage, and any results would be extremely brittle depending on how the code lands in cache.
