rp2: Revert newlib nano on RP2350, move nano libc to RAM on RP2040.#19352

Draft
projectgus wants to merge 3 commits into micropython:master from projectgus:bugfix/rp2_nano_performance

@projectgus projectgus commented Jun 18, 2026

Contributor

Summary

This is a follow-up to fix some performance regressions from #19299:

  • Some nano libc functions weren't being picked up by the linker script to link into RAM, causing more flash cache misses. (This probably explains the +720 bytes free .bss in PR 19299!)
  • RP2350 performance was significantly impacted by switching from the standard libc memcpy & memset, which unroll the loop, to the nano versions which only do byte by byte copies. Thanks @kilograham for bringing this to our attention.

In this PR:

  • Only enable nano.specs for RP2040.
  • On RP2040, link all libc string functions, mem functions, and the pico_mem_ops functions to RAM. The impact here is relatively small because these are the nano libc functions, and the "pico_mem_ops" functions are very small shim wrappers around calls to ROM memcpy and memcmp. Most of these functions were not in RAM in earlier MicroPython versions.

Testing

Tested using perfbench, treating any variation under 4% as within the margin of error for cache effects caused by different binaries.
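
As a rough illustration of the 4% rule above, here is a minimal sketch that scans rows in the "diff of scores" format and keeps only those whose diff% is at or beyond a threshold. The function name `notable_changes` and the regular expression are assumptions made for this example and are not part of perfbench itself.

```python
import re

# Matches one result row of a "diff of scores" table as pasted in this PR.
# (Illustrative pattern only, written against the rows shown here.)
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<old>[\d.]+)\s+->\s+(?P<new>[\d.]+)\s+:"
    r"\s+(?P<diff>[+-][\d.]+)\s+=\s+(?P<pct>[+-][\d.]+)%"
)


def notable_changes(report, threshold_pct=4.0):
    """Return (name, diff%) pairs whose |diff%| is at least threshold_pct."""
    found = []
    for line in report.splitlines():
        m = ROW.match(line.strip())
        if m and abs(float(m.group("pct"))) >= threshold_pct:
            found.append((m.group("name"), float(m.group("pct"))))
    return found


# Three rows copied from the RP2040 pre-nano vs master comparison.
SAMPLE = """\
bm_fft.py                     1372.81 ->    1439.68 :     +66.87 =  +4.871% (+/-0.02%)
bm_float.py                   1776.33 ->    1772.48 :      -3.85 =  -0.217% (+/-0.09%)
core_qstr.py                   125.85 ->     116.52 :      -9.33 =  -7.414% (+/-0.10%)
"""

print(notable_changes(SAMPLE))
```

Header lines such as the `N=168 M=100 ...` row do not match the pattern and are skipped, so a whole table can be passed in unchanged.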

RP2040

Comparing pre-nano MicroPython commit to current master:

diff of scores (higher is better)
N=168 M=100                rp2040_pre_nano.txt -> rp2040_master.txt         diff      diff% (error%)
bm_chaos.py                    154.11 ->     152.00 :      -2.11 =  -1.369% (+/-0.09%)
bm_fannkuch.py                  56.47 ->      55.06 :      -1.41 =  -2.497% (+/-0.03%)
bm_fft.py                     1372.81 ->    1439.68 :     +66.87 =  +4.871% (+/-0.02%)
bm_float.py                   1776.33 ->    1772.48 :      -3.85 =  -0.217% (+/-0.09%)
bm_hexiom.py                    23.13 ->      23.23 :      +0.10 =  +0.432% (+/-0.06%)
bm_nqueens.py                 1965.06 ->    2051.67 :     +86.61 =  +4.407% (+/-0.14%)
bm_pidigits.py                 404.07 ->     412.89 :      +8.82 =  +2.183% (+/-0.06%)
bm_wordcount.py                 38.95 ->      34.11 :      -4.84 = -12.426% (+/-0.10%)
core_import_mpy_multi.py       232.11 ->     230.26 :      -1.85 =  -0.797% (+/-0.10%)
core_import_mpy_single.py       44.16 ->      42.96 :      -1.20 =  -2.717% (+/-0.27%)
core_locals.py                  32.57 ->      32.52 :      -0.05 =  -0.154% (+/-0.01%)
core_qstr.py                   125.85 ->     116.52 :      -9.33 =  -7.414% (+/-0.10%)
core_str.py                     17.67 ->      16.61 :      -1.06 =  -5.999% (+/-0.04%)
core_yield_from.py             225.63 ->     225.63 :      +0.00 =  +0.000% (+/-0.01%)
misc_aes.py                    239.24 ->     240.31 :      +1.07 =  +0.447% (+/-0.09%)
misc_mandel.py                1188.01 ->    1244.04 :     +56.03 =  +4.716% (+/-0.05%)
misc_pystone.py               1092.67 ->    1049.90 :     -42.77 =  -3.914% (+/-0.08%)
misc_raytrace.py               164.20 ->     161.21 :      -2.99 =  -1.821% (+/-0.05%)
viper_call0.py                 319.99 ->     319.98 :      -0.01 =  -0.003% (+/-0.00%)
viper_call1a.py                312.78 ->     312.78 :      +0.00 =  +0.000% (+/-0.01%)
viper_call1b.py                235.10 ->     235.10 :      +0.00 =  +0.000% (+/-0.00%)
viper_call1c.py                236.88 ->     236.88 :      +0.00 =  +0.000% (+/-0.00%)
viper_call2a.py                308.16 ->     308.16 :      +0.00 =  +0.000% (+/-0.01%)
viper_call2b.py                206.72 ->     206.72 :      +0.00 =  +0.000% (+/-0.00%)

Most of these changes are within the margin of error, but notable ones include core_qstr.py and core_str.py becoming slower.

Comparing pre-nano to this PR:

diff of scores (higher is better)
N=168 M=100                rp2040_pre_nano.txt -> rp2040_pr_branch.txt         diff      diff% (error%)
bm_chaos.py                    154.11 ->     152.08 :      -2.03 =  -1.317% (+/-0.08%)
bm_fannkuch.py                  56.47 ->      54.81 :      -1.66 =  -2.940% (+/-0.01%)
bm_fft.py                     1372.81 ->    1429.48 :     +56.67 =  +4.128% (+/-0.01%)
bm_float.py                   1776.33 ->    1743.59 :     -32.74 =  -1.843% (+/-0.11%)
bm_hexiom.py                    23.13 ->      22.75 :      -0.38 =  -1.643% (+/-0.05%)
bm_nqueens.py                 1965.06 ->    2125.50 :    +160.44 =  +8.165% (+/-0.05%)
bm_pidigits.py                 404.07 ->     408.43 :      +4.36 =  +1.079% (+/-0.06%)
bm_wordcount.py                 38.95 ->      38.44 :      -0.51 =  -1.309% (+/-0.03%)
core_import_mpy_multi.py       232.11 ->     228.28 :      -3.83 =  -1.650% (+/-0.09%)
core_import_mpy_single.py       44.16 ->      41.46 :      -2.70 =  -6.114% (+/-0.25%)
core_locals.py                  32.57 ->      32.55 :      -0.02 =  -0.061% (+/-0.01%)
core_qstr.py                   125.85 ->     123.39 :      -2.46 =  -1.955% (+/-0.11%)
core_str.py                     17.67 ->      17.33 :      -0.34 =  -1.924% (+/-0.03%)
core_yield_from.py             225.63 ->     225.71 :      +0.08 =  +0.035% (+/-0.01%)
misc_aes.py                    239.24 ->     230.58 :      -8.66 =  -3.620% (+/-0.09%)
misc_mandel.py                1188.01 ->    1229.93 :     +41.92 =  +3.529% (+/-0.06%)
misc_pystone.py               1092.67 ->    1065.89 :     -26.78 =  -2.451% (+/-0.06%)
misc_raytrace.py               164.20 ->     160.63 :      -3.57 =  -2.174% (+/-0.08%)
viper_call0.py                 319.99 ->     319.98 :      -0.01 =  -0.003% (+/-0.00%)
viper_call1a.py                312.78 ->     312.78 :      +0.00 =  +0.000% (+/-0.01%)
viper_call1b.py                235.10 ->     235.09 :      -0.01 =  -0.004% (+/-0.00%)
viper_call1c.py                236.88 ->     236.88 :      +0.00 =  +0.000% (+/-0.01%)
viper_call2a.py                308.16 ->     308.15 :      -0.01 =  -0.003% (+/-0.01%)
viper_call2b.py                206.72 ->     206.71 :      -0.01 =  -0.005% (+/-0.00%)

These results are all relatively noisy and it's hard to draw clear conclusions, but overall this PR looks to me to have less significant regressions than current master. The -6% on core_import_mpy_single.py is odd, but it didn't appear in an earlier version of this PR so it's probably noise due to cache layout.

(EDIT: In a previous version of this analysis I read something backwards!)

However at least we can say there's no obvious regression, and we still have smaller binary size & RAM usage compared to pre-nano. (339608 flash & 12692 RAM pre-nano, 338720 & 12388 with this PR.)
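
For reference, the savings implied by those figures can be worked out as follows. This is only arithmetic on the numbers quoted above; the dictionary names are illustrative.

```python
# Sizes in bytes as quoted above for RP2040.
pre_nano = {"flash": 339608, "ram": 12692}
this_pr = {"flash": 338720, "ram": 12388}

# Positive values mean this PR is smaller than the pre-nano build.
savings = {key: pre_nano[key] - this_pr[key] for key in pre_nano}

for key, saved in savings.items():
    print(f"{key}: {saved} bytes smaller than pre-nano")
```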

RP2350

Pre-nano vs current master:

diff of scores (higher is better)
N=168 M=100                rp2350_pre_nano.txt -> rp2350_master.txt         diff      diff% (error%)
bm_chaos.py                    307.38 ->     274.25 :     -33.13 = -10.778% (+/-0.08%)
bm_fannkuch.py                  92.60 ->      84.65 :      -7.95 =  -8.585% (+/-0.05%)
bm_fft.py                     2989.25 ->    2776.11 :    -213.14 =  -7.130% (+/-0.03%)
bm_float.py                   4833.23 ->    4362.35 :    -470.88 =  -9.743% (+/-0.09%)
bm_hexiom.py                    43.77 ->      37.62 :      -6.15 = -14.051% (+/-0.05%)
bm_nqueens.py                 3708.24 ->    3124.45 :    -583.79 = -15.743% (+/-0.06%)
bm_pidigits.py                 773.64 ->     578.45 :    -195.19 = -25.230% (+/-0.07%)
bm_wordcount.py                 67.40 ->      67.69 :      +0.29 =  +0.430% (+/-0.03%)
core_import_mpy_multi.py       430.47 ->     445.18 :     +14.71 =  +3.417% (+/-0.08%)
core_import_mpy_single.py       89.38 ->      92.21 :      +2.83 =  +3.166% (+/-0.22%)
core_locals.py                  59.85 ->      57.90 :      -1.95 =  -3.258% (+/-0.04%)
core_qstr.py                   188.09 ->     199.66 :     +11.57 =  +6.151% (+/-0.07%)
core_str.py                     29.84 ->      29.35 :      -0.49 =  -1.642% (+/-0.05%)
core_yield_from.py             401.27 ->     362.21 :     -39.06 =  -9.734% (+/-0.03%)
misc_aes.py                    432.89 ->     393.01 :     -39.88 =  -9.213% (+/-0.07%)
misc_mandel.py                3298.48 ->    3183.31 :    -115.17 =  -3.492% (+/-0.06%)
misc_pystone.py               2012.00 ->    1884.39 :    -127.61 =  -6.342% (+/-0.09%)
misc_raytrace.py               323.51 ->     296.71 :     -26.80 =  -8.284% (+/-0.04%)
viper_call0.py                 559.93 ->     559.91 :      -0.02 =  -0.004% (+/-0.01%)
viper_call1a.py                546.13 ->     546.11 :      -0.02 =  -0.004% (+/-0.01%)
viper_call1b.py                449.52 ->     449.52 :      +0.00 =  +0.000% (+/-0.00%)
viper_call1c.py                456.37 ->     456.35 :      -0.02 =  -0.004% (+/-0.00%)
viper_call2a.py                536.34 ->     536.33 :      -0.01 =  -0.002% (+/-0.01%)
viper_call2b.py                400.32 ->     400.32 :      +0.00 =  +0.000% (+/-0.01%)

😬 Not great, I should have checked this before merging 19299!

Pre-nano versus this PR:

diff of scores (higher is better)
N=168 M=100                rp2350_pre_nano.txt -> rp2350_pr_branch.txt         diff      diff% (error%)
bm_chaos.py                    307.38 ->     309.29 :      +1.91 =  +0.621% (+/-0.08%)
bm_fannkuch.py                  92.60 ->      93.53 :      +0.93 =  +1.004% (+/-0.06%)
bm_fft.py                     2989.25 ->    2818.90 :    -170.35 =  -5.699% (+/-0.03%)
bm_float.py                   4833.23 ->    4707.45 :    -125.78 =  -2.602% (+/-0.12%)
bm_hexiom.py                    43.77 ->      44.29 :      +0.52 =  +1.188% (+/-0.05%)
bm_nqueens.py                 3708.24 ->    3708.21 :      -0.03 =  -0.001% (+/-0.07%)
bm_pidigits.py                 773.64 ->     765.10 :      -8.54 =  -1.104% (+/-0.09%)
bm_wordcount.py                 67.40 ->      65.27 :      -2.13 =  -3.160% (+/-0.04%)
core_import_mpy_multi.py       430.47 ->     429.81 :      -0.66 =  -0.153% (+/-0.07%)
core_import_mpy_single.py       89.38 ->      88.22 :      -1.16 =  -1.298% (+/-0.15%)
core_locals.py                  59.85 ->      59.72 :      -0.13 =  -0.217% (+/-0.04%)
core_qstr.py                   188.09 ->     191.20 :      +3.11 =  +1.653% (+/-0.09%)
core_str.py                     29.84 ->      29.00 :      -0.84 =  -2.815% (+/-0.03%)
core_yield_from.py             401.27 ->     401.29 :      +0.02 =  +0.005% (+/-0.01%)
misc_aes.py                    432.89 ->     433.29 :      +0.40 =  +0.092% (+/-0.06%)
misc_mandel.py                3298.48 ->    3006.84 :    -291.64 =  -8.842% (+/-0.05%)
misc_pystone.py               2012.00 ->    1955.00 :     -57.00 =  -2.833% (+/-0.10%)
misc_raytrace.py               323.51 ->     319.08 :      -4.43 =  -1.369% (+/-0.04%)
viper_call0.py                 559.93 ->     564.12 :      +4.19 =  +0.748% (+/-0.01%)
viper_call1a.py                546.13 ->     550.12 :      +3.99 =  +0.731% (+/-0.01%)
viper_call1b.py                449.52 ->     452.24 :      +2.72 =  +0.605% (+/-0.00%)
viper_call1c.py                456.37 ->     459.16 :      +2.79 =  +0.611% (+/-0.00%)
viper_call2a.py                536.34 ->     540.22 :      +3.88 =  +0.723% (+/-0.01%)
viper_call2b.py                400.32 ->     402.48 :      +2.16 =  +0.540% (+/-0.01%)

I would expect these results to be basically the same as pre-nano, so I think the main cause of the changes here is noise...

Trade-offs and Alternatives

  • We could completely revert rp2: Build with nano.specs, add linker cref table #19299 and keep using default.specs everywhere. The differences are not that big either way, so this might be the simpler approach.
  • Could link our versions of libc memory & string functions from shared/libc/string0.c instead. This has a version of memcpy that uses full word operations, for example. In initial testing on RP2350 this showed quite small size, and performance midway between the "nano" libc and the default newlib functions.
  • We could also look at moving string functions and memcpy/memcmp to RAM on RP2350 to get more performance at the cost of less free RAM. This might be worth looking into in a follow-up PR.
  • While making these changes I noted that the rp2 port linker scripts try to link *gc.c.obj *vm.c.obj *parse.c.obj to RAM, but these don't match anything as the pico-sdk sets CMAKE_C_OUTPUT_EXTENSION to .o. So we should either remove these, or experiment with the RAM/Performance trade-off of putting these parts of MicroPython into RAM.
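
To illustrate the last point, here is a small sketch of why a pattern ending in `.c.obj` cannot select an object file whose name ends in `.c.o`. It uses Python's `fnmatch` as a stand-in for the linker's wildcard matching, which is an approximation, and the object file names are hypothetical examples rather than actual build output.

```python
from fnmatch import fnmatchcase

# Patterns as quoted from the rp2 linker scripts.
PATTERNS = ["*gc.c.obj", "*vm.c.obj", "*parse.c.obj"]

# Hypothetical object names for the two output extensions.
O_NAMES = ["gc.c.o", "vm.c.o", "parse.c.o"]
OBJ_NAMES = ["gc.c.obj", "vm.c.obj", "parse.c.obj"]


def matched(patterns, names):
    """Return the names selected by at least one pattern."""
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]


print(matched(PATTERNS, O_NAMES))    # nothing is selected with a .o extension
print(matched(PATTERNS, OBJ_NAMES))  # all three are selected with .obj
```

Under this approximation, either changing the patterns to end in `.c.o` or removing them would make the linker script reflect what actually gets linked.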

Generative AI

I did not use generative AI tools when creating this PR.

These are thin wrappers around the ROM functions for memcpy
and memset, just a few bytes - this way avoids a cache miss
when calling them.

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <angus@redyak.com.au>
Fixes performance regression on RP2350 when switching to nano.specs in
6552836.

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <angus@redyak.com.au>
As these are the "nano" versions the impact is relatively small.

This work was funded through GitHub Sponsors.

Signed-off-by: Angus Gratton <angus@redyak.com.au>
@github-actions


Code size report:

Reference:  unix/README: Update the supported targets list. [d901e98]
Comparison: rp2: Link libc string functions to ram on RP2040. [merge of ce1e884]
  mpy-cross:    +0 +0.000% 
   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
      esp32:    +0 +0.000% ESP32_GENERIC
     mimxrt:    +0 +0.000% TEENSY40
        rp2:  +256 +0.028% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:    +0 +0.000% VIRT_RV32

@projectgus

Contributor Author

rp2: +256 +0.028% RPI_PICO_W

I don't understand why code size difference hasn't picked up any increase of static RAM use here.

@projectgus

Contributor Author

rp2: +256 +0.028% RPI_PICO_W

I don't understand why code size difference hasn't picked up any increase of static RAM use here.

Ah OK, something else weird is going on here.

Here's RPI_PICO_W built in this PR in CI:

2026-06-18T04:30:58.1819030Z Memory region         Used Size  Region Size  %age Used
2026-06-18T04:30:58.1819609Z            FLASH:      878316 B      1200 KB     71.48%
2026-06-18T04:30:58.1820041Z         FLASH_FS:           0 B       848 KB      0.00%
2026-06-18T04:30:58.1820462Z              RAM:       55984 B       256 KB     21.36%
2026-06-18T04:30:58.1820861Z        SCRATCH_X:           0 B          0 B
2026-06-18T04:30:58.1821266Z        SCRATCH_Y:          8 KB         8 KB    100.00%

... and as built in latest master branch commit:

2026-06-12T08:07:33.2710825Z Memory region         Used Size  Region Size  %age Used
2026-06-12T08:07:33.2711362Z            FLASH:      878068 B      1200 KB     71.46%
2026-06-12T08:07:33.2711765Z         FLASH_FS:           0 B       848 KB      0.00%
2026-06-12T08:07:33.2712172Z              RAM:       52944 B       256 KB     20.20%
2026-06-12T08:07:33.2712541Z        SCRATCH_X:           0 B          0 B
2026-06-12T08:07:33.2712924Z        SCRATCH_Y:          8 KB         8 KB    100.00%

Somehow the CI build of this PR is using about 3KB more RAM (55984 B versus 52944 B), but if I build both of these locally then the difference is only about +700 bytes of RAM.

Need to investigate more, probably this is a newlib version thing.
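
The CI figures above can be compared mechanically. The following is a minimal sketch, assuming the linker's "Memory region" output format shown in the logs; the helper name `used_bytes` and the regular expression are made up for this example. Only regions whose used size is reported in bytes are picked up, so `SCRATCH_Y` (reported in KB) is ignored.

```python
import re

# Captures "<REGION>:   <N> B" from a linker memory usage line.
USED = re.compile(r"(?P<region>\w+):\s+(?P<used>\d+) B\b")


def used_bytes(log):
    """Map region name to used size in bytes for regions reported in bytes."""
    return {m["region"]: int(m["used"]) for m in USED.finditer(log)}


# Lines copied from the two CI logs quoted above.
PR_LOG = """\
2026-06-18T04:30:58.1819609Z            FLASH:      878316 B      1200 KB     71.48%
2026-06-18T04:30:58.1820462Z              RAM:       55984 B       256 KB     21.36%
"""
MASTER_LOG = """\
2026-06-12T08:07:33.2711362Z            FLASH:      878068 B      1200 KB     71.46%
2026-06-12T08:07:33.2712172Z              RAM:       52944 B       256 KB     20.20%
"""

pr, master = used_bytes(PR_LOG), used_bytes(MASTER_LOG)
delta = {region: pr[region] - master[region] for region in pr}
print(delta)
```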

@octoprobe-bot


Octoprobe PR report

Test                                                    Tests passed  Tests skipped  Tests xfailed  Tests failed
format flash                                                       5
run-tests.py                                                    4727            565
run-tests.py --via-mpy --emit native                            4661            630                            1
run-tests.py --via-mpy                                          4723            567                            2
run-perfbench.py                                                 120
run-natmodtests.py                                               180             23              2
run-tests.py --test-dirs=extmod_hardware                           7             30             11             2
run-tests.py --test-dirs=extmod_hardware --emit-native             9             30             11
Total                                                          14432           1845             24             5
Failures

Group: run-tests.py --test-dirs=extmod_hardware

Test                            rp2 5334-RPI_PICO2             rp2 5334-RPI_PICO2-RISCV       rp2 552b-RPI_PICO2_W  rp2 5f2c-RPI_PICO_W  rp2 6038-RPI_PICO_W
extmod_hardware/machine_pwm.py  XFAIL (xfail_master_478.json)  XFAIL (xfail_master_478.json)  pass                  FAIL                 FAIL

Group: run-tests.py --via-mpy --emit native

Test                       rp2 5334-RPI_PICO2  rp2 5334-RPI_PICO2-RISCV  rp2 552b-RPI_PICO2_W  rp2 5f2c-RPI_PICO_W  rp2 6038-RPI_PICO_W
extmod/select_poll_udp.py  skip                skip                      pass                  FAIL                 pass

Group: run-tests.py --via-mpy

Test                         rp2 5334-RPI_PICO2  rp2 5334-RPI_PICO2-RISCV  rp2 552b-RPI_PICO2_W  rp2 5f2c-RPI_PICO_W  rp2 6038-RPI_PICO_W
extmod/select_poll_eintr.py  skip                skip                      pass                  FAIL                 pass
extmod/select_poll_udp.py    skip                skip                      pass                  FAIL                 pass

Comment thread ports/rp2/CMakeLists.txt
if(PICO_RP2040)
# Enable nano.specs for RP2040 only.
#
# Pico-sdk already enables nosys.specs to stub out syscall handlers,

Member

I just saw this related pico-sdk PR: raspberrypi/pico-sdk#3014

That will be in the next pico-sdk release. Not sure if it means anything for us?

@dpgeorge

Member
  • While making these changes I noted that the rp2 port linker scripts try to link *gc.c.obj *vm.c.obj *parse.c.obj to RAM, but these don't match anything as the pico-sdk sets CMAKE_C_OUTPUT_EXTENSION to .o.

Oh wow! These have been there since the beginning and I'm certain I benchmarked the effect of putting these in RAM... it would be worth revisiting this to test the performance of them actually being in RAM.
