py/gc: Add per-word GC table scans behind MICROPY_GC_FAST_TABLE_SCANS. by Gadgetoid · Pull Request #19363 · micropython/micropython

Gadgetoid · 2026-06-21T20:15:42Z

Edit: I'm putting the final tests on optimising traversing gaps in the heap and finaliser scans which rely on this same technique. Please hold!

⚠️ Wrt to summary below, the description below covers the single case for tail traversal, but I've generalised this approach to skipping over free runs and finaliser scans. Needs some more testing to find the baseline impact of this change. Split into three commits to make it easier to bisect and test.

Tests in: free run doesn't do as much as I'd hoped, but it does something (~.5us/KB on PSRAM).

Summary

Somewhat a counterpart to: #19367

The GC sweep walked the allocation table one 16-byte block at a time. The interior of a large allocation is a long run of AT_TAIL entries (an ATB byte of 0xaa is four tail blocks, a 32-bit word 0xaaaaaaaa is sixteen).

Such a run never contains a HEAD/MARK, so it is wholly live or wholly dead and can be processed a word at a time - free the whole word when dead, or just advance the high-water mark when live - instead of touching every block. This is the dominant sweep cost for large pure-data buffers: e.g. on RP2040 a 96 KB bytearray collect drops from 7363us to 5747us.

The optimisation is gated behind a new MICROPY_GC_FAST_TAIL_SWEEP option, default disabled; when off the sweep is unchanged. Enabling it costs +112 bytes of flash and no RAM on RP2040 (Cortex-M0+).

The four ATB bytes are read via memcpy (with an alignment hint) rather than casting to uint32_t*, to avoid a strict-aliasing violation. A byte-step-to-alignment then word-step loop issues only aligned word accesses, so it is safe on cores that fault/trap on unaligned loads (Cortex-M0+, Hazard3 RISC-V); validated on M0+, Cortex-M33 and RISC-V. MICROPY_GC_HOOK_LOOP is called within the word loop.

Testing

I have rigorously tested this in every way I can contrive, including byte for byte comparisons of memory state before/after GC and before/after this change and multi-hour runtime tests (including benchmarking) on our Tufty 2350 board. Testing covered both the Hazard3(RISCV)/M33 cores of the RP2350, and the RP2040's M0.

Note that the standard perfbench suite will do anything from very small deviations in performance to wild swings (I measured -30% in aes and -20% in pystone ). Since GC rarely changes (it's one of the dark corners of the project that takes work to understand and care to change) I end up with some new contrived script every time I want to exercise it with benchmarks.

Would the perftest suite be the right place for a GC benchmark? Given how painful GC is on PSRAM (granted, MicroPython never envisioned leaping from 192k to 8MB RAM) it might be worth measuring it specifically.

Trade-offs and Alternatives

This feature is gated behind a flag since it uses about 112 bytes of flash. It is particularly performant for the large allocations - such as heavy bytearrays or buffers - that we're using in our graphics API, but does add a very small cost to gc.collect(). This is very much an enable-if-you-use-PSRAM change, since GC is absolutely pathological when you start storing megabytes of buffers.

I have another chain to add a new no-scan bitflag which gives an orders of magnitude improvement for GC on large PSRAM buffers. Again extremely rigorously tested and benchmarked by every angle I can contrive- this PR is the cherry on top, but also self contained and the less invasive of the two changes.

A weekend of crunching, tests and tweaks has left me with something I'm not completely terrified to raise in a PR. Nonetheless I would appreciate some thoughts/independent verification on this to catch anything I might have missed.

Generative AI

I used generative AI tools when creating this PR, but a human has checked the
code and is responsible for the code and the description above.

(and of course I forgot to --signoff 🤦)

codecov · 2026-06-21T20:29:54Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.51%. Comparing base (5770ced) to head (46b83a0).
⚠️ Report is 9 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #19363   +/-   ##
=======================================
  Coverage   98.51%   98.51%           
=======================================
  Files         176      176           
  Lines       22904    22905    +1     
=======================================
+ Hits        22563    22564    +1     
  Misses        341      341

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

github-actions · 2026-06-21T20:30:23Z

Code size report:

Reference:  tools/mpy_ld.py: Allow overriding the internal MPY file name. [b49f098]
Comparison: rp2: Enable fast table scans. [merge of 46b83a0]
  mpy-cross:    +0 +0.000% 
   bare-arm:    +0 +0.000% 
minimal x86:    +1 +0.001% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
      esp32:    -8 -0.000% ESP32_GENERIC
     mimxrt:    +0 +0.000% TEENSY40
        rp2:  +168 +0.018% RPI_PICO_W
       samd:    -8 -0.003% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:    -4 -0.001% VIRT_RV32

Gadgetoid · 2026-06-22T16:12:44Z

Did some more crunching and benchmarking to discover I'd missed that this same optimisation applies in two other places (ATB all-free and FTB all-zero), currently working on implementing and verifying it, since it conceptually matches this change and they'd make a nice combined package.

Will change the flag to something like MICROPY_GC_FAST_TABLE_SCANS, though I'm open to suggestions.

The GC sweep walked the allocation table one 16-byte block at a time. The interior of a large allocation is a long run of AT_TAIL entries (an ATB byte of 0xaa is four tail blocks, a 32-bit word 0xaaaaaaaa sixteen). Such a run never contains a HEAD/MARK, so it is wholly live or wholly dead and can be processed a word at a time - free the whole word when dead, or just advance the high-water mark when live - instead of touching every block. This is the dominant sweep cost for large pure-data buffers: on RP2040 a 96 KB bytearray collect drops from 7363 to 5747 us. This introduces MICROPY_GC_FAST_TABLE_SCANS (default disabled), which gates a family of word-at-a-time scans over the GC's per-block tables; when off the behaviour is unchanged. This commit adds the first: coalescing tail-block runs in the sweep. Following commits add free-run coalescing and a finaliser-table skip under the same option. Enabling this commit alone costs +112 bytes of flash and no RAM on RP2040. The four ATB bytes are read via memcpy (with an alignment hint) rather than casting to uint32_t*, to avoid a strict-aliasing violation. A byte-step-to-alignment then word-step loop issues only aligned word accesses, so it is safe on cores that fault on unaligned loads (Cortex-M0+, Hazard3 RISC-V); validated on M0+, Cortex-M33 and RISC-V. MICROPY_GC_HOOK_LOOP is called within the word loop. Signed-off-by: Phil Howard <github@gadgetoid.com>

Gadgetoid · 2026-06-22T17:44:19Z

Starting from the no-scan and building the optimised allocation table traversal on top gets us from ~130us/KB down to ~1.6us/KB. This is a very synthetic benchmark with large buffers and a small amount of baseline fragmentation, but it does more or less reflect the real world issues we ran into:

Extend the sweep's fast path (MICROPY_GC_FAST_TABLE_SCANS) to skip long runs of free blocks a word at a time, not just all-tail runs. An all-free ATB byte/word (0x00) needs no work - the blocks are already free and not in use - so a large gap (e.g. a freed buffer below still-live data, which keeps the high-water mark up) is crossed 16 blocks per word instead of one at a time. Shares the same word read and byte-step-to-alignment dance as the tail-run coalescing. On a Pico LiPo 2 (RP2350B, 8 MB PSRAM) a collect crossing a 2 MB free hole below live data saves ~.5us/KB. Signed-off-by: Phil Howard <github@gadgetoid.com>

gc_sweep_run_finalisers() reads the finaliser table (FTB) byte by byte up to the high-water mark to find blocks that may have a __del__. The FTB over a large pure-data buffer is all zero, yet is still scanned a byte at a time - costly when the table lives in slow PSRAM. Under MICROPY_GC_FAST_TABLE_SCANS, skip all-zero FTB words (32 blocks) at a time, using the same aligned-word read as the sweep. A non-zero word (a block that may have a finaliser) always falls through to the unchanged per-byte handler, so finaliser execution is unaffected. Applied only when the FTB is the sole table scanned here (weakref disabled). On a Pico LiPo 2 the finaliser scan for a live 4 MB buffer drops from ~4.0 to ~1.8 ms. Signed-off-by: Phil Howard <github@gadgetoid.com>

For CI, build tests only. Signed-off-by: Phil Howard <github@gadgetoid.com>

Gadgetoid force-pushed the gc-tailskip branch from 41ba681 to be7d970 Compare June 21, 2026 20:17

Gadgetoid force-pushed the gc-tailskip branch from 711ccfe to 3d5b51a Compare June 21, 2026 21:19

Gadgetoid mentioned this pull request Jun 22, 2026

py/gc: Track data and skip scan, MICROPY_GC_NO_SCAN #19367

Draft

dpgeorge added the py-core Relates to py/ directory in source label Jun 22, 2026

Gadgetoid force-pushed the gc-tailskip branch from 3d5b51a to fc70fec Compare June 22, 2026 17:10

Gadgetoid changed the title ~~py/gc: Add fast tail-block sweep behind MICROPY_GC_FAST_TAIL_SWEEP.~~ py/gc: Add per-word GC table scans behind MICROPY_GC_FAST_TABLE_SCANS. Jun 22, 2026

Gadgetoid added 3 commits June 22, 2026 20:24

rp2: Enable fast table scans.

46b83a0

For CI, build tests only. Signed-off-by: Phil Howard <github@gadgetoid.com>

Gadgetoid force-pushed the gc-tailskip branch from fc70fec to 46b83a0 Compare June 22, 2026 19:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

py/gc: Add per-word GC table scans behind MICROPY_GC_FAST_TABLE_SCANS.#19363

py/gc: Add per-word GC table scans behind MICROPY_GC_FAST_TABLE_SCANS.#19363
Gadgetoid wants to merge 4 commits into
micropython:masterfrom
pimoroni:gc-tailskip

Gadgetoid commented Jun 21, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 21, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 21, 2026 •

edited

Loading

Uh oh!

Gadgetoid commented Jun 22, 2026

Uh oh!

Gadgetoid commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Gadgetoid commented Jun 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing

Trade-offs and Alternatives

Generative AI

Uh oh!

codecov Bot commented Jun 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Jun 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Gadgetoid commented Jun 22, 2026

Uh oh!

Gadgetoid commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Gadgetoid commented Jun 21, 2026 •

edited

Loading

codecov Bot commented Jun 21, 2026 •

edited

Loading

github-actions Bot commented Jun 21, 2026 •

edited

Loading