Skip to content

py/gc: Add per-word GC table scans behind MICROPY_GC_FAST_TABLE_SCANS.#19363

Draft
Gadgetoid wants to merge 4 commits into
micropython:masterfrom
pimoroni:gc-tailskip
Draft

py/gc: Add per-word GC table scans behind MICROPY_GC_FAST_TABLE_SCANS.#19363
Gadgetoid wants to merge 4 commits into
micropython:masterfrom
pimoroni:gc-tailskip

Conversation

@Gadgetoid

@Gadgetoid Gadgetoid commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Edit: I'm putting the final tests on optimising traversing gaps in the heap and finaliser scans which rely on this same technique. Please hold!

⚠️ Wrt to summary below, the description below covers the single case for tail traversal, but I've generalised this approach to skipping over free runs and finaliser scans. Needs some more testing to find the baseline impact of this change. Split into three commits to make it easier to bisect and test.

Tests in: free run doesn't do as much as I'd hoped, but it does something (~.5us/KB on PSRAM).

Summary

Somewhat a counterpart to: #19367

The GC sweep walked the allocation table one 16-byte block at a time. The interior of a large allocation is a long run of AT_TAIL entries (an ATB byte of 0xaa is four tail blocks, a 32-bit word 0xaaaaaaaa is sixteen).

Such a run never contains a HEAD/MARK, so it is wholly live or wholly dead and can be processed a word at a time - free the whole word when dead, or just advance the high-water mark when live - instead of touching every block. This is the dominant sweep cost for large pure-data buffers: e.g. on RP2040 a 96 KB bytearray collect drops from 7363us to 5747us.

The optimisation is gated behind a new MICROPY_GC_FAST_TAIL_SWEEP option, default disabled; when off the sweep is unchanged. Enabling it costs +112 bytes of flash and no RAM on RP2040 (Cortex-M0+).

The four ATB bytes are read via memcpy (with an alignment hint) rather than casting to uint32_t*, to avoid a strict-aliasing violation. A byte-step-to-alignment then word-step loop issues only aligned word accesses, so it is safe on cores that fault/trap on unaligned loads (Cortex-M0+, Hazard3 RISC-V); validated on M0+, Cortex-M33 and RISC-V. MICROPY_GC_HOOK_LOOP is called within the word loop.

Testing

I have rigorously tested this in every way I can contrive, including byte for byte comparisons of memory state before/after GC and before/after this change and multi-hour runtime tests (including benchmarking) on our Tufty 2350 board. Testing covered both the Hazard3(RISCV)/M33 cores of the RP2350, and the RP2040's M0.

Note that the standard perfbench suite will do anything from very small deviations in performance to wild swings (I measured -30% in aes and -20% in pystone ). Since GC rarely changes (it's one of the dark corners of the project that takes work to understand and care to change) I end up with some new contrived script every time I want to exercise it with benchmarks.

Would the perftest suite be the right place for a GC benchmark? Given how painful GC is on PSRAM (granted, MicroPython never envisioned leaping from 192k to 8MB RAM) it might be worth measuring it specifically.

Trade-offs and Alternatives

This feature is gated behind a flag since it uses about 112 bytes of flash. It is particularly performant for the large allocations - such as heavy bytearrays or buffers - that we're using in our graphics API, but does add a very small cost to gc.collect(). This is very much an enable-if-you-use-PSRAM change, since GC is absolutely pathological when you start storing megabytes of buffers.

I have another chain to add a new no-scan bitflag which gives an orders of magnitude improvement for GC on large PSRAM buffers. Again extremely rigorously tested and benchmarked by every angle I can contrive- this PR is the cherry on top, but also self contained and the less invasive of the two changes.

A weekend of crunching, tests and tweaks has left me with something I'm not completely terrified to raise in a PR. Nonetheless I would appreciate some thoughts/independent verification on this to catch anything I might have missed.

Generative AI

I used generative AI tools when creating this PR, but a human has checked the
code and is responsible for the code and the description above.

(and of course I forgot to --signoff 🤦)

@codecov

codecov Bot commented Jun 21, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.51%. Comparing base (5770ced) to head (46b83a0).
⚠️ Report is 9 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #19363   +/-   ##
=======================================
  Coverage   98.51%   98.51%           
=======================================
  Files         176      176           
  Lines       22904    22905    +1     
=======================================
+ Hits        22563    22564    +1     
  Misses        341      341           

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions

github-actions Bot commented Jun 21, 2026

Copy link
Copy Markdown

Code size report:

Reference:  tools/mpy_ld.py: Allow overriding the internal MPY file name. [b49f098]
Comparison: rp2: Enable fast table scans. [merge of 46b83a0]
  mpy-cross:    +0 +0.000% 
   bare-arm:    +0 +0.000% 
minimal x86:    +1 +0.001% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
      esp32:    -8 -0.000% ESP32_GENERIC
     mimxrt:    +0 +0.000% TEENSY40
        rp2:  +168 +0.018% RPI_PICO_W
       samd:    -8 -0.003% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:    -4 -0.001% VIRT_RV32

@dpgeorge dpgeorge added the py-core Relates to py/ directory in source label Jun 22, 2026
@Gadgetoid

Copy link
Copy Markdown
Contributor Author

Did some more crunching and benchmarking to discover I'd missed that this same optimisation applies in two other places (ATB all-free and FTB all-zero), currently working on implementing and verifying it, since it conceptually matches this change and they'd make a nice combined package.

Will change the flag to something like MICROPY_GC_FAST_TABLE_SCANS, though I'm open to suggestions.

The GC sweep walked the allocation table one 16-byte block at a time.
The interior of a large allocation is a long run of AT_TAIL entries (an
ATB byte of 0xaa is four tail blocks, a 32-bit word 0xaaaaaaaa sixteen).

Such a run never contains a HEAD/MARK, so it is wholly live or wholly
dead and can be processed a word at a time - free the whole word when
dead, or just advance the high-water mark when live - instead of
touching every block. This is the dominant sweep cost for large
pure-data buffers: on RP2040 a 96 KB bytearray collect drops from 7363
to 5747 us.

This introduces MICROPY_GC_FAST_TABLE_SCANS (default disabled), which
gates a family of word-at-a-time scans over the GC's per-block tables;
when off the behaviour is unchanged. This commit adds the first:
coalescing tail-block runs in the sweep. Following commits add free-run
coalescing and a finaliser-table skip under the same option. Enabling
this commit alone costs +112 bytes of flash and no RAM on RP2040.

The four ATB bytes are read via memcpy (with an alignment hint) rather
than casting to uint32_t*, to avoid a strict-aliasing violation. A
byte-step-to-alignment then word-step loop issues only aligned word
accesses, so it is safe on cores that fault on unaligned loads
(Cortex-M0+, Hazard3 RISC-V); validated on M0+, Cortex-M33 and RISC-V.
MICROPY_GC_HOOK_LOOP is called within the word loop.

Signed-off-by: Phil Howard <github@gadgetoid.com>
@Gadgetoid Gadgetoid changed the title py/gc: Add fast tail-block sweep behind MICROPY_GC_FAST_TAIL_SWEEP. py/gc: Add per-word GC table scans behind MICROPY_GC_FAST_TABLE_SCANS. Jun 22, 2026
@Gadgetoid

Copy link
Copy Markdown
Contributor Author

Starting from the no-scan and building the optimised allocation table traversal on top gets us from ~130us/KB down to ~1.6us/KB. This is a very synthetic benchmark with large buffers and a small amount of baseline fragmentation, but it does more or less reflect the real world issues we ran into:

image

Extend the sweep's fast path (MICROPY_GC_FAST_TABLE_SCANS) to skip long
runs of free blocks a word at a time, not just all-tail runs. An
all-free ATB byte/word (0x00) needs no work - the blocks are already
free and not in use - so a large gap (e.g. a freed buffer below
still-live data, which keeps the high-water mark up) is crossed 16
blocks per word instead of one at a time. Shares the same word read and
byte-step-to-alignment dance as the tail-run coalescing.

On a Pico LiPo 2 (RP2350B, 8 MB PSRAM) a collect crossing a 2 MB free
hole below live data saves ~.5us/KB.

Signed-off-by: Phil Howard <github@gadgetoid.com>
gc_sweep_run_finalisers() reads the finaliser table (FTB) byte by byte
up to the high-water mark to find blocks that may have a __del__. The
FTB over a large pure-data buffer is all zero, yet is still scanned a
byte at a time - costly when the table lives in slow PSRAM. Under
MICROPY_GC_FAST_TABLE_SCANS, skip all-zero FTB words (32 blocks) at a
time, using the same aligned-word read as the sweep. A non-zero word (a
block that may have a finaliser) always falls through to the unchanged
per-byte handler, so finaliser execution is unaffected.

Applied only when the FTB is the sole table scanned here (weakref
disabled). On a Pico LiPo 2 the finaliser scan for a live 4 MB buffer
drops from ~4.0 to ~1.8 ms.

Signed-off-by: Phil Howard <github@gadgetoid.com>
For CI, build tests only.

Signed-off-by: Phil Howard <github@gadgetoid.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

py-core Relates to py/ directory in source

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants