py/gc: Add per-word GC table scans behind MICROPY_GC_FAST_TABLE_SCANS.#19363
Draft
Gadgetoid wants to merge 4 commits into
Draft
py/gc: Add per-word GC table scans behind MICROPY_GC_FAST_TABLE_SCANS.#19363Gadgetoid wants to merge 4 commits into
Gadgetoid wants to merge 4 commits into
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #19363 +/- ##
=======================================
Coverage 98.51% 98.51%
=======================================
Files 176 176
Lines 22904 22905 +1
=======================================
+ Hits 22563 22564 +1
Misses 341 341 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
Code size report: |
Contributor
Author
|
Did some more crunching and benchmarking to discover I'd missed that this same optimisation applies in two other places (ATB all-free and FTB all-zero), currently working on implementing and verifying it, since it conceptually matches this change and they'd make a nice combined package. Will change the flag to something like |
The GC sweep walked the allocation table one 16-byte block at a time. The interior of a large allocation is a long run of AT_TAIL entries (an ATB byte of 0xaa is four tail blocks, a 32-bit word 0xaaaaaaaa sixteen). Such a run never contains a HEAD/MARK, so it is wholly live or wholly dead and can be processed a word at a time - free the whole word when dead, or just advance the high-water mark when live - instead of touching every block. This is the dominant sweep cost for large pure-data buffers: on RP2040 a 96 KB bytearray collect drops from 7363 to 5747 us. This introduces MICROPY_GC_FAST_TABLE_SCANS (default disabled), which gates a family of word-at-a-time scans over the GC's per-block tables; when off the behaviour is unchanged. This commit adds the first: coalescing tail-block runs in the sweep. Following commits add free-run coalescing and a finaliser-table skip under the same option. Enabling this commit alone costs +112 bytes of flash and no RAM on RP2040. The four ATB bytes are read via memcpy (with an alignment hint) rather than casting to uint32_t*, to avoid a strict-aliasing violation. A byte-step-to-alignment then word-step loop issues only aligned word accesses, so it is safe on cores that fault on unaligned loads (Cortex-M0+, Hazard3 RISC-V); validated on M0+, Cortex-M33 and RISC-V. MICROPY_GC_HOOK_LOOP is called within the word loop. Signed-off-by: Phil Howard <github@gadgetoid.com>
Contributor
Author
Extend the sweep's fast path (MICROPY_GC_FAST_TABLE_SCANS) to skip long runs of free blocks a word at a time, not just all-tail runs. An all-free ATB byte/word (0x00) needs no work - the blocks are already free and not in use - so a large gap (e.g. a freed buffer below still-live data, which keeps the high-water mark up) is crossed 16 blocks per word instead of one at a time. Shares the same word read and byte-step-to-alignment dance as the tail-run coalescing. On a Pico LiPo 2 (RP2350B, 8 MB PSRAM) a collect crossing a 2 MB free hole below live data saves ~.5us/KB. Signed-off-by: Phil Howard <github@gadgetoid.com>
gc_sweep_run_finalisers() reads the finaliser table (FTB) byte by byte up to the high-water mark to find blocks that may have a __del__. The FTB over a large pure-data buffer is all zero, yet is still scanned a byte at a time - costly when the table lives in slow PSRAM. Under MICROPY_GC_FAST_TABLE_SCANS, skip all-zero FTB words (32 blocks) at a time, using the same aligned-word read as the sweep. A non-zero word (a block that may have a finaliser) always falls through to the unchanged per-byte handler, so finaliser execution is unaffected. Applied only when the FTB is the sole table scanned here (weakref disabled). On a Pico LiPo 2 the finaliser scan for a live 4 MB buffer drops from ~4.0 to ~1.8 ms. Signed-off-by: Phil Howard <github@gadgetoid.com>
For CI, build tests only. Signed-off-by: Phil Howard <github@gadgetoid.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Edit: I'm putting the final tests on optimising traversing gaps in the heap and finaliser scans which rely on this same technique. Please hold!
Tests in: free run doesn't do as much as I'd hoped, but it does something (~.5us/KB on PSRAM).
Summary
Somewhat a counterpart to: #19367
The GC sweep walked the allocation table one 16-byte block at a time. The interior of a large allocation is a long run of AT_TAIL entries (an ATB byte of 0xaa is four tail blocks, a 32-bit word 0xaaaaaaaa is sixteen).
Such a run never contains a HEAD/MARK, so it is wholly live or wholly dead and can be processed a word at a time - free the whole word when dead, or just advance the high-water mark when live - instead of touching every block. This is the dominant sweep cost for large pure-data buffers: e.g. on RP2040 a 96 KB bytearray collect drops from 7363us to 5747us.
The optimisation is gated behind a new
MICROPY_GC_FAST_TAIL_SWEEPoption, default disabled; when off the sweep is unchanged. Enabling it costs +112 bytes of flash and no RAM on RP2040 (Cortex-M0+).The four ATB bytes are read via memcpy (with an alignment hint) rather than casting to uint32_t*, to avoid a strict-aliasing violation. A byte-step-to-alignment then word-step loop issues only aligned word accesses, so it is safe on cores that fault/trap on unaligned loads (Cortex-M0+, Hazard3 RISC-V); validated on M0+, Cortex-M33 and RISC-V. MICROPY_GC_HOOK_LOOP is called within the word loop.
Testing
I have rigorously tested this in every way I can contrive, including byte for byte comparisons of memory state before/after GC and before/after this change and multi-hour runtime tests (including benchmarking) on our Tufty 2350 board. Testing covered both the Hazard3(RISCV)/M33 cores of the RP2350, and the RP2040's M0.
Note that the standard perfbench suite will do anything from very small deviations in performance to wild swings (I measured -30% in aes and -20% in pystone ). Since GC rarely changes (it's one of the dark corners of the project that takes work to understand and care to change) I end up with some new contrived script every time I want to exercise it with benchmarks.
Would the perftest suite be the right place for a GC benchmark? Given how painful GC is on PSRAM (granted, MicroPython never envisioned leaping from 192k to 8MB RAM) it might be worth measuring it specifically.
Trade-offs and Alternatives
This feature is gated behind a flag since it uses about 112 bytes of flash. It is particularly performant for the large allocations - such as heavy bytearrays or buffers - that we're using in our graphics API, but does add a very small cost to
gc.collect(). This is very much an enable-if-you-use-PSRAM change, since GC is absolutely pathological when you start storing megabytes of buffers.I have another chain to add a new no-scan bitflag which gives an orders of magnitude improvement for GC on large PSRAM buffers. Again extremely rigorously tested and benchmarked by every angle I can contrive- this PR is the cherry on top, but also self contained and the less invasive of the two changes.
A weekend of crunching, tests and tweaks has left me with something I'm not completely terrified to raise in a PR. Nonetheless I would appreciate some thoughts/independent verification on this to catch anything I might have missed.
Generative AI
I used generative AI tools when creating this PR, but a human has checked the
code and is responsible for the code and the description above.
(and of course I forgot to
--signoff🤦)