
py/gc: Track data and skip scan, MICROPY_GC_NO_SCAN #19367

Draft
Gadgetoid wants to merge 3 commits into micropython:master from pimoroni:gc-no-scan

@Gadgetoid

Contributor

Summary

Buffers don't contain pointers. Don't scan them for pointers. This was never really a problem until we had 8MB of PSRAM with in-RAM font data and images.

Currently this brings new m_new_no_scan() and m_malloc_no_scan() methods for data that's guaranteed (by a conscientious user who understands how painful use-after-free bugs are to trace) to contain absolutely no pointers. These mirror the existing methods, using a new GC_ALLOC_FLAG_NO_SCAN flag.

For better or worse we'll be carrying this change downstream for Tufty 2350, since GC hangups are painful when trying to hit 30-60FPS screen updates. This doesn't eliminate them, but turns a 300ms pause into a 30ms one, the rest of which is handled by #19363.

Might be of interest to @sfe-SparkFro

Testing

Aggressive multi-hour tests of both real-world examples (on Tufty 2350) and synthetic GC thrashing benchmarks.

Again this change does not make much of an impact on perfbench since we don't really benchmark GC, and in some cases it can cause a net loss (RP2040 XIP cache lottery).

Trade-offs and Alternatives

This feature sacrifices heap for the additional flag bit in the allocation table, and is thus disabled by default. I'd recommend everyone shipping a board with PSRAM enable it as a matter of course, and suggest leaving the RAM/performance tradeoff to the vendor of each board.
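
As a rough illustration of the heap cost, here is a back-of-envelope sketch of what a 1 bit/block table takes. The 16-byte block size is an assumption (four 4-byte words, typical of 32-bit ports) and `no_scan_table_bytes` is a hypothetical helper, not something in this PR. The real figure will differ slightly because the tables are carved out of the heap area itself.

```python
# Approximate heap overhead of a 1-bit-per-block table.
# Assumption: 16-byte GC blocks (4 words of 4 bytes), typical of 32-bit ports.
BYTES_PER_BLOCK = 16


def no_scan_table_bytes(heap_bytes, bytes_per_block=BYTES_PER_BLOCK):
    """Bytes needed for one flag bit per GC block (hypothetical helper)."""
    blocks = heap_bytes // bytes_per_block
    return (blocks + 7) // 8  # 1 bit per block, rounded up to whole bytes


psram = 8 * 1024 * 1024  # an 8MB heap, as on the PSRAM boards discussed here
print(no_scan_table_bytes(psram))          # 65536
print(no_scan_table_bytes(psram) / psram)  # 0.0078125
```

Under those assumptions the table costs about 64 KB on an 8MB heap, a little under 1%.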

This is a big change to a scary part of MicroPython and as such I'm raising it as a draft in the hope others will exercise it downstream and feed back. I don't expect or need it to be merged, but it's fun to share!

Generative AI

I used generative AI tools when creating this PR, but a human has checked the
code and is responsible for the code and the description above.

The mark phase conservatively scans every word of every reachable block
for pointers, so a large bytearray/array buffer is scanned in full on
every collection despite holding no pointers. Add an optional per-block
"no-scan table" (NTB, 1 bit/block, like the finaliser/weakref tables) and
a GC_ALLOC_FLAG_NO_SCAN; tagged head blocks are marked but their contents
are not scanned. A no-scan block has no child pointers, so the mark phase
also skips the chain-walk for it (n_blocks left 0) and avoids re-reading
the allocation table for every block of the buffer just to mark it - this
matters for large buffers in slow PSRAM. The tag is written on every
allocation (so a reused block never inherits a stale bit) and preserved
across realloc moves.

Exposed as m_new_no_scan() / m_malloc_no_scan(), which alias plain
m_new()/gc_alloc() when disabled, and gated behind MICROPY_GC_NO_SCAN
(default off). This commit adds the mechanism only; callers are converted
separately.

Signed-off-by: Phil Howard <github@gadgetoid.com>

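
To make the mechanism in the commit message above concrete, here is a toy Python model of a conservative mark phase with a per-allocation no-scan tag. It is a sketch only: the `Heap` class, the addresses and the word counts are all made up for illustration, and it does not mirror the actual data structures in py/gc.c.

```python
# Toy model of a conservative mark phase with an optional no-scan tag.
# NOT MicroPython's gc.c; the layout and all names here are hypothetical.

class Heap:
    def __init__(self):
        self.blocks = {}       # address -> list of words held by the allocation
        self.no_scan = set()   # addresses whose contents must not be scanned
        self.words_scanned = 0

    def alloc(self, addr, words, no_scan=False):
        self.blocks[addr] = list(words)
        # Write the tag on every allocation so a reused address never
        # inherits a stale bit from a previous owner.
        if no_scan:
            self.no_scan.add(addr)
        else:
            self.no_scan.discard(addr)

    def mark(self, roots):
        marked = set()
        stack = [r for r in roots if r in self.blocks]
        while stack:
            addr = stack.pop()
            if addr in marked:
                continue
            marked.add(addr)   # tagged allocations are still marked (kept alive)
            if addr in self.no_scan:
                continue       # ...but their contents are never scanned
            for word in self.blocks[addr]:
                self.words_scanned += 1
                # Conservative: any word that looks like a heap address
                # is treated as a pointer.
                if word in self.blocks and word not in marked:
                    stack.append(word)
        return marked


def demo(tag_buffer):
    heap = Heap()
    heap.alloc(0x100, [0x200, 0x300])   # list-like object holding two pointers
    heap.alloc(0x200, [7, 8, 9])        # small object with no pointers
    heap.alloc(0x300, [0] * 24576, no_scan=tag_buffer)  # 96 KB of 4-byte words
    marked = heap.mark([0x100])
    return marked, heap.words_scanned


print(demo(False)[1])  # 24581 words scanned
print(demo(True)[1])   # 5 words scanned
```

In both runs the same three allocations end up marked; only the amount of scanning changes. The model also shows why the tag has to be trustworthy: if a tagged buffer did hold the only pointer to another object, that object would never be marked and would be freed while still in use.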
Tag the buffers that only ever hold raw data (never heap pointers) with
m_new_no_scan(), so the GC mark phase skips scanning them once
MICROPY_GC_NO_SCAN is enabled (a no-op otherwise):

	py/objarray.c: array/bytearray item storage.
	py/objstr.c:   str/bytes payloads.
	py/vstr.c:     the vstr builder, growth via gc_realloc preserves the tag.

Signed-off-by: Phil Howard <github@gadgetoid.com>

For CI, build tests only.

Signed-off-by: Phil Howard <github@gadgetoid.com>

@github-actions


Code size report:

Reference:  tools/mpy_ld.py: Allow overriding the internal MPY file name. [b49f098]
Comparison: rp2: Enable no-scan GC. [merge of 1fb9532]
  mpy-cross:    +0 +0.000% 
   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:    +0 +0.000% standard
      stm32:    +0 +0.000% PYBV10
      esp32:    +0 +0.000% ESP32_GENERIC
     mimxrt:    +0 +0.000% TEENSY40
        rp2:  +108 +0.012% RPI_PICO_W[incl +4(bss)]
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:    +0 +0.000% VIRT_RV32

@codecov

codecov Bot commented Jun 22, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.51%. Comparing base (b49f098) to head (1fb9532).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #19367   +/-   ##
=======================================
  Coverage   98.51%   98.51%           
=======================================
  Files         176      176           
  Lines       22904    22905    +1     
=======================================
+ Hits        22563    22564    +1     
  Misses        341      341           

@Gadgetoid

Contributor Author

This incredibly unhelpful graph illustrates that there's a very real, if vanishingly small, cost to adding this change, affecting routine garbage collection in smaller memory environments.

It's only paid if the feature is turned on however, and the relative benefit for even modest buffers (96k is absolutely peanuts on 8MB PSRAM, but this was tested on an RP2040 to show the worst case tradeoff) far, far outweighs any cost.

image

And here's the same test on a Pico LiPo 2 with 8MB PSRAM enabled:

image

What each test measures

  • frame_manual - Per-frame time (us) of a render loop that allocates a 1 KB bytearray + a small list each frame and calls gc.collect() EVERY frame (predictable pacing). p50=median frame, p99/max=worst-case stalls.
  • frame_ondemand - Same render loop with no manual/threshold control - GC only fires when the heap fills (cheap average, large spikes).
  • frame_threshold - Same render loop, but using gc.threshold() auto-collection instead of collecting manually.
  • churn_per_alloc_ns - Per-allocation time (ns) churning 50000 transient 4-element lists - allocation throughput including amortised GC.
  • collect_empty_us - gc.collect() pause (us) on a near-empty heap - the fixed overhead of a collection.
  • collect_3000_lists_us - gc.collect() pause (us) with 3000 live small lists (real pointers) - these must be scanned, so no-scan cannot help.
  • free_start - Free heap (bytes) after a full collect at start. Higher is better; the no-scan table costs ~1 bit/block of heap.
  • collect_96k_bytearray_us - gc.collect() pause (us) with one live 96 KB bytearray. The headline case: a large pure-data buffer the mark phase would otherwise scan word-by-word for pointers.

Note that the 96k bytearray is only allocated in that specific test, since it would massively skew the other tests in favour of this feature being turned on.
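
For readers who want to reproduce something like collect_96k_bytearray_us, a minimal sketch of that kind of measurement follows. This is not the benchmark script used for the graphs; `collect_pause_us` and its parameters are hypothetical. It uses `time.ticks_us()` / `time.ticks_diff()` where available (MicroPython) and falls back to `time.perf_counter_ns()` elsewhere, so it also runs on CPython, where the numbers say nothing about MicroPython's mark phase.

```python
import gc
import time


def collect_pause_us(live_bytes=96 * 1024, repeats=5):
    """Time gc.collect() while one large bytearray is live (hypothetical helper)."""
    buf = bytearray(live_bytes)  # kept alive across every collection
    if hasattr(time, "ticks_us"):
        # MicroPython: wrap-safe microsecond ticks
        now = time.ticks_us
        diff = time.ticks_diff
    else:
        # CPython fallback, only so the sketch runs off-device
        now = lambda: time.perf_counter_ns() // 1000
        diff = lambda end, start: end - start
    samples = []
    for _ in range(repeats):
        t0 = now()
        gc.collect()
        samples.append(diff(now(), t0))
    assert len(buf) == live_bytes  # reference buf so it stays live until here
    return sorted(samples)


samples = collect_pause_us()
print("min/median/max us:", samples[0], samples[len(samples) // 2], samples[-1])
```

Running the same script on firmware built with and without MICROPY_GC_NO_SCAN should give a comparison in the spirit of the headline test, though absolute numbers will depend on the board and on what else is on the heap.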

@dpgeorge added the py-core label (Relates to py/ directory in source) on Jun 22, 2026

@Gadgetoid

Contributor Author

And since those were so bad, here's a graph which cuts right to the gist of the change:

image

Again I'm shooting low here: this is just the difference in collect times in SRAM, which probably makes it a reasonable sell even for a stock Pico 2 / Pico 2 W. PSRAM's speed (or lack thereof) compounds this effect.

@Gadgetoid

Contributor Author

Since this ties in strongly with the optimised tail scan (#19363), here's a graph of them working together, again just SRAM:

image
