Core/fix src invalid utf by Josverl · Pull Request #19321 · micropython/micropython

Josverl · 2026-06-07T21:42:30Z

Summary

This PR has been split off from PR #18854 and should be based on it.

Only the last 3 commits are unique to this PR.

It includes the changes that fix #17855 (all changes to py/objexcept.c, py/gc.h and py/gc.c). Because these fix a very niche bug, they are split out here so they can be evaluated independently, in particular the code size change.

Testing

The relevant tests are included in this PR.

Generative AI

I used generative AI tools when creating this PR, but a human has checked the
code and is responsible for the code and the description above.

Added feature detection at the start of the bytes_decode_errors.py test to skip gracefully when the decode method is not available (requires MICROPY_CPYTHON_COMPAT). This fixes test failures on minimal builds and Windows builds that may not have this feature enabled.

Test now:
- Checks if decode method exists before running tests
- Prints "SKIP" and exits cleanly if decode is not available
- Works correctly on both full-featured and minimal builds

Verified:
- Standard unix build: All tests pass (14 testcases)
- Minimal unix build: Test skips cleanly
- All bytes/bytearray/string tests pass (82 tests, 2191 testcases)

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
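The skip logic described above can be sketched roughly as follows. This is a minimal illustration rather than the contents of bytes_decode_errors.py; the helper name `decode_available` is hypothetical, and the sketch assumes the test runner treats a lone "SKIP" line as a skipped test, as the commit message implies.

```python
def decode_available():
    """Return True when bytes objects expose a decode method."""
    try:
        b"".decode  # attribute lookup only; nothing is decoded here
    except AttributeError:
        return False
    return True


def main():
    if not decode_available():
        # Signal a skip to the test runner and stop without an error.
        print("SKIP")
        raise SystemExit
    # Placeholder for the real test body (hypothetical content).
    print(b"abc".decode("utf-8"))


if __name__ == "__main__":
    main()
```

Using try/except around the attribute lookup, instead of a helper such as hasattr, keeps the check usable on builds where optional builtins may be missing.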

Removed the dead \U%08x branch in uni_print_quoted. Characters ≥ 0x110000 are impossible in valid UTF-8, so the branch was unreachable. It's replaced by a single else that handles surrogates (0xD800–0xDFFF) with \u%04x.

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
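The two facts this commit message relies on can be checked on CPython. The snippet below is only an illustration: it does not reproduce uni_print_quoted, and the helper name `escape_four_digits` is hypothetical.

```python
def escape_four_digits(cp):
    # Apply the four-hex-digit escape pattern named in the commit message.
    return "\\u%04x" % cp


# Code points at or above 0x110000 do not exist, so chr() rejects them.
try:
    chr(0x110000)
    beyond_range_ok = True
except ValueError:
    beyond_range_ok = False

# A surrogate formatted with the four-digit pattern.
surrogate_escape = escape_four_digits(0xD800)

# CPython's repr() also uses the four-digit form for a lone surrogate.
cpython_repr = repr("\ud800")

print(beyond_range_ok, surrogate_escape, cpython_repr)
```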

The MP_IS_COMPRESSED_ROM_STRING macro in qstr.h only checks whether the first byte of a string is 0xff (the compression marker). This caused user-allocated strings on the heap that happened to start with 0xff (a byte that never occurs in valid UTF-8) to be incorrectly treated as compressed ROM strings.

Modified decompress_error_text_maybe() to add heap pointer validation before attempting decompression. The fix checks whether the pointer is in the GC heap: if it is, it cannot be a ROM compressed string and must not be decompressed. The validation uses the same logic as the VERIFY_PTR macro from gc.c.

Alternative to: micropython#17862
Fixes: micropython#17855

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
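The decision described in this commit message can be modelled in a few lines of Python. This is a sketch of the logic only, not the C code in py/objexcept.c or py/gc.c: all function names, the integer addresses, and the single contiguous heap range are assumptions made for illustration, and the model checks only the address range, whereas the real validation is said to follow VERIFY_PTR.

```python
COMPRESSION_MARKER = 0xFF  # first byte that marks a compressed ROM string


def starts_with_marker(data):
    # Model of the first-byte test: true when the string begins with 0xff.
    return len(data) > 0 and data[0] == COMPRESSION_MARKER


def in_gc_heap(addr, heap_start, heap_end):
    # Assumption: one contiguous heap area, modelled as a half-open range.
    return heap_start <= addr < heap_end


def should_decompress(addr, data, heap_start, heap_end):
    # A string stored in the GC heap cannot be a compressed ROM string,
    # so the heap check takes priority over the marker byte.
    if in_gc_heap(addr, heap_start, heap_end):
        return False
    return starts_with_marker(data)


HEAP = (0x20000000, 0x20010000)  # hypothetical heap bounds
print(should_decompress(0x08000000, b"\xff\x01", *HEAP))  # ROM address
print(should_decompress(0x20000100, b"\xffabc", *HEAP))   # heap address
```

Checking the heap first means a user string that merely begins with 0xff is left untouched, which is the behaviour the fix aims for.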

codecov · 2026-06-07T21:55:04Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.48%. Comparing base (dc33f04) to head (0aa9852).
⚠️ Report is 71 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #19321   +/-   ##
=======================================
  Coverage   98.47%   98.48%           
=======================================
  Files         176      176           
  Lines       22845    22905   +60     
=======================================
+ Hits        22497    22558   +61     
+ Misses        348      347    -1

github-actions · 2026-06-07T21:59:35Z

Code size report:

Reference:  samd/mphalport: Run events at least once in mp_hal_delay_ms. [af38ee1]
Comparison: tests: Test exception handling with heap-allocated unicode-like strings. [merge of 0aa9852]
  mpy-cross:  +464 +0.122% 
   bare-arm:    +0 +0.000% 
minimal x86:   +20 +0.011% 
   unix x64:  +864 +0.101% standard
      stm32:  +472 +0.117% PYBV10
      esp32:  +712 +0.041% ESP32_GENERIC[incl +48(data)]
     mimxrt:  +416 +0.106% TEENSY40
        rp2:  +488 +0.053% RPI_PICO_W
       samd:  +416 +0.150% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:  +610 +0.133% VIRT_RV32

Josverl and others added 26 commits May 20, 2026 20:09

tests/basics: Add bytes.decode() tests.

81a6910

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>

py/config: Add MICROPY_PY_BUILTINS_BYTES_DECODE_REPLACE.

c856515

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>

py/unicode: Implement bytes.decode() 'ignore' and 'replace' modes.

82d6e5d

Raises LookupError for unimplemented error handlers.
Improves repr() rendering for Unicode.

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
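For reference, this is how the 'ignore' and 'replace' handlers and an unknown handler name behave on CPython, which MicroPython tests are normally compared against. It is a sketch of the reference behaviour only; what a given MicroPython build prints may depend on its configuration, and the helper name `decode_with` is hypothetical.

```python
data = b"abc\xffdef"  # 0xff is never valid in UTF-8


def decode_with(errors):
    # Decode the sample bytes with the given error handler name.
    return data.decode("utf-8", errors)


ignored = decode_with("ignore")    # the invalid byte is dropped
replaced = decode_with("replace")  # the invalid byte becomes U+FFFD

try:
    decode_with("no-such-handler")
except LookupError as exc:
    unknown = type(exc).__name__
else:
    unknown = None

print(repr(ignored), repr(replaced), unknown)
```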

tests/basics/bytes_decode_encoding: Add tests to validate encoding.

d34289e

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>

tests/unicode: Add tests for unicode character formatting.

0fe170b

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>

tests/unicode: Tests exception handling for strings starting with 0xff.

dae625b

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>

py/objstr: Validate encoding for decode and encode.

527b618

Only accepts `utf-8`, `utf8` or `ascii`.

Fixes micropython#15849
Signed-off-by: Jos Verlinde <Jos_Verlinde@hotmail.com>

py/objstr: Enhance utf-8 character handling in string formatting.

2e27300

Fixes: issue 3364
Fixes: issue 13084
Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>

py/objstr: Fix str_center for Unicode strings.

57bb48f

Fixes issue 17827
Signed-off-by: Jos Verlinde <Jos_Verlinde@hotmail.com>
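The commit message does not describe the failure itself, so the snippet below only shows the CPython result for centering a non-ASCII string, where the padding is computed from the character count rather than the encoded byte count. It is a reference illustration, not a copy of the PR's test.

```python
s = "é"  # one character, two bytes when encoded as UTF-8

char_len = len(s)
byte_len = len(s.encode("utf-8"))

# Width is measured in characters, so four fill characters are added.
centered = s.center(5, "*")

print(char_len, byte_len, centered)
```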

docs: Document Unicode support and limitations.

77f6830

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>

tests/unicode: Remove known differences from test.

b513704

Prevent the test from failing by not testing known unsupported characters. These will be documented in a cpydiff test.

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>

tests/cpydiff: Document unicode differences.

57e679e

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>

tests/basics/bytes_decode: Split tests for ignore/replace.

4e9a5ec

This allows simpler skipping of tests based on enabled capabilities.

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>

tests/unicode/unicode_char_format: Test unicode character formatting.

0c3dd02

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>

test/cpydiff: Add tests to document Unicode differences.

8fae63d

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>

tests: Update t-string test cases for unicode.

5c676e1

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>

tests: Enhance byte and string decoding tests.

17d1518

Added multi-byte sequences to improve test coverage.

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>

py/objstr: Optimize character handling and encoding validation.

2f93c22

Signed-off-by: Jos Verlinde <Jos_Verlinde@hotmail.com>

run-tests: Specify UTF-8 encoding when opening test files.

7e13ed0

+ Correct a few typos in comments.

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>

py/objstr: Refactor to use mp_print_char helper function.

4d26ee4

Signed-off-by: Jos Verlinde <Jos_Verlinde@hotmail.com>

refactor: Use QSTR and common error message.

e6e4846

Signed-off-by: Jos Verlinde <Jos_Verlinde@hotmail.com>

tests/exception_splitheap: Test Exceptions with ROM strings.

8722186

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>

tests: Test exception handling with heap-allocated unicode-like strings.

0aa9852

Signed-off-by: Jos Verlinde <Jos_Verlinde@hotmail.com>

Josverl mentioned this pull request Jun 7, 2026

Improving Unicode support in MicroPython. #18854

Open

Josverl wants to merge 26 commits into micropython:master from Josverl:core/fix_src_invalid_utf
