Improving Unicode support in MicroPython. #18854

Open
Josverl wants to merge 23 commits into micropython:master from Josverl:core/unicode
@Josverl
Contributor

@Josverl Josverl commented Feb 20, 2026

Summary

Improving Unicode support in MicroPython is not just a technical improvement—it's a commitment to global accessibility, educational equity, and inclusive computing. With a limited memory cost of ~0.05%, this enhancement removes barriers for users worldwide and aligns MicroPython with CPython compatibility goals while staying true to its mission of bringing Python to everyone, everywhere. PEP 3131 (2007) established Unicode identifiers as a Python standard, UTF-8 is the universal encoding for modern software development, and MicroPython should support these standards where possible.

This pull request addresses multiple issues related to the handling of Unicode characters and encoding in MicroPython.

  • Implements validation for the bytes.decode() method to ensure only valid encodings are accepted, specifically utf-8, utf8, and ascii.
  • Introduces support for the 'ignore' and 'replace' error handlers in bytes.decode(), fixing issues where invalid encodings did not raise appropriate exceptions. The API is now CPython compatible, except that it does not accept keyword arguments.
  • Enhances string formatting to correctly handle multi-byte UTF-8 characters, addressing issues where character codes greater than 127 were truncated.
  • Fixes the str.center() method to accurately count Unicode characters instead of bytes, ensuring proper padding for multi-byte characters.
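
For reference, the sketch below shows what CPython does for the bytes.decode() calls listed above. It runs on CPython as written; the comments describe CPython results only. How a given MicroPython build behaves depends on the configuration options discussed later in this thread, and the exception type MicroPython raises for an unknown encoding is not stated here, so that part is an assumption carried over from CPython.

```python
# Reference behaviour on CPython for the bytes.decode() calls this PR touches.
data = b"caf\xc3\xa9 \xff!"   # valid UTF-8 for "cafe" with e-acute, then an invalid byte 0xff

# Error handlers are passed positionally here, because the PR notes that
# keyword arguments are not accepted on MicroPython.
ignored = data.decode("utf-8", "ignore")    # invalid byte is dropped
replaced = data.decode("utf-8", "replace")  # invalid byte becomes U+FFFD

# Strict decoding (the default) rejects the invalid byte.
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeError:
    strict_failed = True

# An unknown encoding name is rejected (LookupError on CPython).
try:
    b"abc".decode("no-such-codec")
    unknown_rejected = False
except LookupError:
    unknown_rejected = True

# ascii() keeps the output printable on consoles without UTF-8 support.
print(ascii(ignored), ascii(replaced), strict_failed, unknown_rejected)
```

Using the positional form keeps the same source usable on both interpreters.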

Fixes: #15849
Fixes: #3364
Fixes: #17827

Related PRs:

This is a re-submit of #18670 which was unrecoverably closed due to user error.

Testing

Testing has been conducted across various platforms, including ESP32, RP2, and Unix, with all relevant tests passing. This includes new tests for encoding validation, error handling, and Unicode character formatting.

As Unicode offers a very large set of code points, I have based the testing on Unicode test data in 127 languages and script combinations, available in unicode_mpy.
Using this test set has allowed me to find additional issues that had not yet been reported.

The new tests added to the MicroPython test suite are based on the examples provided in the issues, and on issues found through this test set.
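
A minimal sketch of the kind of check described for character formatting and str.center() might look like the following. It is not taken from the PR's test files; the sample strings are made up for illustration, and the values in the comments are what CPython produces.

```python
# Character formatting with code points above 127.
euro = "%c" % 0x20AC            # U+20AC EURO SIGN, 3 bytes in UTF-8
e_acute = "{:c}".format(0xE9)   # U+00E9, 2 bytes in UTF-8

# str.center() should pad by character count, not by encoded byte count.
word = "h\u00e9llo"             # 5 characters, 6 bytes in UTF-8
centered = word.center(9, "*")  # 2 fill characters on each side

# ascii() keeps the output printable on consoles without UTF-8 support.
print(ascii(euro), ascii(e_acute), ascii(centered))
print(len(word), len(word.encode("utf-8")), len(centered))
```

If padding were computed from the byte length, the centered result would be one character short of the requested width.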

Trade-offs and Alternatives

  • Currently the unicode functionality is enabled progressively based on MICROPY_CONFIG_ROM_LEVEL to reduce memory impact on constrained systems. Perhaps it should be made available on all levels, and only disabled explicitly for the most constrained systems.
  • bytes.decode() does not accept kwargs to reduce firmware size.
  • The str.center() method does not currently handle the width correctly for multi-byte Unicode characters.
  • The current focus is on Left-to-Right (LTR) scripts. Specific functionality for RTL scripts has not been verified.
  • There are additional Unicode issues with the MicroPython REPL that are not part of this PR. The handling of characters with differing widths requires lookup tables that take about 88kb, which is a significant size.
  • Also, the current test tooling cannot run tests in the REPL context.

These issues can be addressed in future PRs.
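
To illustrate why display width needs lookup tables, the sketch below uses CPython's unicodedata module, which carries exactly that kind of table. This is a desktop-side illustration only: unicodedata is not part of this PR and is assumed to be unavailable on typical MicroPython builds, and the display_width helper is a hypothetical name.

```python
import unicodedata

def display_width(text):
    """Approximate terminal columns: wide/fullwidth characters count as 2."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks occupy no column of their own
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

# Three samples: ASCII, Latin with an accent, and three CJK characters.
samples = ["abc", "h\u00e9llo", "\u65e5\u672c\u8a9e"]
for s in samples:
    # columns: text, characters, UTF-8 bytes, approximate display columns
    print(ascii(s), len(s), len(s.encode("utf-8")), display_width(s))
```

The three counts (characters, bytes, columns) differ for the CJK sample, and only the last one requires per-character width data.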

@Josverl Josverl added the py-core (Relates to py/ directory in source) and unicode (Bugs and enhancements related to Unicode/UTF-8 support) labels Feb 20, 2026
@Josverl Josverl requested a review from projectgus February 20, 2026 23:02
@github-actions

github-actions Bot commented Feb 20, 2026

Code size report:

Reference:  samd/mphalport: Run events at least once in mp_hal_delay_ms. [af38ee1]
Comparison: refactor: Use QSTR and common error message. [merge of e6e4846]
  mpy-cross:  +464 +0.122% 
   bare-arm:    +0 +0.000% 
minimal x86:    -4 -0.002% 
   unix x64:  +848 +0.099% standard
      stm32:  +456 +0.113% PYBV10
      esp32:  +668 +0.038% ESP32_GENERIC[incl +48(data)]
     mimxrt:  +392 +0.100% TEENSY40
        rp2:  +464 +0.050% RPI_PICO_W
       samd:  +416 +0.150% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:  +582 +0.127% VIRT_RV32
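
The summary quotes a cost of ~0.05%, which matches the rp2 row. As a rough way to see the spread across ports, the report rows above can be parsed and compared. This is only a convenience sketch over the numbers already shown, not part of the project's tooling.

```python
# Rows copied from the code size report above.
report = """\
  mpy-cross:  +464 +0.122%
   bare-arm:    +0 +0.000%
minimal x86:    -4 -0.002%
   unix x64:  +848 +0.099% standard
      stm32:  +456 +0.113% PYBV10
      esp32:  +668 +0.038% ESP32_GENERIC[incl +48(data)]
     mimxrt:  +392 +0.100% TEENSY40
        rp2:  +464 +0.050% RPI_PICO_W
       samd:  +416 +0.150% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:  +582 +0.127% VIRT_RV32
"""

rows = []
for line in report.splitlines():
    target, rest = line.split(":", 1)
    fields = rest.split()
    # (target name, size delta in bytes, relative change in percent)
    rows.append((target.strip(), int(fields[0]), float(fields[1].rstrip("%"))))

largest = max(rows, key=lambda r: r[2])
smallest = min(rows, key=lambda r: r[2])
print("largest relative growth: ", largest)
print("smallest relative growth:", smallest)
```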

@tpbrisco

@Josverl - the only changes I see are in the test suite? If you have a fork elsewhere, I can test any updated mqtt code. Over the weekend, I figured out that coercing the parameters in umqtt.publish to bytestrings works around the issue - so that if you have a fix for unicode handling, that's likely the right direction. Looking at the umqtt library, it looked like type validations in the methods would highlight the type mismatches and throw better errors, or ....
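
The workaround mentioned above, coercing arguments to bytestrings before publishing, could be sketched as follows. The to_bytes helper and the sample topic and payload are hypothetical, and the publish call is left as a comment because it needs a connected client.

```python
def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding str and passing bytes-like through."""
    if isinstance(value, str):
        return value.encode(encoding)
    return bytes(value)

topic = "sensors/caf\u00e9"      # contains one 2-byte character
payload = "21.5 \u00b0C"         # the degree sign is also 2 bytes in UTF-8

topic_b = to_bytes(topic)
payload_b = to_bytes(payload)

# Character count and encoded byte count differ for non-ASCII text, so a
# length computed from the str can disagree with the bytes actually sent.
print(len(topic), len(topic_b))
print(len(payload), len(payload_b))

# Hypothetical usage with an already-connected umqtt client object:
# client.publish(topic_b, payload_b)
```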

@Josverl
Contributor Author

Josverl commented Mar 10, 2026

@tpbrisco,
You are correct that I added and updated several tests,
but I see updates in:

it looked like type validations in the methods would highlight the type mismatches and throw better errors,

Adding static typing can help spot errors, but that still assumes that the runtime can handle UTF-8.

@Josverl
Contributor Author

Josverl commented Mar 10, 2026

Strangely, coverage is failing on a test that was only introduced after the base of this PR.
Something strange must have happened when I deleted and then resurrected my fork.

@codecov

codecov Bot commented Mar 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.48%. Comparing base (dc33f04) to head (68ebe22).
⚠️ Report is 71 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #18854   +/-   ##
=======================================
  Coverage   98.47%   98.48%           
=======================================
  Files         176      176           
  Lines       22845    22900   +55     
=======================================
+ Hits        22497    22553   +56     
+ Misses        348      347    -1     

Comment thread py/objstr.c Outdated
@Josverl Josverl force-pushed the core/unicode branch 2 times, most recently from f2bbd3f to 39688d1 Compare April 2, 2026 21:03
Contributor

@projectgus projectgus left a comment

This is really impressive, @Josverl, and I'm also impressed at the amount of test coverage added. It would be great to have this additional unicode support in MicroPython.

I have a few relatively minor comments inline, just small things.

The one major question I have is about build/test coverage - how many permutations of the new config items are we building & testing now, and is that enough?

Comment thread py/mpconfig.h Outdated
Comment thread py/objstr.c
Comment thread py/objstr.c
Comment thread py/objstr.c
Comment thread tests/micropython/exception_split_heap.py Outdated
@Josverl Josverl force-pushed the core/unicode branch 4 times, most recently from ed4c42c to fb2ddb1 Compare May 1, 2026 16:15
@Josverl
Contributor Author

Josverl commented May 1, 2026

impressed at the amount of test coverage added.

😁 Thanks, TDD really does work if the problem is clear. However, now I really can't let that last uncovered line slip by, so I'm adding one more commit.

The one major question I have is about build/test coverage - how many permutations of the new config items are we building & testing now, and is that enough?

TL;DR

Looking at the three options that are (I think) most relevant:
MICROPY_PY_BUILTINS_STR_UNICODE, MICROPY_PY_BUILTINS_STR_UNICODE_CHECK, MICROPY_PY_BUILTINS_BYTES_DECODE_ERRORS
All three currently track the ROM level, so in practice they're either all‑on or all‑off on any given build — no port flips them independently.
There is no CI job that builds with e.g. STR_UNICODE=1 + BYTES_DECODE_ERRORS=0, or STR_UNICODE=1 + STR_UNICODE_CHECK=0. Because the new BYTES_DECODE_ERRORS defaults to the ROM level, the only way to hit 1/1/0 or 1/0/0 today is a manual CFLAGS override.

So effectively the current test suite only tests all-on and all-off, plus a few ports that always enable STR_UNICODE.

Additional board variants that "mix things up", with the test suite run against them, would guard against regressions in individual feature enablement.

See unicode_testing for initial experiments, based on top of this PR.
No additional regressions were found, but several tests are quite naive in their assumptions about feature availability.
Porting tests to unittest does help - but some

Unicode test matrix

🪄AI Research:

Config switches touched / referenced

| Option | Status in PR | Default |
|---|---|---|
| MICROPY_PY_BUILTINS_STR_UNICODE | existing, code paths heavily exercised | ON at ROM_LEVEL ≥ EXTRA_FEATURES |
| MICROPY_PY_BUILTINS_STR_UNICODE_CHECK | existing, used in new validation paths in objstr.c | follows STR_UNICODE |
| MICROPY_PY_BUILTINS_BYTES_DECODE_ERRORS | new in this PR (mpconfig.h) | ON at ROM_LEVEL ≥ EXTRA_FEATURES |

All three currently track the ROM level, so in practice they're either all‑on or all‑off on any given build — no port flips them independently.

Where each state is built and tested in CI

Mapping ci.sh + .github/workflows/*.yml against the per‑port MICROPY_CONFIG_ROM_LEVEL:

| CI job | Variant / board | STR_UNICODE / CHECK / DECODE_ERRORS | Tests executed? |
|---|---|---|---|
| ports_unix.minimal | unix minimal (MINIMUM) | OFF / OFF / OFF | yes (reduced suite) |
| ports_unix.standard / standard_v2 | unix standard (EXTRA) | ON / ON / ON | yes (full) |
| ports_unix.coverage / coverage_32bit | unix coverage (EVERYTHING) | ON / ON / ON | yes (full) |
| ports_unix.nanbox / longlong / float / float_clang / gil_enabled / stackless_clang / settrace_stackless / repr_b | unix EXTRA/EXTRA+overrides | ON / ON / ON | yes |
| ports_unix.qemu_arm / qemu_mips / qemu_riscv64 | unix standard cross | ON / ON / ON | yes |
| ports_zephyr (qemu_cortex_m3) | zephyr default (BASIC_FEATURES) | OFF / OFF / OFF | yes |
| ports_qemu (Cortex-M, RV32, RV64, sabrelite, bigendian) | qemu (EXTRA) | ON / ON / ON | yes |
| ports_webassembly | pyscript (FULL_FEATURES) | ON / ON / ON | yes |
| ports_esp32 (S3/C3/C2/C5/C6/P4/S2 spiram) | EXTRA | ON / ON / ON | build-only |
| ports_esp8266 | EXTRA | ON / ON / ON | build-only |
| ports_rp2 | EXTRA | ON / ON / ON | build-only |
| ports_stm32 (pyb / nucleo / misc, incl. B_L072Z_LRWAN1 CORE_FEATURES) | mixed | mostly ON, B_L072Z_LRWAN1=OFF | build-only |
| ports_mimxrt | FULL_FEATURES | ON | build-only |
| ports_samd | samd21=BASIC / samd51=FULL | OFF / ON | build-only |
| ports_alif, ports_renesas-ra, ports_nrf, ports_cc3200, ports_powerpc, ports_windows | mixed (nrf/cc3200/windows force STR_UNICODE=1 regardless of ROM level) | varies | build-only |

run-tests.py auto‑detects unicode from the running interpreter (detect_test_platform, run-tests.py) and only walks unicode when args.unicode is true (run-tests.py). The new bytes_decode_ignore.py / bytes_decode_replace.py self‑SKIP when BYTES_DECODE_ERRORS=0.

What the suite actually exercises

  • All‑ON path (STR_UNICODE=1, STR_UNICODE_CHECK=1, BYTES_DECODE_ERRORS=1): well covered — unix standard / coverage / nanbox / longlong / float / repr_b, all qemu jobs, webassembly, plus build‑only assurance on every hardware port at EXTRA/FULL. The new tests (exception_invalid_utf8.py, str_center.py, unicode_char_format.py, bytes_decode_*, cpydiff/types_bytes_decode_*) execute here.
  • All‑OFF path: covered by ports_unix.minimal and ports_zephyr (BASIC). Both run the main suite, and the new bytes/unicode tests skip cleanly. samd21, bare-arm, B_L072Z_LRWAN1, and minimal get build‑only coverage of the OFF code paths.
  • Mixed permutations: not tested anywhere. There is no CI job that builds with e.g. STR_UNICODE=1 + BYTES_DECODE_ERRORS=0, or STR_UNICODE=1 + STR_UNICODE_CHECK=0. Because the new BYTES_DECODE_ERRORS defaults to the ROM level, the only way to hit 1/1/0 or 1/0/0 today is a manual CFLAGS override.
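
The gap described in the bullets above can be enumerated mechanically. The sketch below lists every on/off combination of the three options and filters out those covered by CI according to this analysis. The rule in is_meaningful is an assumption made for the sketch (that the two dependent options only matter when STR_UNICODE is enabled), not something confirmed by the source.

```python
from itertools import product

FLAGS = ("STR_UNICODE", "STR_UNICODE_CHECK", "BYTES_DECODE_ERRORS")

# Combinations that, per the analysis above, are built and tested in CI.
TESTED = {(1, 1, 1), (0, 0, 0)}

def is_meaningful(combo):
    """Assumption for this sketch: the two dependent options only make
    sense when STR_UNICODE itself is enabled."""
    unicode_on, check, decode_errors = combo
    return unicode_on == 1 or (check == 0 and decode_errors == 0)

untested = [c for c in product((1, 0), repeat=3)
            if is_meaningful(c) and c not in TESTED]

print(" / ".join(FLAGS))
for combo in untested:
    print("/".join(str(v) for v in combo))
```

Under that assumption the list contains three combinations, including the 1/1/0 and 1/0/0 cases named above.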

Suggested CI improvements

  1. Add a "unicode permutations" build job — cheapest, highest value. In ci.sh add one or two unix variants (or just CFLAGS_EXTRA= overrides on the existing standard build) that compile with:

    • -DMICROPY_PY_BUILTINS_BYTES_DECODE_ERRORS=0 (with STR_UNICODE=1) — proves the new option's OFF branch still compiles and the bytes_decode_* self‑skip works.
    • -DMICROPY_PY_BUILTINS_STR_UNICODE_CHECK=0 (with STR_UNICODE=1) — proves the validation guards in objstr.c (decode/encode/format) still build and behave correctly when checks are disabled.
    • Optionally -DMICROPY_PY_BUILTINS_STR_UNICODE=1 -DMICROPY_PY_BUILTINS_STR_UNICODE_CHECK=1 -DMICROPY_PY_BUILTINS_BYTES_DECODE_ERRORS=1 on top of minimal to verify forced‑ON on a low ROM level.
      These can be quick build‑and‑run jobs reusing the existing unix Makefile, no new variant directory required.
  2. Run the test suite on at least one BASIC/MINIMUM hardware build — currently only zephyr-on-qemu does. Adding a qemu samd21 or similar would harden the OFF path against regressions on real microcontroller layouts (different mp_int_t, alignment, OBJ_REPR).

  3. Verify the unix_minimal test selection actually exercises the new tests as SKIP. The minimal run today uses the basics subset; a one‑line check that bytes_decode_ignore.py is included (or explicitly add it under unix_minimal_run_tests) ensures the OFF branch of the new option doesn't silently rot.

  4. Document the matrix in unicode_support.rst (already added by this PR) — adding a small "tested combinations" table makes the intent of STR_UNICODE_CHECK vs BYTES_DECODE_ERRORS decoupling explicit, and gives future contributors guidance when introducing further unicode‑related code paths.

  5. Long-term: the three flags being independent in the source but coupled in defaults is a maintenance hazard. Either (a) collapse STR_UNICODE_CHECK into STR_UNICODE if no port ever wants them split, or (b) keep them split and lock in the matrix with the build jobs from item 1 so the independence is real.

The minimum useful addition is item #1 — a few extra unix builds (~1‑2 minutes of CI each) cover the previously‑untested permutations introduced or affected by this PR.
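
As a rough sketch of item 1, the override builds could be described as data and rendered into make invocations. The job names below are made up, and the assumption that the unix port's Makefile honours CFLAGS_EXTRA comes from the suggestion text rather than from a checked build. The script only prints the commands; it does not run them.

```python
PREFIX = "MICROPY_PY_BUILTINS_"

# Overrides taken from suggestion 1; the job names are hypothetical.
JOBS = {
    "decode_errors_off": {"STR_UNICODE": 1, "BYTES_DECODE_ERRORS": 0},
    "unicode_check_off": {"STR_UNICODE": 1, "STR_UNICODE_CHECK": 0},
}

def cflags_extra(overrides):
    """Render overrides as -D options suitable for a CFLAGS_EXTRA value."""
    return " ".join(f"-D{PREFIX}{name}={value}"
                    for name, value in overrides.items())

def make_command(overrides, port_dir="ports/unix"):
    # Assumption: the port's Makefile passes CFLAGS_EXTRA to the compiler.
    return f'make -C {port_dir} CFLAGS_EXTRA="{cflags_extra(overrides)}"'

for name, overrides in JOBS.items():
    print(f"{name}: {make_command(overrides)}")
```

Keeping the combinations in one table makes it easy to see which permutations a CI job covers.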

@Josverl
Contributor Author

Josverl commented May 4, 2026

Another question is whether this feature's ROM level should be aligned with PR #19168, so that it is enabled by default together with Unicode, with the ability to override it for fine-tuning.

This commit changes the default level at which unicode is enabled from "extra" down to "basic".

@projectgus
Contributor

projectgus commented May 6, 2026

There is no CI job that builds with e.g. STR_UNICODE=1 + BYTES_DECODE_ERRORS=0, or STR_UNICODE=1 + STR_UNICODE_CHECK=0. Because the new BYTES_DECODE_ERRORS defaults to the ROM level, the only way to hit 1/1/0 or 1/0/0 today is a manual CFLAGS override.

Actually I think this is probably OK the way it is now, especially now that ignore+replace are enabled as one flag. Most builds will only use the ROM level, and the niche combinations here are unusual enough that they're probably not worth adding more complexity for.

Will leave for @dpgeorge to take another look from here.

@projectgus projectgus requested a review from dpgeorge May 6, 2026 07:04
@dpgeorge
Member

@Josverl please can you rebase this on latest master to resolve the conflict.

@dpgeorge dpgeorge added this to the release-1.29.0 milestone May 13, 2026
Josverl added 3 commits May 20, 2026 20:09
Added feature detection at the start of bytes_decode_errors.py test to
skip gracefully when decode method is not available.
(requires MICROPY_CPYTHON_COMPAT).

This fixes test failures on minimal builds and Windows builds that
may not have this feature enabled.

Test now:
- Checks if decode method exists before running tests
- Prints "SKIP" and exits cleanly if decode is not available
- Works correctly on both full-featured and minimal builds

Verified:
- Standard unix build: All tests pass (14 testcases)
- Minimal unix build: Test skips cleanly
- All bytes/bytearray/string tests pass (82 tests, 2191 testcases)

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
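
The feature detection described in the commit message above could be sketched like this. It is not the actual content of bytes_decode_errors.py; the helper names are hypothetical, and the probe for error-handler support assumes that a build without it either raises or returns something other than an empty string.

```python
def decode_available():
    """True if bytes objects expose a decode() method on this build."""
    return hasattr(b"", "decode")

def error_handlers_available():
    """Probe the 'ignore' handler with a single invalid byte (assumption:
    unsupported builds raise or return a non-empty result)."""
    try:
        return b"\xff".decode("utf-8", "ignore") == ""
    except (UnicodeError, ValueError, TypeError, NotImplementedError):
        return False

def should_skip():
    return not (decode_available() and error_handlers_available())

if should_skip():
    # Convention used by MicroPython tests: print SKIP and exit cleanly.
    print("SKIP")
    raise SystemExit

# The real test body would follow here.
print(b"ab\xffcd".decode("utf-8", "ignore"))
```

Probing the behaviour instead of checking a configuration name keeps the test independent of how the build was configured.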
Comment thread tests/unicode/str_center.py.exp Outdated
Comment thread tests/ports/unix/extra_coverage.py.exp
Comment thread py/objstr.c Outdated
Comment thread py/objstr.c Outdated
@dpgeorge
Member

dpgeorge commented Jun 5, 2026

@Josverl I'm working my way through this PR.

I really think the changes that fix #17855 (all changes to py/objexcept.c, py/gc.h and py/gc.c) should be moved to a separate PR. Those changes make up a significant part of the C-code changes in this PR and are fixing a very niche bug. So it would be best to have them in a separate PR (the fix and the related tests) so they can be evaluated independently (in particular the code size change).

Comment thread docs/reference/unicode_support.rst Outdated
Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
Member

@dpgeorge dpgeorge left a comment

Thanks for updating. I've made some more suggestions on ways to reduce code size.

Comment thread py/objstr.c Outdated
Comment thread py/objstr.c Outdated
Comment thread py/objstr.c
Comment thread py/objstr.c Outdated
Comment thread py/objstr.c Outdated
Josverl and others added 16 commits June 6, 2026 21:47
Only accepts `utf-8`, `utf8` or `ascii`

Fixes micropython#15849

Signed-off-by: Jos Verlinde <Jos_Verlinde@hotmail.com>
Fixes: issue 3364
Fixes: issue 13084

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
Fixes Issue 17827

Signed-off-by: Jos Verlinde <Jos_Verlinde@hotmail.com>
Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
Prevent the test from failing by not testing known unsupported characters.
These will be documented in a cpydiff test.

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
This allows simpler skipping of tests based on enabled capabilities.

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
Removed the dead \U%08x branch in uni_print_quoted.
Characters ≥ 0x110000 are impossible in valid UTF-8, so the branch
was unreachable. It's replaced by a single else that handles
surrogates (0xD800–0xDFFF) with \u%04x.

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
Added multi-byte sequences to improve test coverage.

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
Signed-off-by: Jos Verlinde <Jos_Verlinde@hotmail.com>
+ Correct a few typos in comments.

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
Signed-off-by: Jos Verlinde <Jos_Verlinde@hotmail.com>
Signed-off-by: Jos Verlinde <Jos_Verlinde@hotmail.com>
@Josverl
Contributor Author

Josverl commented Jun 6, 2026

@dpgeorge, Thanks for the tips, that does save 80 more bytes in addition to the earlier reductions.

I just kicked off a script to track the unix-standard changes using Membrowse.

I notice coverage is failing, but that seems to be a signature/upload problem. I'll retry it tomorrow.

@Josverl
Contributor Author

Josverl commented Jun 7, 2026

gpg: Signature made Tue Apr 21 19:28:03 2026 UTC
gpg: using RSA key 27034E7FDB850E0BBC2C62FF806BB28AED779869
gpg: Can't check signature: No public key
==> Could not verify signature. Please contact Codecov if problem continues
Exiting...

  • I have not yet merged the recent optimizations back into the respective commits, but am happy to do that once we are done optimizing.

Labels

py-core Relates to py/ directory in source unicode Bugs and enhancements related to Unicode/UTF-8 support.

4 participants