Improving Unicode support in MicroPython. by Josverl · Pull Request #18854 · micropython/micropython

Josverl · 2026-02-20T23:01:52Z

Summary

Improving Unicode support in MicroPython is not just a technical improvement—it's a commitment to global accessibility, educational equity, and inclusive computing. With a limited memory cost of ~0.05%, this enhancement removes barriers for users worldwide and aligns MicroPython with CPython compatibility goals while staying true to its mission of bringing Python to everyone, everywhere. PEP 3131 (2007) established Unicode identifiers as a Python standard, UTF-8 is the universal encoding for modern software development, and MicroPython should support these standards where possible.

This pull request addresses multiple issues related to the handling of Unicode characters and encoding in MicroPython.

Implements validation for the bytes.decode() method to ensure only valid encodings are accepted, specifically utf-8, utf8, and ascii.
Introduces support for the 'ignore' and 'replace' error handlers in bytes.decode(), fixing issues where invalid encodings did not raise appropriate exceptions. The API is now CPython compatible, except that keyword arguments are not accepted.
Enhances string formatting to correctly handle multi-byte UTF-8 characters, addressing issues where character codes greater than 127 were truncated.
Fixes the str.center() method to accurately count Unicode characters instead of bytes, ensuring proper padding for multi-byte characters.
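
The behaviours listed above can be illustrated with a short sketch. It assumes a build with these features enabled and passes the error handler positionally, since keyword arguments are not accepted. Using `{:c}` and `%c` as the formatting path for code points above 127 is an assumption about which formatting operations are affected. The sample bytes and the helper name `decode_all` are made up for illustration.

```python
# Hypothetical sample: valid UTF-8 for "café " followed by an invalid byte 0xff.
RAW = b"caf\xc3\xa9 \xff!"


def decode_all(raw):
    """Decode raw with each error handler, passed positionally (no kwargs)."""
    results = {
        "ignore": raw.decode("utf-8", "ignore"),    # drop undecodable bytes
        "replace": raw.decode("utf-8", "replace"),  # substitute U+FFFD
    }
    try:
        raw.decode("utf-8")  # the default handler is strict
        results["strict"] = "ok"
    except UnicodeError:
        results["strict"] = "UnicodeError"
    return results


decoded = decode_all(RAW)

# str.center() should pad by character count, not by UTF-8 byte count.
centered = "日本".center(6, "-")

# Formatting a code point above 127 should yield the whole character.
euro = "{:c}".format(0x20AC)
e_acute = "%c" % 0xE9
```

On a build without the error handlers, the two-argument decode() calls may fail, so a real test would guard them with feature detection.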

Fixes: #15849
Fixes: #3364
Fixes: #17827

Related PRs:

Fix multiple unicode issues in mpremote. #18853

This is a re-submit of #18670 which was unrecoverably closed due to user error.

Testing

Testing has been conducted across various platforms, including ESP32, RP2, and Unix, with all relevant tests passing successfully. This includes new tests for encoding validation, error handling, and Unicode character formatting.

As Unicode offers a very large set of codepoints, I have based the testing on Unicode test data in 127 languages and script combinations that are available in unicode_mpy.
Using this test set has allowed me to find additional issues that were not yet reported.

The new tests that have been added to the MicroPython test suite are based on the examples provided in issues, and on issues found through this test set.

Trade-offs and Alternatives

Currently the unicode functionality is enabled progressively based on MICROPY_CONFIG_ROM_LEVEL to reduce memory impact on constrained systems. Perhaps it should be made available on all levels, and only disabled explicitly for the most constrained systems.
bytes.decode() does not accept kwargs to reduce firmware size.
The str.center() method does not currently handle the width correctly for multi-byte Unicode characters.
The current focus is on left-to-right (LTR) scripts; functionality specific to right-to-left (RTL) scripts has not been verified.
There are additional Unicode issues with the MicroPython REPL that are not part of this PR. The handling of characters with differentiated widths requires lookup tables that take about 88kb, which is a significant size.
Also, the current test tooling cannot run tests in the REPL context.

These issues can be addressed in future PRs.

github-actions · 2026-02-20T23:18:16Z

Code size report:

Reference:  samd/mphalport: Run events at least once in mp_hal_delay_ms. [af38ee1]
Comparison: refactor: Use QSTR and common error message. [merge of e6e4846]
  mpy-cross:  +464 +0.122% 
   bare-arm:    +0 +0.000% 
minimal x86:    -4 -0.002% 
   unix x64:  +848 +0.099% standard
      stm32:  +456 +0.113% PYBV10
      esp32:  +668 +0.038% ESP32_GENERIC[incl +48(data)]
     mimxrt:  +392 +0.100% TEENSY40
        rp2:  +464 +0.050% RPI_PICO_W
       samd:  +416 +0.150% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:  +582 +0.127% VIRT_RV32

tpbrisco · 2026-03-10T19:38:32Z

@Josverl - the only changes I see are in the test suite? If you have a fork elsewhere, I can test any updated mqtt code. Over the weekend, I figured out that coercing the parameters in umqtt.publish to bytestrings works around the issue - so that if you have a fix for unicode handling, that's likely the right direction. Looking at the umqtt library, it looked like type validations in the methods would highlight the type mismatches and throw better errors, or ....
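
The workaround described above (coercing the umqtt.publish parameters to bytestrings) might look like the following sketch. The helper name `to_bytes` is hypothetical and not part of umqtt; the example only shows the coercion and that character count and byte count differ for non-ASCII text.

```python
def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding str input (hypothetical helper)."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)
    if isinstance(value, str):
        return value.encode(encoding)
    raise TypeError("expected str or bytes-like, got %s" % type(value).__name__)


topic = to_bytes("sensors/température")
payload = to_bytes("21.5 °C")

# The character count and the encoded byte count differ for non-ASCII text.
char_count = len("21.5 °C")
byte_count = len(payload)
```

A caller would then pass `topic` and `payload` to the client's publish method instead of the original strings.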

Josverl · 2026-03-10T21:03:11Z

@tpbrisco,
You are correct that I added and updated several tests,
but I also see updates in:

gc.c / .h, mpconfig.h, objexcept.c, objstr.c and objstrunicode.c
49ac6c4, 8235713, 1310bf1

it looked like type validations in the methods would highlight the type mismatches and throw better errors,

Adding static typing can help spot errors, but that still assumes that the runtime can handle UTF-8.

Josverl · 2026-03-10T22:48:28Z

Strangely, coverage is failing on a test that was only introduced after the base of this PR.
Something strange must have happened when I deleted and then resurrected my fork.

codecov · 2026-03-20T22:48:08Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.48%. Comparing base (dc33f04) to head (68ebe22).
⚠️ Report is 71 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #18854   +/-   ##
=======================================
  Coverage   98.47%   98.48%           
=======================================
  Files         176      176           
  Lines       22845    22900   +55     
=======================================
+ Hits        22497    22553   +56     
+ Misses        348      347    -1

projectgus

This is really impressive, @Josverl, and I'm also impressed at the amount of test coverage added. It would be great to have this additional unicode support in MicroPython.

I have a few relatively minor comments inline, just small things.

The one major question I have is about build/test coverage - how many permutations of the new config items are we building & testing now, and is that enough?

Josverl · 2026-05-01T19:51:30Z

impressed at the amount of test coverage added.

😁 thanks, TDD really does work if the problem is clear. - However now I really can't let that last uncovered line slip by, so adding one more commit.

The one major question I have is about build/test coverage - how many permutations of the new config items are we building & testing now, and is that enough?

TL;DR

Looking at the following three options, which I think are the most relevant:
MICROPY_PY_BUILTINS_STR_UNICODE, MICROPY_PY_BUILTINS_STR_UNICODE_CHECK, MICROPY_PY_BUILTINS_BYTES_DECODE_ERRORS
All three currently track the ROM level, so in practice they're either all‑on or all‑off on any given build — no port flips them independently.
There is no CI job that builds with e.g. STR_UNICODE=1 + BYTES_DECODE_ERRORS=0, or STR_UNICODE=1 + STR_UNICODE_CHECK=0. Because the new BYTES_DECODE_ERRORS defaults to the ROM level, the only way to hit 1/1/0 or 1/0/0 today is a manual CFLAGS override.

So effectively the current test suite only tests all-on and all-off, plus a few ports that always enable STR_UNICODE.

Additional board variants that "mix things up", with the test suite run against them, would guard against regressions in individual feature enablement.

See unicode_testing for initial experiments, based on top of this PR.
No additional regressions were found, but several tests are quite naive in their assumptions about feature availability.
Porting tests to unittest does help - but some

Unicode test matrix

🪄AI Research:

Config switches touched / referenced

| Option | Status in PR | Default |
| --- | --- | --- |
| `MICROPY_PY_BUILTINS_STR_UNICODE` | existing, code paths heavily exercised | ON at ROM_LEVEL ≥ `EXTRA_FEATURES` |
| `MICROPY_PY_BUILTINS_STR_UNICODE_CHECK` | existing, used in new validation paths in objstr.c | follows `STR_UNICODE` |
| `MICROPY_PY_BUILTINS_BYTES_DECODE_ERRORS` | new in this PR (mpconfig.h) | ON at ROM_LEVEL ≥ `EXTRA_FEATURES` |

All three currently track the ROM level, so in practice they're either all‑on or all‑off on any given build — no port flips them independently.

Where each state is built and tested in CI

Mapping ci.sh + .github/workflows/*.yml against the per‑port MICROPY_CONFIG_ROM_LEVEL:

| CI job | Variant / board | STR_UNICODE / CHECK / DECODE_ERRORS | Tests executed? |
| --- | --- | --- | --- |
| `ports_unix.minimal` | unix `minimal` (`MINIMUM`) | OFF / OFF / OFF | yes (reduced suite) |
| `ports_unix.standard` / `standard_v2` | unix `standard` (`EXTRA`) | ON / ON / ON | yes (full) |
| `ports_unix.coverage` / `coverage_32bit` | unix `coverage` (`EVERYTHING`) | ON / ON / ON | yes (full) |
| `ports_unix.nanbox` / `longlong` / `float` / `float_clang` / `gil_enabled` / `stackless_clang` / `settrace_stackless` / `repr_b` | unix `EXTRA`/`EXTRA`+overrides | ON / ON / ON | yes |
| `ports_unix.qemu_arm` / `qemu_mips` / `qemu_riscv64` | unix `standard` cross | ON / ON / ON | yes |
| `ports_zephyr` (qemu_cortex_m3) | zephyr default (`BASIC_FEATURES`) | OFF / OFF / OFF | yes |
| `ports_qemu` (Cortex‑M, RV32, RV64, sabrelite, bigendian) | qemu (`EXTRA`) | ON / ON / ON | yes |
| `ports_webassembly` | pyscript (`FULL_FEATURES`) | ON / ON / ON | yes |
| `ports_esp32` (S3/C3/C2/C5/C6/P4/S2 spiram) | `EXTRA` | ON / ON / ON | build‑only |
| `ports_esp8266` | `EXTRA` | ON / ON / ON | build‑only |
| `ports_rp2` | `EXTRA` | ON / ON / ON | build‑only |
| `ports_stm32` (pyb / nucleo / misc, incl. `B_L072Z_LRWAN1` `CORE_FEATURES`) | mixed | mostly ON, `B_L072Z_LRWAN1`=OFF | build‑only |
| `ports_mimxrt` | `FULL_FEATURES` | ON | build‑only |
| `ports_samd` | samd21=`BASIC` / samd51=`FULL` | OFF / ON | build‑only |
| `ports_alif`, `ports_renesas-ra`, `ports_nrf`, `ports_cc3200`, `ports_powerpc`, `ports_windows` | mixed (nrf/cc3200/windows force `STR_UNICODE=1` regardless of ROM level) | varies | build‑only |

run-tests.py auto‑detects unicode from the running interpreter (detect_test_platform, run-tests.py) and only walks unicode when args.unicode is true (run-tests.py). The new bytes_decode_ignore.py / bytes_decode_replace.py self‑SKIP when BYTES_DECODE_ERRORS=0.
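
The self-skip behaviour could be sketched as below. This is an assumption about how such a probe might be written, not the code in the PR's tests: it tries the 'ignore' handler on an invalid byte and treats any exception as "not supported". The function name `decode_errors_supported` is hypothetical.

```python
def decode_errors_supported():
    """Best-effort probe for 'ignore' handler support (hypothetical helper)."""
    try:
        # With a working 'ignore' handler the invalid byte 0xff is dropped.
        return b"a\xffb".decode("utf-8", "ignore") == "ab"
    except Exception:
        # Any failure (missing method, rejected argument, decode error)
        # is treated as "feature not available".
        return False


if not decode_errors_supported():
    # MicroPython tests signal a skip by printing SKIP and exiting.
    print("SKIP")
    raise SystemExit

print("decode error handlers available")
```

Probing behaviour rather than reading a config value keeps the test independent of how the build was configured.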

What the suite actually exercises

All‑ON path (STR_UNICODE=1, STR_UNICODE_CHECK=1, BYTES_DECODE_ERRORS=1): well covered — unix standard / coverage / nanbox / longlong / float / repr_b, all qemu jobs, webassembly, plus build‑only assurance on every hardware port at EXTRA/FULL. The new tests (exception_invalid_utf8.py, str_center.py, unicode_char_format.py, bytes_decode_*, cpydiff/types_bytes_decode_*) execute here.
All‑OFF path: covered by ports_unix.minimal and ports_zephyr (BASIC). Both run the main suite, and the new bytes/unicode tests skip cleanly. samd21, bare-arm, B_L072Z_LRWAN1, and minimal get build‑only coverage of the OFF code paths.
Mixed permutations: not tested anywhere. There is no CI job that builds with e.g. STR_UNICODE=1 + BYTES_DECODE_ERRORS=0, or STR_UNICODE=1 + STR_UNICODE_CHECK=0. Because the new BYTES_DECODE_ERRORS defaults to the ROM level, the only way to hit 1/1/0 or 1/0/0 today is a manual CFLAGS override.
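
To make the gap concrete, the following sketch enumerates all combinations of the three options and lists those that the analysis above says no CI job runs tests for. The `TESTED` set simply encodes the all-ON and all-OFF paths described above; some of the remaining combinations (for example STR_UNICODE=0 with STR_UNICODE_CHECK=1) may not be meaningful builds, which is an assumption this sketch does not check.

```python
from itertools import product

FLAGS = ("STR_UNICODE", "STR_UNICODE_CHECK", "BYTES_DECODE_ERRORS")

# Combinations that CI runs tests against, per the analysis above.
TESTED = {(1, 1, 1), (0, 0, 0)}


def untested_permutations():
    """Return flag combinations that no CI job currently runs tests for."""
    return [c for c in product((1, 0), repeat=len(FLAGS)) if c not in TESTED]


for combo in untested_permutations():
    print(" ".join("%s=%d" % pair for pair in zip(FLAGS, combo)))
```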

Suggested CI improvements

Add a "unicode permutations" build job — cheapest, highest value. In ci.sh add one or two unix variants (or just CFLAGS_EXTRA= overrides on the existing standard build) that compile with:
- -DMICROPY_PY_BUILTINS_BYTES_DECODE_ERRORS=0 (with STR_UNICODE=1) — proves the new option's OFF branch still compiles and the bytes_decode_* self‑skip works.
- -DMICROPY_PY_BUILTINS_STR_UNICODE_CHECK=0 (with STR_UNICODE=1) — proves the validation guards in objstr.c (decode/encode/format) still build and behave correctly when checks are disabled.
- Optionally -DMICROPY_PY_BUILTINS_STR_UNICODE=1 -DMICROPY_PY_BUILTINS_STR_UNICODE_CHECK=1 -DMICROPY_PY_BUILTINS_BYTES_DECODE_ERRORS=1 on top of minimal to verify forced‑ON on a low ROM level.
  These can be quick build‑and‑run jobs reusing the existing unix Makefile, no new variant directory required.
Run the test suite on at least one BASIC/MINIMUM hardware build — currently only zephyr-on-qemu does. Adding a qemu samd21 or similar would harden the OFF path against regressions on real microcontroller layouts (different mp_int_t, alignment, OBJ_REPR).
Verify the unix_minimal test selection actually exercises the new tests as SKIP. The minimal run today uses the basics subset; a one‑line check that bytes_decode_ignore.py is included (or explicitly add it under unix_minimal_run_tests) ensures the OFF branch of the new option doesn't silently rot.
Document the matrix in unicode_support.rst (already added by this PR) — adding a small "tested combinations" table makes the intent of STR_UNICODE_CHECK vs BYTES_DECODE_ERRORS decoupling explicit, and gives future contributors guidance when introducing further unicode‑related code paths.
Long‑term: the three flags being independent in the source but coupled in defaults is a maintenance hazard. Either (a) collapse STR_UNICODE_CHECK into STR_UNICODE if no port ever wants them split, or (b) keep them split and lock in the matrix with the build jobs from item #1 so the independence is real.

The minimum useful addition is item #1 — a few extra unix builds (~1‑2 minutes of CI each) cover the previously‑untested permutations introduced or affected by this PR.
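
The suggested permutation builds could be generated with a small script like this sketch. The flag values come from the list above; the `make -C ports/unix VARIANT=standard` invocation and the helper name `build_command` are assumptions for illustration and would need checking against ci.sh.

```python
PREFIX = "-DMICROPY_PY_BUILTINS_"

# Overrides for the two mixed builds suggested above.
PERMUTATIONS = {
    "no_decode_errors": {"STR_UNICODE": 1, "BYTES_DECODE_ERRORS": 0},
    "no_unicode_check": {"STR_UNICODE": 1, "STR_UNICODE_CHECK": 0},
}


def build_command(overrides, variant="standard"):
    """Compose a make command line with the given -D overrides (assumed form)."""
    flags = " ".join(
        "%s%s=%d" % (PREFIX, name, value)
        for name, value in sorted(overrides.items())
    )
    return 'make -C ports/unix VARIANT=%s CFLAGS_EXTRA="%s"' % (variant, flags)


commands = {name: build_command(o) for name, o in PERMUTATIONS.items()}
for name in sorted(commands):
    print(name + ": " + commands[name])
```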

Josverl · 2026-05-04T15:50:19Z

Another question is whether this feature's ROM level should be aligned with PR #19168, so that it is enabled by default following Unicode, with the ability to override it for fine tuning.

This commit changes the default level at which unicode is enabled from "extra" down to "basic".

projectgus · 2026-05-06T07:04:46Z

There is no CI job that builds with e.g. STR_UNICODE=1 + BYTES_DECODE_ERRORS=0, or STR_UNICODE=1 + STR_UNICODE_CHECK=0. Because the new BYTES_DECODE_ERRORS defaults to the ROM level, the only way to hit 1/1/0 or 1/0/0 today is a manual CFLAGS override.

Actually I think this is probably OK the way it is now, especially now that ignore+replace are enabled as one flag. Most builds will only use the ROM level, and the niche combinations here are unusual enough that they're probably not worth adding more complexity for.

Will leave for @dpgeorge to take another look from here.

dpgeorge · 2026-05-13T00:48:57Z

@Josverl please can you rebase this on latest master to resolve the conflict.

Added feature detection at the start of bytes_decode_errors.py test to skip gracefully when decode method is not available (requires MICROPY_CPYTHON_COMPAT). This fixes test failures on minimal builds and Windows builds that may not have this feature enabled.

Test now:
- Checks if decode method exists before running tests
- Prints "SKIP" and exits cleanly if decode is not available
- Works correctly on both full-featured and minimal builds

Verified:
- Standard unix build: All tests pass (14 testcases)
- Minimal unix build: Test skips cleanly
- All bytes/bytearray/string tests pass (82 tests, 2191 testcases)

Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>