Improving Unicode support in MicroPython. #18854
@Josverl - the only changes I see are in the test suite? If you have a fork elsewhere, I can test any updated MQTT code. Over the weekend, I figured out that coercing the parameters of umqtt.publish to bytestrings works around the issue, so if you have a fix for Unicode handling, that's likely the right direction. Looking at the umqtt library, it looked like type validation in the methods would highlight the type mismatches and throw better errors, or ....
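The workaround mentioned above (coercing the publish parameters to bytestrings) might look like the following sketch. The helper name `to_bytes` is hypothetical and not part of umqtt, and the commented-out call assumes an already connected `umqtt.simple.MQTTClient` instance named `client`.

```python
def to_bytes(value, encoding="utf-8"):
    # Hypothetical helper: encode str input, pass bytes-like input through.
    if isinstance(value, str):
        return value.encode(encoding)
    return bytes(value)


topic = to_bytes("sensors/temp\u00e9rature")
payload = to_bytes("21.5 \u00b0C")

# Assumption: `client` is a connected umqtt.simple.MQTTClient.
# client.publish(topic, payload)
print(topic, payload)
```

Encoding once at the call site keeps the non-ASCII text out of the library's internal string handling, which is presumably why the coercion avoids the problem.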
|
@tpbrisco, adding static typing can help spot errors, but that still assumes that the runtime can handle UTF-8.
|
Strangely, coverage is failing on a test that was only introduced after the base of this PR.
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@           Coverage Diff           @@
##           master   #18854   +/-   ##
=======================================
  Coverage   98.47%   98.48%
=======================================
  Files         176      176
  Lines       22845    22900   +55
=======================================
+ Hits        22497    22553   +56
+ Misses        348      347    -1
|
f2bbd3f to 39688d1
projectgus
left a comment
This is really impressive, @Josverl, and I'm also impressed at the amount of test coverage added. It would be great to have this additional unicode support in MicroPython.
I have a few relatively minor comments inline, just small things.
The one major question I have is about build/test coverage - how many permutations of the new config items are we building & testing now, and is that enough?
ed4c42c to fb2ddb1
😁 Thanks, TDD really does work if the problem is clear. However, now I really can't let that last uncovered line slip by, so I'm adding one more commit.
TL;DR: Looking at the following 3 most relevant (I think) options:
So effectively the current test suite only tests all-on and all-off, and a few ports that always enable
Additional tests with board variations that "mix things up", with the test suite run against them, would guard against regressions in individual feature enablement. See unicode_testing for initial experiments, based on top of this PR.
Unicode test matrix
🪄 AI Research: Config switches touched / referenced
All three currently track the ROM level, so in practice they're either all-on or all-off on any given build; no port flips them independently.
Where each state is built and tested in CI
Mapping ci.sh +
run-tests.py auto-detects
What the suite actually exercises
Suggested CI improvements
The minimum useful addition is item #1: a few extra unix builds (~1-2 minutes of CI each) cover the previously untested permutations introduced or affected by this PR.
|
Another question is whether the feature's ROM level should be aligned with PR #19168 and be enabled by default following Unicode, with the ability to override it to allow fine-tuning.
|
Actually I think this is probably OK the way it is now, especially now that ignore+replace are enabled as one flag. Most builds will only use the ROM level, and the niche combinations here are unusual enough that they're probably not worth adding more complexity for. Will leave for @dpgeorge to take another look from here.
|
@Josverl please can you rebase this on latest master to resolve the conflict.
Added feature detection at the start of the bytes_decode_errors.py test to skip gracefully when the decode method is not available (requires MICROPY_CPYTHON_COMPAT). This fixes test failures on minimal builds and Windows builds that may not have this feature enabled.
Test now:
- Checks if the decode method exists before running tests
- Prints "SKIP" and exits cleanly if decode is not available
- Works correctly on both full-featured and minimal builds
Verified:
- Standard unix build: All tests pass (14 testcases)
- Minimal unix build: Test skips cleanly
- All bytes/bytearray/string tests pass (82 tests, 2191 testcases)
Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
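The feature detection described in this commit message could be sketched roughly as follows. This is an illustration of the skip pattern, not the actual contents of bytes_decode_errors.py; the helper name `has_decode` is made up for the example.

```python
def has_decode():
    # Probe for the method instead of assuming the build provides it.
    try:
        bytes.decode
    except AttributeError:
        return False
    return True


if not has_decode():
    # A lone "SKIP" line tells the test runner to mark the test as skipped.
    print("SKIP")
    raise SystemExit

# The real test body would follow here; one trivial check as a placeholder.
print(b"abc".decode("utf-8"))
```

Probing the attribute at the top of the file keeps the rest of the test free of per-build conditionals.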
|
@Josverl I'm working my way through this PR. I really think the changes that fix #17855 (all changes to |
dpgeorge
left a comment
Thanks for updating. I've made some more suggestions on ways to reduce code size.
Only accepts `utf-8`, `utf8` or `ascii` Fixes micropython#15849 Signed-off-by: Jos Verlinde <Jos_Verlinde@hotmail.com>
Fixes: issue 3364 Fixes: issue 13084 Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
Fixes Issue 17827 Signed-off-by: Jos Verlinde <Jos_Verlinde@hotmail.com>
Prevent the test from failing by not testing known unsupported characters. These will be documented in a cpydiff test. Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
This allows simpler skipping of tests based on enabled capabilities. Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
Removed the dead \U%08x branch in uni_print_quoted. Characters ≥ 0x110000 are impossible in valid UTF-8, so the branch was unreachable. It's replaced by a single else that handles surrogates (0xD800–0xDFFF) with \u%04x. Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
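As a rough Python model of the reasoning in this commit message (not the C code of uni_print_quoted), the sketch below shows why the wide-escape branch cannot be reached and how a surrogate would be formatted with `\u%04x`. The constant and function names are hypothetical.

```python
MAX_CODE_POINT = 0x10FFFF  # largest code point that UTF-8 can encode


def escape_surrogate(cp):
    # Surrogates occupy 0xD800-0xDFFF and fit in four hex digits.
    if not 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a surrogate code point")
    return "\\u%04x" % cp


# Anything at or above 0x110000 lies outside the Unicode range,
# so a branch for such values would never run on valid input.
print(0x110000 > MAX_CODE_POINT)
print(escape_surrogate(0xD800))
```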
Added multi-byte sequences to improve test coverage. Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
+ Correct a few typos in comments. Signed-off-by: Jos Verlinde <jos_verlinde@hotmail.com>
|
Summary
Improving Unicode support in MicroPython is not just a technical improvement—it's a commitment to global accessibility, educational equity, and inclusive computing. With a limited memory cost of ~0.05%, this enhancement removes barriers for users worldwide and aligns MicroPython with CPython compatibility goals while staying true to its mission of bringing Python to everyone, everywhere. PEP 3131 (2007) established Unicode identifiers as a Python standard, UTF-8 is the universal encoding for modern software development, and MicroPython should support these standards where possible.
This pull request addresses multiple issues related to the handling of Unicode characters and encoding in MicroPython.
- `bytes.decode()` method: ensures only valid encodings are accepted, specifically `utf-8`, `utf8`, and `ascii`.
- `bytes.decode()`: fixes issues where invalid encodings did not raise appropriate exceptions. The API is now CPython compatible, except that it does not accept kwargs.
- `str.center()` method: accurately counts Unicode characters instead of bytes, ensuring proper padding for multi-byte characters.
Fixes: #15849
Fixes: #3364
Fixes: #17827
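To illustrate the two behaviours listed above, here is a small sketch written against CPython as the reference. `ACCEPTED_ENCODINGS` and `is_accepted_encoding` are hypothetical names that only model the accepted-name list from the description; exact, case-sensitive matching is an assumption, and the exception MicroPython raises for other names is not shown because the description does not state it.

```python
ACCEPTED_ENCODINGS = ("utf-8", "utf8", "ascii")


def is_accepted_encoding(name):
    # Assumption: names are compared exactly as listed in the description.
    return name in ACCEPTED_ENCODINGS


text = "\u00e9"  # one character, two bytes in UTF-8
char_count = len(text)
byte_count = len(text.encode("utf-8"))
print(char_count, byte_count)

# Padding based on characters gives a result that is 5 characters wide.
centered = text.center(5, "*")
print(centered, len(centered))
```

Padding computed from the byte count instead would come out one character short here, which is the kind of mismatch the `str.center()` item refers to.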
Related PRs:
This is a re-submit of #18670 which was unrecoverably closed due to user error.
Testing
Testing has been conducted across various platforms, including ESP32, RP2, and Unix, with all relevant tests passing successfully. This includes new tests for encoding validation, error handling, and Unicode character formatting.
As Unicode offers a very large set of codepoints, I have based the testing on Unicode test data in 127 languages and script combinations that are available in unicode_mpy.
Using this test set has allowed me to find additional issues that were not yet reported.
The new tests that have been added to the MicroPython test suite are based on the examples provided in issues, and on issues found through this test set.
Trade-offs and Alternatives
- `bytes.decode()` does not accept kwargs, to reduce firmware size.
- The `str.center()` method does not currently handle the width correctly for multi-byte Unicode characters.
These issues can be addressed in future PRs.
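In practice, the kwargs limitation means passing the encoding and the error handler positionally. The sketch below shows the reference CPython results; it assumes a MicroPython build where the ignore/replace handlers mentioned earlier in the conversation are enabled.

```python
data = b"caf\xc3\xa9 \xff"  # valid UTF-8 followed by one invalid byte

# Positional form: bytes.decode(encoding, errors)
ignored = data.decode("utf-8", "ignore")
replaced = data.decode("utf-8", "replace")
print(repr(ignored), repr(replaced))

# Keyword form, which CPython accepts but this PR deliberately leaves out:
# data.decode(encoding="utf-8", errors="ignore")
```

Code that has to run on both CPython and MicroPython can use the positional form everywhere, since CPython accepts it as well.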