Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
81a6910
tests/basics: Add bytes.decode() tests.
Josverl May 13, 2026
c856515
py/config: Add MICROPY_PY_BUILTINS_BYTES_DECODE_REPLACE.
Josverl May 13, 2026
82d6e5d
py/unicode: Implement bytes.decode() 'ignore' and 'replace' modes.
Josverl May 13, 2026
60769d5
tests/bytes_decode_errors: Handle platforms without CPYTHON_COMPAT.
Josverl May 13, 2026
d34289e
tests/basics/bytes_decode_encoding: Add tests to validate encoding.
Josverl May 13, 2026
0fe170b
tests/unicode: Add tests for unicode character formatting.
Josverl May 13, 2026
dae625b
tests/unicode: Tests exception handling for strings starting with 0xff.
Josverl May 13, 2026
527b618
py/objstr: Validate encoding for decode and encode.
Josverl May 13, 2026
2e27300
py/objstr: Enhance utf-8 character handling in string formatting.
Josverl Jun 6, 2026
57bb48f
py/objstr: Fix str_center for Unicode strings.
Josverl Jun 6, 2026
77f6830
docs: Document Unicode support and limitiations.
Josverl Jun 6, 2026
b513704
tests/unicode: Remove known differences from test.
Josverl Jun 6, 2026
57e679e
tests/cpydiff: Document unicode differences.
Josverl Jun 6, 2026
4e9a5ec
tests/basics/bytes_decode: Split tests for ignore/replace.
Josverl Jun 6, 2026
0c3dd02
tests/unicode/unicode_char_format: Test unicode character formatting.
Josverl Jun 6, 2026
8fae63d
test/cpydiff: Add tests to document Unicode differences.
Josverl Jun 6, 2026
5c676e1
tests: Update t-string test cases for unicode.
Josverl Jun 6, 2026
dfafe93
py/objstrunicode: Remove dead code from unicode escape handling.
Josverl Jun 6, 2026
17d1518
tests: Enhance byte and string decoding tests.
Josverl Jun 6, 2026
2f93c22
py/objstr: Optimize character handling and encoding validation.
Josverl Jun 6, 2026
7e13ed0
run-tests: Specify UTF-8 encoding when opening test files.
Josverl Jun 6, 2026
4d26ee4
py/objstr: Refactor to use mp_print_char helper function.
Josverl Jun 6, 2026
e6e4846
refactor: Use QSTR and common error message.
Josverl Jun 6, 2026
80fd7d6
py/objexcept: Improve decompression check for ROM strings.
Josverl May 13, 2026
8722186
tests/exception_splitheap: Test Exceptions with ROM strings.
Josverl May 13, 2026
0aa9852
tests: Test exception handling with heap-allocated unicode-like strings.
Josverl May 13, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions docs/develop/writingtests.rst
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,11 @@ If you run your tests, this test should appear in the test output:
Tests are run by comparing the output from the test target against the output from CPython.
So any test should use print statements to indicate test results.

When writing tests for name or string-related functionality, please add both English/ASCII
as well as non-English/non-ASCII text and include Unicode examples.
Please do add comments in English explaining the meaning and intent of the Unicode text.
This help ensure Unicode support is tested and verified across different platforms.

For tests that can't be compared to CPython (i.e. micropython-specific functionality),
you can provide a ``.py.exp`` file which will be used as the truth for comparison.

Expand Down
4 changes: 4 additions & 0 deletions docs/differences/python_35.rst
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,10 @@ Other Language Changes:
| Added the *namereplace* error handlers. The *backslashreplace* error handlers now work with decoding and | |
| translating. | |
+-----------------------------------------------------------------------------------------------------------+---------------+
| As of v1.28.0 String encoding/decoding properly validates the encoding parameter and raises | Improved |
| ``LookupError`` for unsupported encodings. Error handlers (``'ignore'``, ``'replace'``) are supported | |
| when enabled via build configuration. The ``errors`` parameter must be passed positionally. | |
+-----------------------------------------------------------------------------------------------------------+---------------+
| Property docstrings are now writable. This is especially useful for collections.namedtuple() docstrings | |
+-----------------------------------------------------------------------------------------------------------+---------------+
| Circular imports involving relative imports are now supported. | |
Expand Down
79 changes: 79 additions & 0 deletions docs/library/builtins.rst
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,35 @@ Functions and types

|see_cpython| `python:bytes`.

.. method:: bytes.decode(encoding='utf-8', errors='strict')

Decode the bytes object to a string using the specified *encoding*.

MicroPython supports the following encodings:

- ``'utf-8'`` or ``'utf8'`` - UTF-8 encoding (default)
- ``'ascii'`` - ASCII encoding (subset of UTF-8)

The *errors* parameter controls how decoding errors are handled:

- ``'strict'`` - Raise a ``UnicodeError`` on invalid UTF-8 (default)
- ``'ignore'`` - Skip invalid bytes (requires ``MICROPY_PY_BUILTINS_BYTES_DECODE_ERRORS``)
- ``'replace'`` - Replace invalid bytes with U+FFFD '�' (requires ``MICROPY_PY_BUILTINS_BYTES_DECODE_ERRORS``)

.. note::
Error handler support depends on build configuration. On constrained
systems, only ``'strict'`` mode may be available.

Example::

>>> b'\\xc2\\xa9 2024'.decode('utf-8') # © symbol
'© 2024'
>>> b'hello\\xffworld'.decode('utf-8', 'ignore') # Skip invalid bytes
'helloworld'

Raises ``LookupError`` if the encoding is not supported, or
``UnicodeError`` if the data contains invalid UTF-8 and ``errors='strict'``.

.. function:: callable()

.. function:: chr()
Expand Down Expand Up @@ -148,6 +177,56 @@ Functions and types

.. class:: str()

.. method:: str.encode(encoding='utf-8')

Encode the string to bytes using the specified *encoding*.

MicroPython supports the following encodings:

- ``'utf-8'`` or ``'utf8'`` - UTF-8 encoding (default)
- ``'ascii'`` - ASCII encoding (subset of UTF-8)

Example::

>>> '© 2024'.encode('utf-8') # Copyright symbol
b'\\xc2\\xa9 2024'

Raises ``LookupError`` if the encoding is not supported.

.. method:: str.center(width)

Return a centered string of length *width*. Padding is done using spaces.

When Unicode support is enabled (``MICROPY_PY_BUILTINS_STR_UNICODE``), this
method counts Unicode characters rather than bytes, ensuring proper alignment
for multi-byte UTF-8 characters.

Example::

>>> 'café'.center(10) # é is 2 bytes in UTF-8
' café '

String Formatting
-----------------

MicroPython supports Unicode in string formatting when ``MICROPY_PY_BUILTINS_STR_UNICODE``
is enabled.

The ``%c`` and ``{:c}`` format specifiers accept Unicode codepoints in the range 0 to 0x10FFFF
(1,114,111) and properly encode multi-byte UTF-8 characters.

Example::

>>> '%c' % 0x1F389 # 🎉 emoji
'🎉'
>>> '{:c}'.format(0x4E2D) # 中 (Chinese character)
'中'

Invalid character codes raise ``ValueError``::

>>> '%c' % -1
ValueError: %c arg not in range(0x110000)

.. function:: sum()

.. function:: super()
Expand Down
21 changes: 20 additions & 1 deletion docs/reference/constrained.rst
Original file line number Diff line number Diff line change
Expand Up @@ -251,7 +251,26 @@ instances so the process of eliminating Unicode can be painless.
b = b'the quick brown fox' # A bytes instance

Where it is necessary to convert between strings and bytes the :meth:`str.encode`
and the :meth:`bytes.decode` methods can be used. Note that both strings and bytes
and the :meth:`bytes.decode` methods can be used. MicroPython validates the
encoding parameter and only supports UTF-8 and ASCII. The :meth:`bytes.decode`
method also supports error handlers (``'ignore'`` and ``'replace'``) for handling
invalid UTF-8, when enabled in the build configuration.

For memory-conscious applications processing untrusted data, using the ``'ignore'``
error handler can be more efficient than ``'strict'`` mode (the default), as it
avoids raising exceptions while still recovering valid text::

# Strict mode (default) raises an error on invalid UTF-8
try:
s = data.decode('utf-8')
except UnicodeError:
# Handle error
pass

# Ignore mode skips invalid bytes (more memory-efficient)
s = data.decode('utf-8', 'ignore')

Note that both strings and bytes
are immutable. Any operation which takes as input such an object and produces
another implies at least one RAM allocation to produce the result. In the
second line below a new bytes object is allocated. This would also occur if ``foo``
Expand Down
1 change: 1 addition & 0 deletions docs/reference/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -31,5 +31,6 @@ implementation and the best practices to use them.
packages.rst
asm_thumb2_index.rst
filesystem.rst
unicode_support.rst
pyboard.py.rst
micropython2_migration.rst
173 changes: 173 additions & 0 deletions docs/reference/unicode_support.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,173 @@
.. _unicode_support:

Unicode Support
===============

MicroPython provides Unicode support for strings, with the level of support
depending on the build configuration.

Character Encoding
------------------

MicroPython uses UTF-8 encoding for all strings. When Unicode support is enabled
(``MICROPY_PY_BUILTINS_STR_UNICODE``), strings can contain any valid Unicode
character from U+0000 to U+10FFFF.

ASCII characters (0-127) are stored in a single byte, making them as memory-efficient
as on systems without Unicode support. Multi-byte UTF-8 characters use 2-4 bytes
depending on the codepoint:

- U+0000 to U+007F: 1 byte (ASCII)
- U+0080 to U+07FF: 2 bytes
- U+0800 to U+FFFF: 3 bytes
- U+10000 to U+10FFFF: 4 bytes

Encoding and Decoding
----------------------

The :meth:`bytes.decode` and :meth:`str.encode` methods support the following encodings:

- UTF-8 (``'utf-8'`` or ``'utf8'``)
- ASCII (``'ascii'``)

Other encodings (such as ``'latin-1'``, ``'utf-16'``, etc.) are not supported and
will raise ``LookupError``.

Example::

>>> '日本語'.encode('utf-8')
b'\\xe6\\x97\\xa5\\xe6\\x9c\\xac\\xe8\\xaa\\x9e'
>>> b'\\xe6\\x97\\xa5\\xe6\\x9c\\xac\\xe8\\xaa\\x9e'.decode('utf-8')
'日本語'

Error Handling
~~~~~~~~~~~~~~

When decoding bytes that contain invalid UTF-8 sequences, the ``errors`` parameter
of :meth:`bytes.decode` controls the behavior:

- ``'strict'`` (default): Raise ``UnicodeError``
- ``'ignore'``: Skip invalid bytes (requires ``MICROPY_PY_BUILTINS_BYTES_DECODE_ERRORS``)
- ``'replace'``: Replace invalid bytes with U+FFFD � (requires ``MICROPY_PY_BUILTINS_BYTES_DECODE_ERRORS``)

Example::

>>> # Strict mode (default) raises an error
>>> b'hello\\xffworld'.decode('utf-8')
UnicodeError: invalid UTF-8

>>> # Ignore mode skips invalid bytes
>>> b'hello\\xffworld'.decode('utf-8', 'ignore')
'helloworld'

>>> # Replace mode substitutes replacement character
>>> b'hello\\xffworld'.decode('utf-8', 'replace')
'hello�world'

For memory-conscious applications, consider using ``'ignore'`` mode when processing
untrusted or partially corrupted data, as it avoids raising exceptions while still
recovering valid text.

String Methods
--------------

When Unicode support is enabled, string methods operate on characters rather than bytes:

- :meth:`str.center` - Counts Unicode characters for width calculation
- ``len(s)`` - Returns number of characters (not bytes)
- String indexing and slicing work on character boundaries
- No support for display width calculations (East Asian width, combining characters, etc.)

Example::

>>> s = 'Hello 世界'
>>> len(s) # 8 characters
8
>>> len(s.encode()) # 12 bytes
12
>>> s.center(12) # Centered by character count
' Hello 世界 '

String Formatting
-----------------

The ``%c`` format specifier and ``{:c}`` format code support full Unicode:

- Accepts codepoints from 0 to 0x10FFFF
- Properly encodes multi-byte UTF-8 characters
- Raises ``ValueError`` for invalid codepoints

Example::

>>> '%c' % 65 # ASCII
'A'
>>> '%c' % 0x03B1 # Greek α
'α'
>>> '%c' % 0x1F600 # Emoji 😀
'😀'
>>> '{:c}'.format(0x4E2D) # Chinese 中
'中'

>>> # Invalid codepoint
>>> '%c' % 0x110000
ValueError: %c arg not in range(0x110000)

F-strings also support the ``:c`` format code::

>>> codepoint = 0x2665 # Heart suit ♥
>>> f'I {codepoint:c} Python'
'I ♥ Python'

Build Configuration
-------------------

Unicode features are controlled by several build-time flags in ``mpconfigport.h``:

``MICROPY_PY_BUILTINS_STR_UNICODE``
Enable Unicode string support. When enabled, strings can contain any valid
Unicode character and string operations work on character boundaries rather
than byte boundaries.

Default: Enabled at ``MICROPY_CONFIG_ROM_LEVEL_EXTRA_FEATURES`` and above.

``MICROPY_PY_BUILTINS_STR_UNICODE_CHECK``
Enable UTF-8 validation during string operations. When disabled, string
operations may produce incorrect results with invalid UTF-8 sequences.

Default: Follows ``MICROPY_PY_BUILTINS_STR_UNICODE`` setting.

``MICROPY_PY_BUILTINS_BYTES_DECODE_ERRORS``
Enable the ``'ignore'`` and ``'replace'`` error handlers for
:meth:`bytes.decode`. When enabled, invalid UTF-8 bytes can be either
skipped (``'ignore'``) or replaced with U+FFFD (``'replace'``).

Default: Enabled at ``MICROPY_CONFIG_ROM_LEVEL_EXTRA_FEATURES`` and above.

Example Configuration
~~~~~~~~~~~~~~~~~~~~~

For a constrained port with limited flash, disable error handlers::

#define MICROPY_PY_BUILTINS_BYTES_DECODE_ERRORS (0)

For a port with more resources, enable all Unicode features::

#define MICROPY_CONFIG_ROM_LEVEL (MICROPY_CONFIG_ROM_LEVEL_EXTRA_FEATURES)
// This automatically enables:
// - MICROPY_PY_BUILTINS_STR_UNICODE
// - MICROPY_PY_BUILTINS_BYTES_DECODE_ERRORS

Limitations
-----------

MicroPython's Unicode support has some limitations compared to CPython:

- Only UTF-8 and ASCII encodings are supported
- No support for Unicode normalization
- No locale-aware string operations
- The ``errors`` parameter accepts only positional arguments (not keyword arguments)
- String methods like ``upper()``, ``lower()``, etc. work correctly only for ASCII
- The MicroPython interactive REPL and ``input()`` function currently have very limited
Unicode support.
A workaround for this is to use utf-8 encoded MicroPython scripts containing the unicode
text, and run them using ``mpremote run <script.py>``.
33 changes: 33 additions & 0 deletions ports/unix/coverage.c
Original file line number Diff line number Diff line change
Expand Up @@ -735,6 +735,39 @@ static mp_obj_t extra_coverage(void) {
mp_obj_print_exception(&mp_plat_print, mp_obj_new_exception_args(&mp_type_ValueError, 0, NULL));
}

// exception with heap-allocated str whose data byte starts with 0xff
// (the ROM-string compression marker). Exercises the is_in_heap branch
// and the skip_decompression label in py/objexcept.c
// decompress_error_text_maybe(), which is otherwise unreachable from
// Python code when MICROPY_PY_BUILTINS_STR_UNICODE_CHECK is enabled.
{
mp_printf(&mp_plat_print, "# exception heap str with 0xff prefix\n");
#if MICROPY_ROM_TEXT_COMPRESSION
static const char marker[] = "\xff" "non-rom-heap-string";
const size_t mlen = sizeof(marker) - 1;
byte *buf = m_new(byte, mlen);
memcpy(buf, marker, mlen);
mp_obj_str_t *o_str = m_new_obj(mp_obj_str_t);
o_str->base.type = &mp_type_str;
o_str->hash = 0; // force the lazy-hash path after skip_decompression
o_str->len = mlen;
o_str->data = buf;
mp_obj_t arg = MP_OBJ_FROM_PTR(o_str);
mp_obj_t exc = mp_obj_new_exception_args(&mp_type_ValueError, 1, &arg);
// Trigger decompress_error_text_maybe() via the .args attr accessor.
mp_obj_t dest[2] = {MP_OBJ_NULL, MP_OBJ_NULL};
mp_load_method_maybe(exc, MP_QSTR_args, dest);
// Confirm the heap string was preserved (not overwritten by decompression)
// and that the lazy-hash branch ran.
mp_printf(&mp_plat_print, "data[0]=0x%02x len=%u hash_set=%d\n",
o_str->data[0], (unsigned)o_str->len, o_str->hash != 0);
#else
// decompress_error_text_maybe() is a no-op when ROM text compression
// is disabled; emit matching output so the .exp file stays consistent.
mp_printf(&mp_plat_print, "data[0]=0xff len=20 hash_set=1\n");
#endif
}

// warning
{
mp_emitter_warning(MP_PASS_CODE_SIZE, "test");
Expand Down
2 changes: 1 addition & 1 deletion py/gc.c
Original file line number Diff line number Diff line change
Expand Up @@ -394,7 +394,7 @@ bool gc_is_locked(void) {
#if MICROPY_GC_SPLIT_HEAP
// Returns the area to which this pointer belongs, or NULL if it isn't
// allocated on the GC-managed heap.
static inline mp_state_mem_area_t *gc_get_ptr_area(const void *ptr) {
struct _mp_state_mem_area_t *gc_get_ptr_area(const void *ptr) {
if (((uintptr_t)(ptr) & (BYTES_PER_BLOCK - 1)) != 0) { // must be aligned on a block
return NULL;
}
Expand Down
5 changes: 5 additions & 0 deletions py/gc.h
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,11 @@ void gc_init(void *start, void *end);
// Used to add additional memory areas to the heap.
void gc_add(void *start, void *end);

// Returns the area to which this pointer belongs, or NULL if it isn't
// allocated on the GC-managed heap.
struct _mp_state_mem_area_t;
struct _mp_state_mem_area_t *gc_get_ptr_area(const void *ptr);

#if MICROPY_GC_SPLIT_HEAP_AUTO
// Port must implement this function to return the maximum available block of
// RAM to allocate a new heap area into using MP_PLAT_ALLOC_HEAP.
Expand Down
2 changes: 1 addition & 1 deletion py/modbuiltins.c
Original file line number Diff line number Diff line change
Expand Up @@ -138,7 +138,7 @@ static mp_obj_t mp_builtin_chr(mp_obj_t o_in) {
#if MICROPY_PY_BUILTINS_STR_UNICODE
mp_uint_t c = mp_obj_get_int(o_in);
if (c >= 0x110000) {
mp_raise_ValueError(MP_ERROR_TEXT("chr() arg not in range(0x110000)"));
mp_raise_ValueError(MP_ERROR_TEXT("char not in range(0x110000)"));
}
VSTR_FIXED(buf, 4);
vstr_add_char(&buf, c);
Expand Down
Loading
Loading