Skip to content

[mypyc] Add librt.strings.isidentifier codepoint primitive#21522

Open
VaggelisD wants to merge 3 commits into
python:masterfrom
VaggelisD:librt-strings-isidentifier
Open

[mypyc] Add librt.strings.isidentifier codepoint primitive#21522
VaggelisD wants to merge 3 commits into
python:masterfrom
VaggelisD:librt-strings-isidentifier

Conversation

@VaggelisD
Copy link
Copy Markdown
Contributor

5th PR of #21418

True if a codepoint can start a valid identifier (XID_Start, per
PEP 3131). The ASCII fast path covers `[A-Za-z_]` inline; non-ASCII
codepoints round-trip through PyUnicode_FromOrdinal +
PyUnicode_IsIdentifier so the answer matches str.isidentifier on a
1-character string.

The non-ASCII path is the first allocating helper in this series, so
its body lives out-of-line in codepoint_extra_ops.c (it would otherwise
be emitted as a separate copy in every translation unit that includes
the header). On OOM it swallows the exception via PyErr_Clear() and
returns False, which keeps the function ERR_NEVER. Documented at the
call site so callers don't get a surprising silent failure.

Stack: depends on the librt.strings.isspace primitive.
@VaggelisD VaggelisD force-pushed the librt-strings-isidentifier branch from add1d44 to 283b2f8 Compare May 21, 2026 07:42
@github-actions

This comment has been minimized.

Comment thread mypyc/lib-rt/codepoint_extra_ops.c Outdated
Comment on lines +11 to +16
if (s == NULL) {
// OOM. Swallow and return false to keep the function ERR_NEVER;
// callers expect a defined answer, not a propagated exception.
PyErr_Clear();
return false;
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually call CPyError_OutOfMemory() in this case to abort, i think it would make sense here as well.

Comment thread mypyc/build.py Outdated
Comment on lines +57 to +62
ModDesc(
"librt.strings",
["strings/librt_strings.c", "codepoint_extra_ops.c"],
["codepoint_extra_ops.h"],
["strings"],
),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for not catching this earlier but it looks like the files are dependent on each other the wrong way. the _extra_ops files should be internal to mypyc and shouldn't be needed when building librt modules.

so instead of including codepoint_extra_ops in librt_strings we need it the other way around, with the common functionality in librt_strings. like it's done with the other _extra_ops files.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, that's on me first!

Copy link
Copy Markdown
Collaborator

@p-sawicki p-sawicki May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! i see there's still #include "codepoint_extra_ops.h" in librt_strings.c and it would be good to clean that up by moving the functions from there to librt_strings.h.

that would make the codepoint_extra_ops files empty so maybe they aren't actually needed and we could just compile the functions statically straight from librt_strings.h? unless you're planning on adding functions later that would be better placed outside of librt_strings.h.

Copy link
Copy Markdown
Contributor Author

@VaggelisD VaggelisD May 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was planning on using it to add functions like toupper(), tolower() etc so that they're separated from librt_strings, what do you think? Worth the complexity?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would there be separate implementations used by the python wrapper in librt_strings and when compiled statically with mypyc? in that case sure, i think it'd make sense to have them in codepoint_extra_ops.

but if they're more similar to the functions added so far where the python wrapper calls the same function that mypyc could generate in the C file then they would have to be in librt_strings. it would be cleaner to keep the separation between the librt modules and C files relevant only in mypyc generated code by not including them by the librt module files.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, they'd be the same function afaict. Do we worry about the librt capsule indirection overhead? I assume these codepoint primitives will participate in hot loops so preferably they'd be inlined instead of going through the librt fp array (?)

I see the overall point though, there's no mypyc callers so there's no real point in having these outside of librt_strings.

- codepoint_extra_ops.h: include CPy.h and move the isidentifier slow
  path inline into LibRTStrings_IsIdentifier. Aborts via
  CPyError_OutOfMemory on allocation failure (the helper is ERR_NEVER,
  so returning a silently-wrong bool under memory pressure was the wrong
  contract). Matches the pattern in the sibling _extra_ops.h files (all
  bodies static inline, CPy.h included for runtime helpers).
- codepoint_extra_ops.c: reduce to a single-line shim that #includes the
  header. Exists only so SourceDep("codepoint_extra_ops.c") pulls the
  header into user mypyc extensions in include_runtime_files mode.
- build.py / lib-rt/setup.py: drop codepoint_extra_ops.c from the
  librt.strings module sources. The _extra_ops.c files are mypyc-internal
  (linked into user extensions via SourceDep in mypyc/ir/deps.py); the
  librt.strings Python module shouldn't depend on them, matching how
  bytes_extra_ops, str_extra_ops, etc. are organized. librt.strings now
  picks up LibRTStrings_IsIdentifier via #include of the header.
@github-actions

This comment has been minimized.

Per follow-up review on python#21522, the codepoint classifiers belong with
the rest of the librt.strings public surface rather than in a shared
inline header, since they implement Python-visible librt.strings
functions (not mypyc-internal codegen helpers like the other _extra_ops
files). Move them out of codepoint_extra_ops.h and into librt_strings.c
as proper non-static functions, exposed to mypyc-compiled callers via
the capsule API the same way every other LibRTStrings_* function works.

This keeps the librt module files independent of mypyc-internal
_extra_ops headers, matching the pattern used by BytesWriter_internal
etc. The cost is one indirect call per use vs. the previous inlined
macro; still substantially faster than the Python method dispatch the
primitives are replacing.

- librt_strings.h: bump LIBRT_STRINGS_API_VERSION 4->5, LIBRT_STRINGS_API_LEN
  14->19.
- librt_strings_api.h: add 5 macro entries for LibRTStrings_API[14..18].
- librt_strings.c: define the 5 helpers; register them in the capsule
  table; drop `#include "codepoint_extra_ops.h"`.
- mypyc/ir/deps.py: delete CODEPOINT_EXTRA_OPS.
- mypyc/primitives/librt_strings_ops.py: drop the CODEPOINT_EXTRA_OPS
  dep from the five codepoint primitives.
- Delete codepoint_extra_ops.{c,h}.
@github-actions
Copy link
Copy Markdown
Contributor

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants