[mypyc] Add librt.strings.isidentifier codepoint primitive#21522
[mypyc] Add librt.strings.isidentifier codepoint primitive#21522VaggelisD wants to merge 3 commits into
librt.strings.isidentifier codepoint primitive#21522Conversation
True if a codepoint can start a valid identifier (XID_Start, per PEP 3131). The ASCII fast path covers `[A-Za-z_]` inline; non-ASCII codepoints round-trip through PyUnicode_FromOrdinal + PyUnicode_IsIdentifier so the answer matches str.isidentifier on a 1-character string. The non-ASCII path is the first allocating helper in this series, so its body lives out-of-line in codepoint_extra_ops.c (it would otherwise be emitted as a separate copy in every translation unit that includes the header). On OOM it swallows the exception via PyErr_Clear() and returns False, which keeps the function ERR_NEVER. Documented at the call site so callers don't get a surprising silent failure. Stack: depends on the librt.strings.isspace primitive.
add1d44 to
283b2f8
Compare
This comment has been minimized.
This comment has been minimized.
| if (s == NULL) { | ||
| // OOM. Swallow and return false to keep the function ERR_NEVER; | ||
| // callers expect a defined answer, not a propagated exception. | ||
| PyErr_Clear(); | ||
| return false; | ||
| } |
There was a problem hiding this comment.
we usually call CPyError_OutOfMemory() in this case to abort, i think it would make sense here as well.
| ModDesc( | ||
| "librt.strings", | ||
| ["strings/librt_strings.c", "codepoint_extra_ops.c"], | ||
| ["codepoint_extra_ops.h"], | ||
| ["strings"], | ||
| ), |
There was a problem hiding this comment.
sorry for not catching this earlier but it looks like the files are dependent on each other the wrong way. the _extra_ops files should be internal to mypyc and shouldn't be needed when building librt modules.
so instead of including codepoint_extra_ops in librt_strings we need it the other way around, with the common functionality in librt_strings. like it's done with the other _extra_ops files.
There was a problem hiding this comment.
No worries, that's on me first!
There was a problem hiding this comment.
thanks! i see there's still #include "codepoint_extra_ops.h" in librt_strings.c and it would be good to clean that up by moving the functions from there to librt_strings.h.
that would make the codepoint_extra_ops files empty so maybe they aren't actually needed and we could just compile the functions statically straight from librt_strings.h? unless you're planning on adding functions later that would be better placed outside of librt_strings.h.
There was a problem hiding this comment.
I was planning on using it to add functions like toupper(), tolower() etc so that they're separated from librt_strings, what do you think? Worth the complexity?
There was a problem hiding this comment.
would there be separate implementations used by the python wrapper in librt_strings and when compiled statically with mypyc? in that case sure, i think it'd make sense to have them in codepoint_extra_ops.
but if they're more similar to the functions added so far where the python wrapper calls the same function that mypyc could generate in the C file then they would have to be in librt_strings. it would be cleaner to keep the separation between the librt modules and C files relevant only in mypyc generated code by not including them by the librt module files.
There was a problem hiding this comment.
I don't think so, they'd be the same function afaict. Do we worry about the librt capsule indirection overhead? I assume these codepoint primitives will participate in hot loops so preferably they'd be inlined instead of going through the librt fp array (?)
I see the overall point though, there's no mypyc callers so there's no real point in having these outside of librt_strings.
- codepoint_extra_ops.h: include CPy.h and move the isidentifier slow
path inline into LibRTStrings_IsIdentifier. Aborts via
CPyError_OutOfMemory on allocation failure (the helper is ERR_NEVER,
so returning a silently-wrong bool under memory pressure was the wrong
contract). Matches the pattern in the sibling _extra_ops.h files (all
bodies static inline, CPy.h included for runtime helpers).
- codepoint_extra_ops.c: reduce to a single-line shim that #includes the
header. Exists only so SourceDep("codepoint_extra_ops.c") pulls the
header into user mypyc extensions in include_runtime_files mode.
- build.py / lib-rt/setup.py: drop codepoint_extra_ops.c from the
librt.strings module sources. The _extra_ops.c files are mypyc-internal
(linked into user extensions via SourceDep in mypyc/ir/deps.py); the
librt.strings Python module shouldn't depend on them, matching how
bytes_extra_ops, str_extra_ops, etc. are organized. librt.strings now
picks up LibRTStrings_IsIdentifier via #include of the header.
This comment has been minimized.
This comment has been minimized.
Per follow-up review on python#21522, the codepoint classifiers belong with the rest of the librt.strings public surface rather than in a shared inline header, since they implement Python-visible librt.strings functions (not mypyc-internal codegen helpers like the other _extra_ops files). Move them out of codepoint_extra_ops.h and into librt_strings.c as proper non-static functions, exposed to mypyc-compiled callers via the capsule API the same way every other LibRTStrings_* function works. This keeps the librt module files independent of mypyc-internal _extra_ops headers, matching the pattern used by BytesWriter_internal etc. The cost is one indirect call per use vs. the previous inlined macro; still substantially faster than the Python method dispatch the primitives are replacing. - librt_strings.h: bump LIBRT_STRINGS_API_VERSION 4->5, LIBRT_STRINGS_API_LEN 14->19. - librt_strings_api.h: add 5 macro entries for LibRTStrings_API[14..18]. - librt_strings.c: define the 5 helpers; register them in the capsule table; drop `#include "codepoint_extra_ops.h"`. - mypyc/ir/deps.py: delete CODEPOINT_EXTRA_OPS. - mypyc/primitives/librt_strings_ops.py: drop the CODEPOINT_EXTRA_OPS dep from the five codepoint primitives. - Delete codepoint_extra_ops.{c,h}.
|
According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅ |
5th PR of #21418