Regular expressions shipped as a .wasm plugin. Scripts see re as ordinary.
from re import search, sub, findall
print(search(r'(\d+)-(\d+)', 'order 12-34 shipped')) # 12-34
print(sub(r'\s+', '_', 'a b c')) # a_b_c
print(findall(r'\w+', 'one two three')) # ['one', 'two', 'three']A small backtracking engine, around 1k lines of Rust, no automaton or bytecode. It is Unicode aware without shipping Unicode tables: \d, \w, \s, and (?i) evaluate through the standard library char predicates at match time. The match runs over codepoints, so every offset is a codepoint index that lines up with how the runtime indexes strings.
The tradeoff is the backtracking tradeoff: no linear time guarantee. Instead of hanging, a step budget proportional to the input length watches for degradation. When a pattern blows the budget the call raises naming catastrophic backtracking, so the author can rewrite it.
All functions take the pattern first. Flags are written inline, (?i) ignorecase, (?s) dotall, (?m) multiline.
| Function | Returns |
|---|---|
match(pattern, string) |
matched text anchored at the start, or None |
search(pattern, string) |
first matched text anywhere, or None |
fullmatch(pattern, string) |
matched text if it spans the whole string, or None |
findall(pattern, string) |
list of matches, or list of groups when the pattern has groups |
groups(pattern, string) |
capture groups of the first match as a list, or None |
span(pattern, string) |
[start, end] codepoint offsets of the first match, or None |
sub(pattern, repl, string) |
string with every match replaced, \1 and \g<name> expand groups |
print(groups(r'(\w+)@(\w+)', 'a@b')) # ['a', 'b']
print(span(r'\d+', 'áé123')) # [2, 5], codepoint offsetsPatterns are ordinary strings. Prefer raw strings (r'...') so backslashes reach the engine intact.
Characters: . any char (also newline under (?s)); \d \w \s digit/word/whitespace (Unicode aware) and negations \D \W \S; [abc] [a-z] set or range; [^abc] negated set; \n \t \xhh \uhhhh escapes.
Anchors: ^ $ start and end (per line under (?m)); \b \B word boundary and non-boundary.
Quantifiers: * + ?; {m} {m,n} {m,}; trailing ? for the lazy form (*?, +?).
Groups and references: (...) capturing; (?:...) non-capturing; (?P<name>...) named; \1 (?P=name) backreference; a|b alternation.
Lookaround and flags: (?=...) (?!...) lookahead; (?<=...) (?<!...) lookbehind (fixed width only); (?i) (?s) (?m) inline flags.
Unicode property classes \p{...}, named character escapes \N{...}, atomic groups, possessive quantifiers, conditional patterns, scoped inline flags (?i:...), variable width lookbehind. \d follows the Unicode numeric property, slightly wider than the CPython decimal set. Invalid patterns raise ValueError.
A step budget proportional to the input watches the work; when a pattern explodes the budget the engine raises instead of hanging:
error: catastrophic backtracking: O(n^2) time or worse on this input, simplify nested quantifiers
--> <input>:1:8
|
1 | search(r'(a+)+$', user_input)
| ^^^^^^^^^
The nested (a+)+ lets the engine split a run of a countless ways before failing at $; the flat form a+$ matches the same text in linear time. This raises RuntimeError, not ValueError: a malformed pattern is a value problem, but a valid pattern that exhausts its budget is a runtime one (the same split engines like .NET draw with RegexMatchTimeoutException).
Compiles to wasm32-unknown-unknown (cdylib) against the wasm-pdk v0.1.0 ABI. match is a hand-written export (it is a Rust keyword); the rest go through #[plugin_fn]. Results cross the handle ABI as primitives and lists; plugin memory recycles per call through a static pool, like json.
MIT OR Apache-2.0