Overview
The release build is less than 200 KB on wasm32-unknown-unknown (panic=abort, opt-level=z, lto=true, codegen-units=1). The pipeline: LUT-driven lexer -> single-pass Pratt parser emitting SSA-versioned bytecode directly -> peephole constant-folding optimiser -> token-threaded interpreter with two layers of adaptive specialisation.
No AST, no IR: bytecode is the only intermediate representation. Around 13,000 lines of Rust; production deps are hashbrown and itoa (SHA-256 in-tree). WASM build adds lol_alloc for a single-threaded leaking bump allocator.
Classes support single-level inheritance, super(), full dunder dispatch, @property / @x.setter. Functional-first: composition preferred, monomorphic dispatch optimised via instance-dunder IC.
Concepts
- Offset-based tokens:
(kind, line, start, end)indices into the source buffer. No string copies during lexing; parser slices lazily. - Single-pass SSA codegen: Variables versioned per assignment (
x_1,x_2). Control-flow joins emitPhiopcodes resolved at runtime. - Token-threaded dispatch:
Vec<Instruction>where each is(opcode: OpCode, operand: u16). Hot loop is a flatmatch; Rust lowers it to a jump table. Not direct threading (computed-goto isn’t available in safe Rust). - Per-instruction inline caching: Each binary op records operand type tags. After
QUICK_THRESH = 4stable hits the IC stores a typedFastOp(AddInt,AddFloat,AddStr,LtFloat,EqStr,ModInt, …) as a speculative fast path with type-guard deopt. - Template memoisation: pure user functions cache
(args) -> resultafterTPL_THRESH = 2hits, capped at 256 entries each. Gated on no-kw calls, byte-stable args (mutable containers disqualify), and an impurity-free body (purity detection in Syntax). Hashing is an FNV-like fold over rawVal.0bits with a value-eq verify. - NaN-boxed values:
Valis a 64-bit union: 47-bit signed ints (inline), IEEE-754 floats (NaNs canonicalised), bools, None, an undef sentinel, and 28-bit heap indices. - Mark-and-sweep GC: Triggered when
live >= gc_thresholdoralloc_count >= max(live/4, 4096). After each sweepgc_threshold = max(live * 2, 512). Roots: stack, with-stack, yields, event queue, slots and live-slot snapshots, slot templates, globals, every iterator frame’siter_stack, opcode-cache constants, active const pools, function templates.
Bytecode shape
Each Instruction is 4 bytes: 1-byte OpCode (#[repr(u8)] planned), 2-byte operand, 1 byte padding. Opcodes span 17 categories: load, store, arith, bitwise, compare, logic, identity, control flow, iter, build, container, comprehension, function, ssa (Phi), yield, side effects, unsupported (raises at runtime). Around 40 specialised Call* variants for hot builtins; LoadAttr + Call(0) pairs fuse into CallMethod + CallMethodArgs after first dispatch.
OpCode::LoadConst -> operand = constant index
OpCode::LoadName -> operand = name slot
OpCode::StoreName -> operand = name slot
OpCode::Add / Sub -> operand = 0 (IC slot derived from ip)
OpCode::Call -> operand = (kw << 8) | pos
OpCode::Phi -> operand = target slot, sources in chunk.phi_sources
OpCode::ForIter -> operand = jump target on iterator exhaustionDispatch shape
Hot loop reads cache.fused_ref()[ip], a snapshot of the instruction stream with LoadAttr + Call(0) pairs fused into CallMethod + CallMethodArgs. Fusion runs once per chunk, cached.
For arith/compare opcodes, the loop checks cache.get_fast(ip): if a FastOp is present, it runs inline without a function call; type-guard miss invalidates the slot and falls back to the generic handler. IC is per-instruction, so monomorphic sites stabilise independently.
LoadConst reads a pre-materialised Vec<Val> (OpcodeCache::const_vals) built on first dispatch. Inline-range ints (47-bit) stored inline; ints from 2⁴⁷ to 2¹²⁷ allocate a HeapObj::LongInt slot. Literals beyond ±2¹²⁷ are rejected at parse time.
Memory model
Val is 64 bits NaN-boxed (QNAN = 0x7FFC_0000_0000_0000, SIGN = 0x8000...):
| Tag | Pattern | Notes |
|---|---|---|
| Float | any non-canonical IEEE-754 | Quiet NaNs remapped to 0x7FF8... |
| Int | QNAN | SIGN | i48 | 47-bit signed inline; auto-promotes to HeapObj::LongInt (i128) on overflow |
| Undef | QNAN | Unbound-local sentinel |
| None | QNAN | 1 | |
| True | QNAN | 2 | |
| False | QNAN | 3 | |
| Heap | QNAN | 4 | (i28 << 4) | 28-bit index into HeapPool (max 1 << 28 slots) |
INT_MAX = 140_737_488_355_327, INT_MIN = -140_737_488_355_328. Inline ints take one ALU op per arithmetic; overflow promotes to HeapObj::LongInt(i128) until results fit inline again. LongInts are interned by value so equal values share a heap index (consistent hash/eq). Hard cap ±2¹²⁷; wider raises OverflowError. Arbitrary-precision bigints would need a Vec<u32>-limb variant (heap-allocs per op, Knuth D / Karatsuba code) or dropping NaN-boxing, both regress WASM-size and inner-loop goals.
PartialEq / Hash for Val reconcile value-equal numerics so 1 == 1.0 and hash(1) == hash(1.0) — dicts/sets see them as one key. Hash writes an inline int (and any integral float in range) as its i64 value, falling back to f64 bits only for non-integral floats; hashing the f64 bits directly would funnel small integers (whose low mantissa bits are zero) into one FxHasher bucket and collapse int-keyed dict/set to O(n²). FxBuildHasher uses a fixed seed for reproducible iteration order across runs.
Heap is a Vec<HeapSlot> arena with a free list (capped 524,288, sorted to prefer low indices). Strings, bytes (<=128 B), and LongInts are interned in side hashes, so equal values collapse to one slot, short literals short-circuit through identity, and dict/set lookups stay consistent across allocations. Live-object cap is Limits.heap (default 10M; sandbox 100K). Single-colour mark-and-sweep, no refcount, cycles reclaimed natively.
HeapObj variants: Str, Bytes, List (Rc<RefCell<Vec<Val>>>), Dict (insertion-ordered), Set, FrozenSet, Tuple, Func(fn_idx, defaults, captures), Range, Slice, Ellipsis (singleton, distinct from '...'), Type, ExcInstance, BoundMethod, NativeFn, Class(name, members), Instance(class, attrs), BoundUserMethod(recv, fn), Coroutine(ip, slots, stack, body, iter_stack, sync_frames) (shared by generators, async def, and the implicit module-body coro; body is BodyRef::Fn(usize) or BodyRef::Module; sync_frames stacks suspended sync sub-calls so a plain def hitting a yielding builtin can resume mid-body), Module(spec, attrs), Extern(Arc<dyn Fn>).
What the compiler intentionally does not do
- No SSA-wide constant propagation through
LoadName(preserved to keep IC, super-op, template paths fast). - No CSE, GVN, LICM, inlining, branch DCE, loop folding. Optimiser is constant folding + phi-noop elimination + dead-instruction compaction with jump-operand remap.
- No dead-store elimination beyond what falls out of constant folding.
- No IR, one representation between source and dispatch.
- No JIT (single-tier, pure Rust). Method JITs need per-arch stencils; trace JITs duplicate the execution model and complicate GC.
- No runtime module system — imports resolve at parse time through a host-injected
Resolver. See Imports. - No bigints, complex numbers,
bytearray,memoryview,Decimal,Fraction. Nogen.send/throw/close. Noasynciomodule — concurrency primitives are top-level builtins (Async).
Coroutine and context-manager dispatch
async def and yield-bearing def both produce HeapObj::Coroutine. run() drives the scheduler; the other primitives are top-level builtins (Async).
A plain def inside a coroutine calling a yielding builtin gets its state (ip, slots, stack/iter deltas) snapshotted as a SyncFrame pushed on the enclosing Coroutine’s sync_frames (innermost-last). resume_coroutine walks this stack inside-out before re-entering the outer body, so each helper’s return value lands at the original Call site; otherwise the outer’s resume_ip would skip past the unfinished helper.
vm.run() wraps the module body as an implicit coroutine with BodyRef::Module; top-level statements suspend like async def bodies. Single-driver: top_loop is the only place that picks coros; run / gather / with_timeout are non-driving: they push targets to the scheduler, park the outer in CoroState::WaitingForChildren { tasks, kind: WaitKind }, and yield. WaitKind picks finalize behavior: Run(target) returns its value, Gather returns the list of results, and Timeout { deadline_ns, target } enforces a deadline. wake_waiting_outers (gated by waiting_for_children_count) drops terminal children, splices the result into the outer’s saved stack placeholder, marks the outer Ready, or raise_into_outer injects the exception.
Coroutines carry their own exception_frames (7th tuple field). resume_coroutine denormalises stored depths (relative to saved_stack_len / saved_iter_len) on entry, pushes them onto the live exception_stack, renormalises on yield-save; dispatch.rs::exec honors pending_exec_exc_base so handler search includes restored frames. Net: try/except survives yields; try: run(coro) except E: catches a child’s raise across multiple run_resume cycles. SyncFrame.exception_delta does the same for sync-helper try blocks spanning a yield.
with invokes __enter__ / __exit__(exc_type, exc_val, traceback); truthy __exit__ return suppresses. async with reuses sync __enter__ / __exit__ (no async dunders).
References
- Aho, Sethi & Ullman. Compilers: Principles, Techniques and Tools (1986). LUT-based lexer.
- Pratt. Top Down Operator Precedence (POPL 1973).
- Cytron et al. Efficiently Computing Static Single Assignment Form (TOPLAS 1991).
- Gudeman. Representing Type Information in Dynamically Typed Languages (1993). NaN-boxing.
- Deutsch & Schiffman. Efficient Implementation of the Smalltalk-80 System (POPL 1984). Inline caching.
- Ertl & Gregg. The Structure and Performance of Efficient Interpreters (JILP 2003). Threaded dispatch.
- Casey et al. Towards Superinstructions for Java Interpreters (SCOPES 2003). LoadAttr+Call fusion.
- Michie. Memo Functions and Machine Learning (Nature 1968). Pure-function memoization.
- McCarthy. Recursive Functions of Symbolic Expressions (CACM 1960). Mark-sweep GC.
- Backus. Can Programming Be Liberated from the von Neumann Style? (CACM 1978). Function-level paradigm.