
Commit be91529

Specialized ops (RustPython#7322)

* Add CALL_ALLOC_AND_ENTER_INIT specialization

  Optimizes user-defined class instantiation MyClass(args...) when
  tp_new == object.__new__ and __init__ is a simple PyFunction. Allocates
  the object directly and calls __init__ via invoke_exact_args, bypassing
  the generic type.__call__ dispatch path.

* Invalidate JIT cache when __code__ is reassigned

  Change jitted_code from OnceCell to PyMutex<Option<CompiledCode>> so it
  can be cleared on __code__ assignment. The setter now sets the cached
  JIT code to None to prevent executing stale machine code.

* Atomic operations for specialization cache

  - range iterator: deduplicate fast_next/next_fast
  - Replace raw pointer reads/writes in CodeUnits with atomic operations
    (AtomicU8/AtomicU16) for thread safety
  - Add read_op (Acquire), read_arg (Relaxed), compare_exchange_op
  - Use Release ordering in replace_op to synchronize cache writes
  - Dispatch loop reads opcodes atomically via read_op/read_arg
  - Fix adaptive counter access: use read/write_adaptive_counter instead
    of read/write_cache_u16 (which was reading the wrong bytes)
  - Add pre-check guards to all specialize_* functions to prevent
    concurrent specialization races
  - Move modified() before attribute changes in type.__setattr__ to
    prevent use-after-free of cached descriptors
  - Use SeqCst ordering in modified() for version invalidation
  - Add Release fence after quicken() initialization

* Fix slot wrapper override for inherited attributes

  For __getattribute__: only use getattro_wrapper when the type itself
  defines the attribute; otherwise inherit the native slot from the base
  class via the MRO. For __setattr__/__delattr__: only store
  setattro_wrapper when the type has its own __setattr__ or __delattr__;
  otherwise keep the inherited base slot.

* Fix StoreAttrSlot cache overflow corrupting the next instruction

  write_cache_u32 at cache_base+3 writes 2 code units (positions 3 and 4),
  but STORE_ATTR only has 4 cache entries (positions 0-3). This overwrote
  the next instruction with the upper 16 bits of the slot offset. Changed
  to write_cache_u16/read_cache_u16 since member descriptor offsets fit
  within a u16 (max 65535 bytes).

* Exclude method_descriptor from has_python_cmp check

  has_python_cmp incorrectly treated method_descriptor attributes as
  Python-level comparison methods, causing the richcompare slot to use
  wrapper dispatch instead of inheriting the native slot.

* Fix CompareOpFloat NaN handling

  partial_cmp returns None for NaN comparisons, so is_some_and returned
  false for every NaN comparison, but NaN != x should be true per
  IEEE 754 semantics.

* Fix invoke_exact_args borrow in CallAllocAndEnterInit

* Distinguish Python method vs not-found in slot MRO lookup

  Change lookup_slot_in_mro to return a 3-state SlotLookupResult enum
  (NativeSlot/PythonMethod/NotFound) instead of Option<T>. Previously,
  both "found a Python-level method" and "found nothing" returned None,
  causing incorrect slot inheritance. For example, class Test(Mixin,
  TestCase) would inherit object.slot_init from Mixin via inherit_from_mro
  instead of using init_wrapper to dispatch TestCase.__init__. Apply this
  fix consistently to all slot update sites: update_main_slot!,
  update_sub_slot!, TpGetattro, TpSetattro, TpDescrSet, TpHash,
  TpRichcompare, SqAssItem, MpAssSubscript.

* Extract specialization helper functions to reduce boilerplate

  - deoptimize() / deoptimize_at(): replace a specialized op with its base op
  - adaptive(): decrement the warmup counter or call the specialize function
  - commit_specialization(): replace the op on success, back off on failure
  - execute_binary_op_int() / execute_binary_op_float(): typed binary ops

  Removes 10 duplicate deoptimize_* functions; consolidates 13 adaptive
  counter blocks, 6 binary op handlers, and 7 specialize tail patterns.
  Also replaces inline deopt blocks in the LoadAttr/StoreAttr handlers.

* Improve specialization guards and fix mark_stacks

  - CONTAINS_OP_SET: add frozenset support in the handler and specializer
  - TO_BOOL_ALWAYS_TRUE: cache the type version instead of checking slots
  - LOAD_GLOBAL_BUILTIN: cache the builtins dict version alongside globals
  - mark_stacks: deoptimize specialized opcodes for correct reachability

* Auto-format: cargo fmt --all

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
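The core synchronization idea of this commit, publishing a specialized opcode with Release so that Acquire readers of the opcode byte also see the inline-cache data written just before it, can be sketched in isolation. The `Unit`, `specialize`, and `dispatch` names below are illustrative stand-ins, not RustPython's actual types:

```rust
use std::sync::atomic::{AtomicU8, AtomicU16, Ordering};

// Hypothetical two-byte code unit: an opcode byte plus an inline-cache slot.
struct Unit {
    op: AtomicU8,
    cache: AtomicU16,
}

// Publish: write the cache first (Relaxed), then store the specialized
// opcode with Release so the cache write happens-before any Acquire load
// that observes the new opcode.
fn specialize(u: &Unit, new_op: u8, cache: u16) {
    u.cache.store(cache, Ordering::Relaxed);
    u.op.store(new_op, Ordering::Release);
}

// Dispatch: read the opcode with Acquire; if the specialized opcode is
// observed, the matching cache value is guaranteed to be visible.
fn dispatch(u: &Unit) -> (u8, u16) {
    let op = u.op.load(Ordering::Acquire);
    let cache = u.cache.load(Ordering::Relaxed);
    (op, cache)
}
```

This is the same pairing the diff's SAFETY comment describes: `replace_op` (Release) on the writer side, the dispatch loop's `read_op` (Acquire) on the reader side.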
1 parent 6d3fa2d commit be91529

7 files changed

Lines changed: 905 additions & 859 deletions


crates/compiler-core/src/bytecode.rs

Lines changed: 88 additions & 36 deletions
@@ -8,7 +8,12 @@ use crate::{
 };
 use alloc::{borrow::ToOwned, boxed::Box, collections::BTreeSet, fmt, string::String, vec::Vec};
 use bitflags::bitflags;
-use core::{cell::UnsafeCell, hash, mem, ops::Deref};
+use core::{
+    cell::UnsafeCell,
+    hash, mem,
+    ops::Deref,
+    sync::atomic::{AtomicU8, AtomicU16, Ordering},
+};
 use itertools::Itertools;
 use malachite_bigint::BigInt;
 use num_complex::Complex64;
@@ -367,8 +372,13 @@ impl TryFrom<&[u8]> for CodeUnit {
 
 pub struct CodeUnits(UnsafeCell<Box<[CodeUnit]>>);
 
-// SAFETY: All mutation of the inner buffer is serialized by `monitoring_data: PyMutex`
-// in `PyCode`. The `UnsafeCell` is required because `replace_op` mutates through `&self`.
+// SAFETY: All cache operations use atomic read/write instructions.
+// - replace_op / compare_exchange_op: AtomicU8 store/CAS (Release)
+// - cache read/write: AtomicU16 load/store (Relaxed)
+// - adaptive counter: AtomicU8 load/store (Relaxed)
+// Ordering is established by:
+// - replace_op (Release) ↔ dispatch loop read_op (Acquire) for cache data visibility
+// - tp_version_tag (Acquire) for descriptor pointer validity
 unsafe impl Sync for CodeUnits {}
 
 impl Clone for CodeUnits {
@@ -435,45 +445,81 @@ impl Deref for CodeUnits {
 
 impl CodeUnits {
     /// Replace the opcode at `index` in-place without changing the arg byte.
+    /// Uses atomic Release store to ensure prior cache writes are visible
+    /// to threads that subsequently read the new opcode with Acquire.
     ///
     /// # Safety
    /// - `index` must be in bounds.
     /// - `new_op` must have the same arg semantics as the original opcode.
-    /// - The caller must ensure exclusive access to the instruction buffer
-    ///   (no concurrent reads or writes to the same `CodeUnits`).
     pub unsafe fn replace_op(&self, index: usize, new_op: Instruction) {
-        unsafe {
-            let units = &mut *self.0.get();
-            let unit_ptr = units.as_mut_ptr().add(index);
-            // Write only the opcode byte (first byte of CodeUnit due to #[repr(C)])
-            let op_ptr = unit_ptr as *mut u8;
-            core::ptr::write(op_ptr, new_op.into());
-        }
+        let units = unsafe { &*self.0.get() };
+        let ptr = units.as_ptr().wrapping_add(index) as *const AtomicU8;
+        unsafe { &*ptr }.store(new_op.into(), Ordering::Release);
+    }
+
+    /// Atomically replace opcode only if it still matches `expected`.
+    /// Returns true on success. Uses Release ordering on success.
+    ///
+    /// # Safety
+    /// - `index` must be in bounds.
+    pub unsafe fn compare_exchange_op(
+        &self,
+        index: usize,
+        expected: Instruction,
+        new_op: Instruction,
+    ) -> bool {
+        let units = unsafe { &*self.0.get() };
+        let ptr = units.as_ptr().wrapping_add(index) as *const AtomicU8;
+        unsafe { &*ptr }
+            .compare_exchange(
+                expected.into(),
+                new_op.into(),
+                Ordering::Release,
+                Ordering::Relaxed,
+            )
+            .is_ok()
+    }
+
+    /// Atomically read the opcode at `index` with Acquire ordering.
+    /// Pairs with `replace_op` (Release) to ensure cache data visibility.
+    pub fn read_op(&self, index: usize) -> Instruction {
+        let units = unsafe { &*self.0.get() };
+        let ptr = units.as_ptr().wrapping_add(index) as *const AtomicU8;
+        let byte = unsafe { &*ptr }.load(Ordering::Acquire);
+        // SAFETY: Only valid Instruction values are stored via replace_op/compare_exchange_op.
+        unsafe { mem::transmute::<u8, Instruction>(byte) }
+    }
+
+    /// Atomically read the arg byte at `index` with Relaxed ordering.
+    pub fn read_arg(&self, index: usize) -> OpArgByte {
+        let units = unsafe { &*self.0.get() };
+        let ptr = units.as_ptr().wrapping_add(index) as *const u8;
+        let arg_ptr = unsafe { ptr.add(1) } as *const AtomicU8;
+        OpArgByte::from(unsafe { &*arg_ptr }.load(Ordering::Relaxed))
     }
 
     /// Write a u16 value into a CACHE code unit at `index`.
     /// Each CodeUnit is 2 bytes (#[repr(C)]: op u8 + arg u8), so one u16 fits exactly.
+    /// Uses Relaxed atomic store; ordering is provided by replace_op (Release).
     ///
     /// # Safety
     /// - `index` must be in bounds and point to a CACHE entry.
-    /// - The caller must ensure no concurrent reads/writes to the same slot.
     pub unsafe fn write_cache_u16(&self, index: usize, value: u16) {
-        unsafe {
-            let units = &mut *self.0.get();
-            let ptr = units.as_mut_ptr().add(index) as *mut u8;
-            core::ptr::write_unaligned(ptr as *mut u16, value);
-        }
+        let units = unsafe { &*self.0.get() };
+        let ptr = units.as_ptr().wrapping_add(index) as *const AtomicU16;
+        unsafe { &*ptr }.store(value, Ordering::Relaxed);
     }
 
     /// Read a u16 value from a CACHE code unit at `index`.
+    /// Uses Relaxed atomic load; ordering is provided by read_op (Acquire).
     ///
     /// # Panics
     /// Panics if `index` is out of bounds.
     pub fn read_cache_u16(&self, index: usize) -> u16 {
         let units = unsafe { &*self.0.get() };
         assert!(index < units.len(), "read_cache_u16: index out of bounds");
-        let ptr = units.as_ptr().wrapping_add(index) as *const u8;
-        unsafe { core::ptr::read_unaligned(ptr as *const u16) }
+        let ptr = units.as_ptr().wrapping_add(index) as *const AtomicU16;
+        unsafe { &*ptr }.load(Ordering::Relaxed)
     }
 
     /// Write a u32 value across two consecutive CACHE code units starting at `index`.
@@ -518,36 +564,40 @@ impl CodeUnits {
         lo | (hi << 32)
     }
 
-    /// Read the adaptive counter from the first CACHE entry's `arg` byte.
-    /// This preserves `op = Instruction::Cache`, unlike `read_cache_u16`.
+    /// Read the adaptive counter from the CACHE entry's `arg` byte at `index`.
+    /// Uses Relaxed atomic load.
     pub fn read_adaptive_counter(&self, index: usize) -> u8 {
         let units = unsafe { &*self.0.get() };
-        u8::from(units[index].arg)
+        let ptr = units.as_ptr().wrapping_add(index) as *const u8;
+        let arg_ptr = unsafe { ptr.add(1) } as *const AtomicU8;
+        unsafe { &*arg_ptr }.load(Ordering::Relaxed)
     }
 
-    /// Write the adaptive counter to the first CACHE entry's `arg` byte.
-    /// This preserves `op = Instruction::Cache`, unlike `write_cache_u16`.
+    /// Write the adaptive counter to the CACHE entry's `arg` byte at `index`.
+    /// Uses Relaxed atomic store.
     ///
     /// # Safety
     /// - `index` must be in bounds and point to a CACHE entry.
     pub unsafe fn write_adaptive_counter(&self, index: usize, value: u8) {
-        let units = unsafe { &mut *self.0.get() };
-        units[index].arg = OpArgByte::from(value);
+        let units = unsafe { &*self.0.get() };
+        let ptr = units.as_ptr().wrapping_add(index) as *const u8;
+        let arg_ptr = unsafe { ptr.add(1) } as *const AtomicU8;
+        unsafe { &*arg_ptr }.store(value, Ordering::Relaxed);
     }
 
     /// Produce a clean copy of the bytecode suitable for serialization
     /// (marshal) and `co_code`. Specialized opcodes are mapped back to their
     /// base variants via `deoptimize()` and all CACHE entries are zeroed.
     pub fn original_bytes(&self) -> Vec<u8> {
-        let units = unsafe { &*self.0.get() };
-        let mut out = Vec::with_capacity(units.len() * 2);
-        let len = units.len();
+        let len = self.len();
+        let mut out = Vec::with_capacity(len * 2);
         let mut i = 0;
         while i < len {
-            let op = units[i].op.deoptimize();
+            let op = self.read_op(i).deoptimize();
+            let arg = self.read_arg(i);
             let caches = op.cache_entries();
             out.push(u8::from(op));
-            out.push(u8::from(units[i].arg));
+            out.push(u8::from(arg));
             // Zero-fill all CACHE entries (counter + cached data)
             for _ in 0..caches {
                 i += 1;
@@ -562,20 +612,22 @@ impl CodeUnits {
     /// Initialize adaptive warmup counters for all cacheable instructions.
     /// Called lazily at RESUME (first execution of a code object).
     /// Uses the `arg` byte of the first CACHE entry, preserving `op = Instruction::Cache`.
+    /// All writes are atomic (Relaxed) to avoid data races with concurrent readers.
     pub fn quicken(&self) {
-        let units = unsafe { &mut *self.0.get() };
-        let len = units.len();
+        let len = self.len();
         let mut i = 0;
         while i < len {
-            let op = units[i].op;
+            let op = self.read_op(i);
             let caches = op.cache_entries();
             if caches > 0 {
                 // Don't write adaptive counter for instrumented opcodes;
                 // specialization is skipped while monitoring is active.
                 if !op.is_instrumented() {
                     let cache_base = i + 1;
                     if cache_base < len {
-                        units[cache_base].arg = OpArgByte::from(ADAPTIVE_WARMUP_VALUE);
+                        unsafe {
+                            self.write_adaptive_counter(cache_base, ADAPTIVE_WARMUP_VALUE);
+                        }
                     }
                 }
             }
             i += 1 + caches;
crates/vm/src/builtins/frame.rs

Lines changed: 2 additions & 2 deletions
@@ -182,8 +182,8 @@ pub(crate) mod stack_analysis {
             }
             oparg = (oparg << 8) | u32::from(u8::from(instructions[i].arg));
 
-            // De-instrument: get the underlying real instruction
-            let opcode = opcode.to_base().unwrap_or(opcode);
+            // De-instrument and de-specialize: get the underlying base instruction
+            let opcode = opcode.to_base().unwrap_or(opcode).deoptimize();
 
             let caches = opcode.cache_entries();
             let next_i = i + 1 + caches;
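The `deoptimize()` call added here maps every specialized opcode back to its canonical base form so stack-effect analysis reasons about one instruction set. A minimal sketch of that mapping, with an invented opcode enum (RustPython's real `Instruction` type is much larger):

```rust
// Illustrative opcode enum; the variant names are hypothetical.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Op {
    LoadAttr,
    LoadAttrSlot,   // specialized form of LoadAttr
    CompareOp,
    CompareOpFloat, // specialized form of CompareOp
}

impl Op {
    // Map a specialized opcode back to its base form so analysis passes
    // (like mark_stacks) see the canonical instruction; base opcodes map
    // to themselves.
    fn deoptimize(self) -> Op {
        match self {
            Op::LoadAttrSlot => Op::LoadAttr,
            Op::CompareOpFloat => Op::CompareOp,
            other => other,
        }
    }
}
```

Because the mapping is total and idempotent, it is safe to apply unconditionally in the analysis loop.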

crates/vm/src/builtins/function.rs

Lines changed: 9 additions & 7 deletions
@@ -5,8 +5,6 @@ use super::{
     PyAsyncGen, PyCode, PyCoroutine, PyDictRef, PyGenerator, PyModule, PyStr, PyStrRef, PyTuple,
     PyTupleRef, PyType,
 };
-#[cfg(feature = "jit")]
-use crate::common::lock::OnceCell;
 use crate::common::lock::PyMutex;
 use crate::function::ArgMapping;
 use crate::object::{PyAtomicRef, Traverse, TraverseFn};
@@ -75,7 +73,7 @@ pub struct PyFunction {
     doc: PyMutex<PyObjectRef>,
     func_version: AtomicU32,
     #[cfg(feature = "jit")]
-    jitted_code: OnceCell<CompiledCode>,
+    jitted_code: PyMutex<Option<CompiledCode>>,
 }
 
 static FUNC_VERSION_COUNTER: AtomicU32 = AtomicU32::new(1);
@@ -214,7 +212,7 @@ impl PyFunction {
             doc: PyMutex::new(doc),
             func_version: AtomicU32::new(next_func_version()),
             #[cfg(feature = "jit")]
-            jitted_code: OnceCell::new(),
+            jitted_code: PyMutex::new(None),
         };
         Ok(func)
     }
@@ -538,7 +536,7 @@ impl Py<PyFunction> {
         vm: &VirtualMachine,
     ) -> PyResult {
         #[cfg(feature = "jit")]
-        if let Some(jitted_code) = self.jitted_code.get() {
+        if let Some(jitted_code) = self.jitted_code.lock().as_ref() {
             use crate::convert::ToPyObject;
             match jit::get_jit_args(self, &func_args, jitted_code, vm) {
                 Ok(args) => {
@@ -712,6 +710,10 @@ impl PyFunction {
     #[pygetset(setter)]
     fn set___code__(&self, code: PyRef<PyCode>, vm: &VirtualMachine) {
         self.code.swap_to_temporary_refs(code, vm);
+        #[cfg(feature = "jit")]
+        {
+            *self.jitted_code.lock() = None;
+        }
         self.func_version.store(0, Relaxed);
     }
 
@@ -948,15 +950,15 @@ impl PyFunction {
     #[cfg(feature = "jit")]
     #[pymethod]
     fn __jit__(zelf: PyRef<Self>, vm: &VirtualMachine) -> PyResult<()> {
-        if zelf.jitted_code.get().is_some() {
+        if zelf.jitted_code.lock().is_some() {
             return Ok(());
         }
         let arg_types = jit::get_jit_arg_types(&zelf, vm)?;
         let ret_type = jit::jit_ret_type(&zelf, vm)?;
         let code: &Py<PyCode> = &zelf.code;
         let compiled = rustpython_jit::compile(&code.code, &arg_types, ret_type)
            .map_err(|err| jit::new_jit_error(err.to_string(), vm))?;
-        let _ = zelf.jitted_code.set(compiled);
+        *zelf.jitted_code.lock() = Some(compiled);
         Ok(())
     }
 }
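The reason for swapping `OnceCell` for a mutex-guarded `Option` is that a write-once cell cannot be cleared, but `__code__` reassignment must be able to drop stale JIT output. A minimal sketch of the pattern using the standard library's `Mutex` (the `Compiled`, `Func`, and version fields are invented for illustration):

```rust
use std::sync::Mutex;

// Stand-in for compiled machine code; tagged with the code version it
// was built from, purely for demonstration.
struct Compiled(u32);

struct Func {
    code_version: u32,
    jitted: Mutex<Option<Compiled>>,
}

impl Func {
    // Compile lazily. A later set_code must be able to clear this cache,
    // which a write-once cell cannot express -- hence Mutex<Option<_>>.
    fn jit(&self) {
        let mut guard = self.jitted.lock().unwrap();
        if guard.is_none() {
            *guard = Some(Compiled(self.code_version));
        }
    }

    // Reassigning __code__ invalidates the cached JIT output so stale
    // machine code is never executed against the new code object.
    fn set_code(&mut self, version: u32) {
        self.code_version = version;
        *self.jitted.lock().unwrap() = None;
    }

    fn jitted_version(&self) -> Option<u32> {
        self.jitted.lock().unwrap().as_ref().map(|c| c.0)
    }
}
```

RustPython uses its own `PyMutex` rather than `std::sync::Mutex`, but the shape of the API is the same.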

crates/vm/src/builtins/type.rs

Lines changed: 6 additions & 3 deletions
@@ -347,7 +347,7 @@ impl PyType {
         if old_version == 0 {
             return;
         }
-        self.tp_version_tag.store(0, Ordering::Release);
+        self.tp_version_tag.store(0, Ordering::SeqCst);
         // Release strong references held by cache entries for this version.
         // We hold owned refs that would prevent GC of class attributes after
         // type deletion.
@@ -2168,6 +2168,11 @@ impl SetAttr for PyType {
         }
         let assign = value.is_assign();
 
+        // Invalidate inline caches before modifying attributes.
+        // This ensures other threads see the version invalidation before
+        // any attribute changes, preventing use-after-free of cached descriptors.
+        zelf.modified();
+
         if let PySetterValue::Assign(value) = value {
             zelf.attributes.write().insert(attr_name, value);
         } else {
@@ -2180,8 +2185,6 @@ impl SetAttr for PyType {
                 )));
             }
         }
-        // Invalidate inline caches that depend on this type's attributes
-        zelf.modified();
 
         if attr_name.as_wtf8().starts_with("__") && attr_name.as_wtf8().ends_with("__") {
             if assign {
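The key ordering rule in this hunk is invalidate-before-mutate: the version tag is zeroed *before* the attribute table changes, so any reader that validates the tag after the store cannot trust a cache entry built against the old attributes. A small single-threaded sketch of that discipline (the `TypeAttrs` struct and its fields are hypothetical):

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::sync::atomic::{AtomicU32, Ordering};

// Sketch of the invalidate-before-mutate rule from type.__setattr__.
struct TypeAttrs {
    // 0 means "no valid version": cached lookups keyed on the old tag
    // must be discarded.
    version: AtomicU32,
    attrs: Mutex<HashMap<String, i64>>,
}

impl TypeAttrs {
    fn set_attr(&self, name: &str, value: i64) {
        // Step 1: invalidate caches first, with SeqCst so the store is
        // globally ordered before the mutation below.
        self.version.store(0, Ordering::SeqCst);
        // Step 2: only then mutate the attribute table.
        self.attrs.lock().unwrap().insert(name.to_string(), value);
    }
}
```

Doing these two steps in the opposite order is exactly the use-after-free window the commit closes: a racing reader could validate the stale version tag while holding a descriptor pointer that the mutation just replaced.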

crates/vm/src/dict_inner.rs

Lines changed: 6 additions & 3 deletions
@@ -19,7 +19,10 @@ use crate::{
 use alloc::fmt;
 use core::mem::size_of;
 use core::ops::ControlFlow;
-use core::sync::atomic::{AtomicU64, Ordering::Relaxed};
+use core::sync::atomic::{
+    AtomicU64,
+    Ordering::{Acquire, Release},
+};
 use num_traits::ToPrimitive;
 
 // HashIndex is intended to be same size with hash::PyHash
@@ -261,12 +264,12 @@ type PopInnerResult<T> = ControlFlow<Option<DictEntry<T>>>;
 impl<T: Clone> Dict<T> {
     /// Monotonically increasing version counter for mutation tracking.
     pub fn version(&self) -> u64 {
-        self.version.load(Relaxed)
+        self.version.load(Acquire)
     }
 
     /// Bump the version counter after any mutation.
     fn bump_version(&self) {
-        self.version.fetch_add(1, Relaxed);
+        self.version.fetch_add(1, Release);
     }
 
     fn read(&self) -> PyRwLockReadGuard<'_, DictInner<T>> {
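The upgrade from Relaxed to Release/Acquire on the dict version counter gives specialized opcodes like LOAD_GLOBAL_BUILTIN a sound validation protocol: a bump with Release after a mutation, an Acquire load during cache validation. A minimal stand-alone sketch (the `VersionedDict` name is illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of the versioned-dict protocol: mutations bump the counter with
// Release, and cache validation loads it with Acquire, so observing an
// unchanged version implies the dict contents the cache was built from
// are still the visible ones.
struct VersionedDict {
    version: AtomicU64,
}

impl VersionedDict {
    fn version(&self) -> u64 {
        self.version.load(Ordering::Acquire)
    }

    fn bump_version(&self) -> u64 {
        // fetch_add returns the value *before* the increment.
        self.version.fetch_add(1, Ordering::Release)
    }
}
```

A cache entry stores the version it observed; on the fast path the interpreter reloads the counter and deoptimizes if it no longer matches.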
