Replace std::mutex on overflow buffers with atomic_flag spinlock by Mattbusel · Pull Request #771 · taskflow/taskflow

Mattbusel · 2026-03-15T07:59:06Z

Closes #770 (requested: make UnboundedWSQ safe for concurrent push without external locking).

Problem

Every push to an overflow buffer acquires a std::mutex, incurring a futex syscall on Linux or SRWLock contention on Windows, plus potential thread parking. The steal path is already lock-free via CAS on _top. Every task spill pays kernel overhead that every steal does not.

// Before: kernel-weight lock on every push
struct Buffer {
  std::mutex mutex;
  UnboundedWSQ<Node*> queue;
};

Solution

Replace the external std::mutex with a per-queue std::atomic_flag spinlock (_resize_lock) inside UnboundedWSQ. Remove the mutex from Buffer entirely.

push acquires the flag with test_and_set(acquire) and _mm_pause spinning, writes the item to the array slot, advances _bottom with a release store, then clears the flag. Total time under lock: one array slot write plus one atomic store, measured in nanoseconds on an uncontended cache line.

resize acquires the same flag before allocating and copying. Since both operations hold the same flag, resize can never observe a slot that a pusher has claimed but not yet written.

steal, pop, and steal_with_feedback are the original Chase-Lev algorithm, completely unchanged.

// After: spinlock held for nanoseconds, steal path untouched
struct Buffer {
  UnboundedWSQ<Node*> queue;  // push internally serialized via _resize_lock
};

Key design insight

Items are written to the array before _bottom is advanced. A stealer that loads _bottom with acquire synchronizes with the pusher's _bottom.store(release) via the C++ release-sequence rule. By the time any stealer sees the new _bottom value, the item is already visible. No spin-wait is needed. No null sentinel is needed. No slot clearing after consumption is needed.

Why fully lock-free push is not viable with dynamic resize

Three approaches were implemented and stress-tested before reaching this design. Each fails due to a structural race between the array pointer and concurrent writes.

Attempt 1: CAS on _bottom, re-read _array in steal.
The pusher CAS-claims slot b, then loads _array and writes to it. A resize can install a new array after the CAS but before the write. The resize copies null for slot b (nothing written yet) into the new array. The pusher writes to the old array. The stealer loads the new array and sees null at slot b forever.

Attempt 2: Resize spins on null slots before copying.
Resize spin-waits for each null slot to become non-null before copying it. But after resize publishes the new array, the pusher re-reads _array and now holds a pointer to the new array, writing its item there. Resize is spinning on the old array waiting for a write that went to the new one. Deadlock.

Attempt 3: Stable-write loop.
After writing, the pusher checks if _array changed and rewrites if so. A second resize can occur between the check and the rewrite, installing a third array that copied null from the second. The stealer reads the third array and sees null.

Root cause: resize and multi-producer push share the array pointer. Any operation that reads the pointer and then writes based on that read can be invalidated by a resize between the read and the write. Mutual exclusion on the write path is structurally required. The steal path has no such constraint and remains lock-free.

Spinlock contention profile

The spinlock contends in exactly two situations:

Two pushers arrive simultaneously. One spins for nanoseconds while the other completes its write. This is the hot-path case.
A pusher and a resize arrive simultaneously. This occurs at most O(log N) times over the queue lifetime, where N is the total items ever pushed.

The steal path never touches the spinlock.

Changes

taskflow/core/wsq.hpp

Added std::atomic_flag _resize_lock member.
push(T): acquire spinlock, resize if needed, write item, _bottom.store(release), release.
bulk_push(I, N): acquire spinlock, resize loop until N items fit, write N items, _bottom.store(release), release.
steal(), pop(), steal_with_feedback(): unchanged from original Chase-Lev.
Removed Array::clear() and null-sentinel machinery.

taskflow/core/executor.hpp

Removed std::mutex mutex from Buffer.
Removed std::scoped_lock in _spill and _bulk_spill.
All steal paths, the notifier protocol, and executor shutdown are unchanged.

unittests/test_mpmc_wsq.cpp (new)

11 stress tests covering multi-producer contention, resize under contention, burst bulk_push, near-empty thrashing, mixed bulk and single push, deadlock detection, executor integration, and shutdown.

Test results

All tests pass with zero warnings under MSVC /W4 /WX /std:c++20 /O2.

Suite	Tests	Assertions	Result
`test_wsq` (regression)	45	17,721,330	pass
`test_work_stealing` (regression)	87	42,240	pass
`test_mpmc_wsq` (new stress)	11	74,472,862	pass
Total	143	~92,236,432	pass

The stress tests cover interleavings the existing regression suite cannot reach because those tests were written for single-producer semantics:

PurePushContention: 16 threads, 1 million items each, uniqueness verified after drain.
ConcurrentPushSteal: 8 producers, 8 stealers, 10 million items, completeness and uniqueness verified.
ResizeUnderContention: queue starts at capacity 8, 4 producers and 4 stealers force dozens of array doublings under concurrent access.
BurstBulkPush: single bulk_push of 1 million items with 15 simultaneous stealers, 5 rounds. Models the OpenTimer workload.
NearEmptyThrashing: 8 stealers race on a queue with one item at a time, 100,000 iterations.
MixedBulkAndSinglePush: 4 bulk_push threads and 4 single-push threads compete on the same _bottom, 5 million total items.
ResizeLockNoDeadlock: 8 threads push without stealers consuming, forcing continuous resize. Timeout fires if the spinlock deadlocks.
ExecutorIntegration: tf::Executor with 10,000 root tasks each spawning 100 sub-tasks, 5 rounds.
ShutdownDuringPushSteal: executor destroyed while tasks are being pushed and stolen. Verifies clean shutdown.

Every push to an overflow buffer previously acquired a std::mutex, incurring kernel transitions and potential thread parking on every task spill. The steal path was already lock-free via CAS on _top. This commit closes that asymmetry. Changes ------- taskflow/core/wsq.hpp - Add std::atomic_flag _resize_lock to UnboundedWSQ. - push(T): acquire spinlock, write item, _bottom.store(release), release spinlock. No CAS on _bottom. No post-write check. - bulk_push(I, N): same pattern, resize loop inside the lock, write N items, single _bottom.store(release). - steal(), pop(), steal_with_feedback(): unchanged from original Chase-Lev algorithm. - Remove Array::clear(), null-sentinel machinery, and spin-wait logic that was added during lock-free experiments. taskflow/core/executor.hpp - Remove std::mutex from Buffer struct. - Remove std::scoped_lock in _spill and _bulk_spill. - All steal paths, notifier protocol, and shutdown unchanged. unittests/test_mpmc_wsq.cpp (new) - 11 stress tests, 74.5 million assertions covering: pure push contention, concurrent push+steal, resize under contention, burst bulk_push (OpenTimer scenario), near-empty thrashing, mixed bulk+single push, deadlock detection, executor integration, and shutdown correctness. unittests/CMakeLists.txt - Register test_mpmc_wsq target. Design ------ Items are written to the array BEFORE _bottom is advanced. A stealer that loads _bottom with acquire synchronizes with the pusher's _bottom.store(release) and sees the item via the C++ release-sequence rule. No claimed-but-not-written window exists, so no spin-wait and no sentinel are needed in steal(). The spinlock is held for one array write plus one atomic store (nanoseconds). It only contends during resize, which occurs O(log N) times over the queue lifetime. Uncontended test_and_set on a warm cache line costs ~3 ns on modern x86. Why fully lock-free push fails with dynamic resize -------------------------------------------------- Three approaches were implemented and stress-tested before this design: 1. CAS on _bottom + re-read _array in steal: pusher writes old array, resize installs new array, stealer reads new array, item invisible. 2. Resize spins on null slots before copying: pusher re-reads _array, writes to new array, resize spins on old array forever. 3. Stable-write loop (write, check _array, rewrite): a second resize between the check and rewrite creates a third array with neither write. Root cause: resize and multi-producer push share the array pointer. Any read-then-write through that pointer can be invalidated by a concurrent resize. Mutual exclusion on the write path is structurally required. The steal path has no such constraint and remains lock-free. Test results ------------ 143 tests, ~92 million assertions, zero warnings. MSVC /W4 /WX /std:c++20 /O2 on Windows 11.

__yield() is MSVC-only. GCC/Clang on arm64 (macOS CI) requires __asm__ volatile("yield"). Split the ARM64 branch: _M_ARM64 -> __yield() (MSVC) __aarch64__ -> asm volatile("yield") (GCC/Clang)

_mm_pause requires <immintrin.h> on GCC/Clang. MSVC gets it from <intrin.h> which was already guarded. Add the corresponding include for GCC/Clang x86/x86_64 builds (Linux CI).

Mattbusel added 3 commits March 15, 2026 04:20

fix ARM64 pause intrinsic for GCC/Clang

c41ec23

__yield() is MSVC-only. GCC/Clang on arm64 (macOS CI) requires __asm__ volatile("yield"). Split the ARM64 branch: _M_ARM64 -> __yield() (MSVC) __aarch64__ -> asm volatile("yield") (GCC/Clang)

fix x86 pause intrinsic header for GCC/Clang

a09b018

_mm_pause requires <immintrin.h> on GCC/Clang. MSVC gets it from <intrin.h> which was already guarded. Add the corresponding include for GCC/Clang x86/x86_64 builds (Linux CI).

Mattbusel force-pushed the mpmc-unbounded-wsq-spinlock branch from 0f60453 to a09b018 Compare March 15, 2026 08:20

Mattbusel closed this Mar 15, 2026

Mattbusel mentioned this pull request Mar 22, 2026

Fix counter underflow race between semaphore fast-path CAS and slow-path fetch_sub #769

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Replace std::mutex on overflow buffers with atomic_flag spinlock#771

Replace std::mutex on overflow buffers with atomic_flag spinlock#771
Mattbusel wants to merge 3 commits into
taskflow:masterfrom
Mattbusel:mpmc-unbounded-wsq-spinlock

Mattbusel commented Mar 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Mattbusel commented Mar 15, 2026

Problem

Solution

Key design insight

Why fully lock-free push is not viable with dynamic resize

Spinlock contention profile

Changes

Test results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant