Skip to content

Replace std::mutex on overflow buffers with atomic_flag spinlock#771

Closed
Mattbusel wants to merge 3 commits into
taskflow:masterfrom
Mattbusel:mpmc-unbounded-wsq-spinlock
Closed

Replace std::mutex on overflow buffers with atomic_flag spinlock#771
Mattbusel wants to merge 3 commits into
taskflow:masterfrom
Mattbusel:mpmc-unbounded-wsq-spinlock

Conversation

@Mattbusel

Copy link
Copy Markdown
Contributor

Closes #770 (requested: make UnboundedWSQ safe for concurrent push without external locking).

Problem

Every push to an overflow buffer acquires a std::mutex, incurring a futex syscall on Linux or SRWLock contention on Windows, plus potential thread parking. The steal path is already lock-free via CAS on _top. Every task spill pays kernel overhead that every steal does not.

// Before: kernel-weight lock on every push
struct Buffer {
  std::mutex mutex;
  UnboundedWSQ<Node*> queue;
};

Solution

Replace the external std::mutex with a per-queue std::atomic_flag spinlock (_resize_lock) inside UnboundedWSQ. Remove the mutex from Buffer entirely.

push acquires the flag with test_and_set(acquire) and _mm_pause spinning, writes the item to the array slot, advances _bottom with a release store, then clears the flag. Total time under lock: one array slot write plus one atomic store, measured in nanoseconds on an uncontended cache line.

resize acquires the same flag before allocating and copying. Since both operations hold the same flag, resize can never observe a slot that a pusher has claimed but not yet written.

steal, pop, and steal_with_feedback are the original Chase-Lev algorithm, completely unchanged.

// After: spinlock held for nanoseconds, steal path untouched
struct Buffer {
  UnboundedWSQ<Node*> queue;  // push internally serialized via _resize_lock
};

Key design insight

Items are written to the array before _bottom is advanced. A stealer that loads _bottom with acquire synchronizes with the pusher's _bottom.store(release) via the C++ release-sequence rule. By the time any stealer sees the new _bottom value, the item is already visible. No spin-wait is needed. No null sentinel is needed. No slot clearing after consumption is needed.

Why fully lock-free push is not viable with dynamic resize

Three approaches were implemented and stress-tested before reaching this design. Each fails due to a structural race between the array pointer and concurrent writes.

Attempt 1: CAS on _bottom, re-read _array in steal.
The pusher CAS-claims slot b, then loads _array and writes to it. A resize can install a new array after the CAS but before the write. The resize copies null for slot b (nothing written yet) into the new array. The pusher writes to the old array. The stealer loads the new array and sees null at slot b forever.

Attempt 2: Resize spins on null slots before copying.
Resize spin-waits for each null slot to become non-null before copying it. But after resize publishes the new array, the pusher re-reads _array and now holds a pointer to the new array, writing its item there. Resize is spinning on the old array waiting for a write that went to the new one. Deadlock.

Attempt 3: Stable-write loop.
After writing, the pusher checks if _array changed and rewrites if so. A second resize can occur between the check and the rewrite, installing a third array that copied null from the second. The stealer reads the third array and sees null.

Root cause: resize and multi-producer push share the array pointer. Any operation that reads the pointer and then writes based on that read can be invalidated by a resize between the read and the write. Mutual exclusion on the write path is structurally required. The steal path has no such constraint and remains lock-free.

Spinlock contention profile

The spinlock contends in exactly two situations:

  1. Two pushers arrive simultaneously. One spins for nanoseconds while the other completes its write. This is the hot-path case.
  2. A pusher and a resize arrive simultaneously. This occurs at most O(log N) times over the queue lifetime, where N is the total items ever pushed.

The steal path never touches the spinlock.

Changes

taskflow/core/wsq.hpp

  • Added std::atomic_flag _resize_lock member.
  • push(T): acquire spinlock, resize if needed, write item, _bottom.store(release), release.
  • bulk_push(I, N): acquire spinlock, resize loop until N items fit, write N items, _bottom.store(release), release.
  • steal(), pop(), steal_with_feedback(): unchanged from original Chase-Lev.
  • Removed Array::clear() and null-sentinel machinery.

taskflow/core/executor.hpp

  • Removed std::mutex mutex from Buffer.
  • Removed std::scoped_lock in _spill and _bulk_spill.
  • All steal paths, the notifier protocol, and executor shutdown are unchanged.

unittests/test_mpmc_wsq.cpp (new)

  • 11 stress tests covering multi-producer contention, resize under contention, burst bulk_push, near-empty thrashing, mixed bulk and single push, deadlock detection, executor integration, and shutdown.

Test results

All tests pass with zero warnings under MSVC /W4 /WX /std:c++20 /O2.

Suite Tests Assertions Result
test_wsq (regression) 45 17,721,330 pass
test_work_stealing (regression) 87 42,240 pass
test_mpmc_wsq (new stress) 11 74,472,862 pass
Total 143 ~92,236,432 pass

The stress tests cover interleavings the existing regression suite cannot reach because those tests were written for single-producer semantics:

  • PurePushContention: 16 threads, 1 million items each, uniqueness verified after drain.
  • ConcurrentPushSteal: 8 producers, 8 stealers, 10 million items, completeness and uniqueness verified.
  • ResizeUnderContention: queue starts at capacity 8, 4 producers and 4 stealers force dozens of array doublings under concurrent access.
  • BurstBulkPush: single bulk_push of 1 million items with 15 simultaneous stealers, 5 rounds. Models the OpenTimer workload.
  • NearEmptyThrashing: 8 stealers race on a queue with one item at a time, 100,000 iterations.
  • MixedBulkAndSinglePush: 4 bulk_push threads and 4 single-push threads compete on the same _bottom, 5 million total items.
  • ResizeLockNoDeadlock: 8 threads push without stealers consuming, forcing continuous resize. Timeout fires if the spinlock deadlocks.
  • ExecutorIntegration: tf::Executor with 10,000 root tasks each spawning 100 sub-tasks, 5 rounds.
  • ShutdownDuringPushSteal: executor destroyed while tasks are being pushed and stolen. Verifies clean shutdown.

Every push to an overflow buffer previously acquired a std::mutex,
incurring kernel transitions and potential thread parking on every
task spill. The steal path was already lock-free via CAS on _top.
This commit closes that asymmetry.

Changes
-------
taskflow/core/wsq.hpp
  - Add std::atomic_flag _resize_lock to UnboundedWSQ.
  - push(T): acquire spinlock, write item, _bottom.store(release),
    release spinlock. No CAS on _bottom. No post-write check.
  - bulk_push(I, N): same pattern, resize loop inside the lock,
    write N items, single _bottom.store(release).
  - steal(), pop(), steal_with_feedback(): unchanged from original
    Chase-Lev algorithm.
  - Remove Array::clear(), null-sentinel machinery, and spin-wait
    logic that was added during lock-free experiments.

taskflow/core/executor.hpp
  - Remove std::mutex from Buffer struct.
  - Remove std::scoped_lock in _spill and _bulk_spill.
  - All steal paths, notifier protocol, and shutdown unchanged.

unittests/test_mpmc_wsq.cpp (new)
  - 11 stress tests, 74.5 million assertions covering:
    pure push contention, concurrent push+steal, resize under
    contention, burst bulk_push (OpenTimer scenario), near-empty
    thrashing, mixed bulk+single push, deadlock detection,
    executor integration, and shutdown correctness.

unittests/CMakeLists.txt
  - Register test_mpmc_wsq target.

Design
------
Items are written to the array BEFORE _bottom is advanced. A stealer
that loads _bottom with acquire synchronizes with the pusher's
_bottom.store(release) and sees the item via the C++ release-sequence
rule. No claimed-but-not-written window exists, so no spin-wait and
no sentinel are needed in steal().

The spinlock is held for one array write plus one atomic store
(nanoseconds). It only contends during resize, which occurs O(log N)
times over the queue lifetime. Uncontended test_and_set on a warm
cache line costs ~3 ns on modern x86.

Why fully lock-free push fails with dynamic resize
--------------------------------------------------
Three approaches were implemented and stress-tested before this design:

1. CAS on _bottom + re-read _array in steal: pusher writes old array,
   resize installs new array, stealer reads new array, item invisible.

2. Resize spins on null slots before copying: pusher re-reads _array,
   writes to new array, resize spins on old array forever.

3. Stable-write loop (write, check _array, rewrite): a second resize
   between the check and rewrite creates a third array with neither
   write.

Root cause: resize and multi-producer push share the array pointer.
Any read-then-write through that pointer can be invalidated by a
concurrent resize. Mutual exclusion on the write path is structurally
required. The steal path has no such constraint and remains lock-free.

Test results
------------
143 tests, ~92 million assertions, zero warnings.
MSVC /W4 /WX /std:c++20 /O2 on Windows 11.
__yield() is MSVC-only. GCC/Clang on arm64 (macOS CI) requires
__asm__ volatile("yield"). Split the ARM64 branch:
  _M_ARM64  -> __yield()        (MSVC)
  __aarch64__ -> asm volatile("yield")  (GCC/Clang)
_mm_pause requires <immintrin.h> on GCC/Clang. MSVC gets it from
<intrin.h> which was already guarded. Add the corresponding include
for GCC/Clang x86/x86_64 builds (Linux CI).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant