Replace std::mutex on overflow buffers with atomic_flag spinlock#771
Closed
Mattbusel wants to merge 3 commits into
Closed
Replace std::mutex on overflow buffers with atomic_flag spinlock#771Mattbusel wants to merge 3 commits into
Mattbusel wants to merge 3 commits into
Conversation
Every push to an overflow buffer previously acquired a std::mutex,
incurring kernel transitions and potential thread parking on every
task spill. The steal path was already lock-free via CAS on _top.
This commit closes that asymmetry.
Changes
-------
taskflow/core/wsq.hpp
- Add std::atomic_flag _resize_lock to UnboundedWSQ.
- push(T): acquire spinlock, write item, _bottom.store(release),
release spinlock. No CAS on _bottom. No post-write check.
- bulk_push(I, N): same pattern, resize loop inside the lock,
write N items, single _bottom.store(release).
- steal(), pop(), steal_with_feedback(): unchanged from original
Chase-Lev algorithm.
- Remove Array::clear(), null-sentinel machinery, and spin-wait
logic that was added during lock-free experiments.
taskflow/core/executor.hpp
- Remove std::mutex from Buffer struct.
- Remove std::scoped_lock in _spill and _bulk_spill.
- All steal paths, notifier protocol, and shutdown unchanged.
unittests/test_mpmc_wsq.cpp (new)
- 11 stress tests, 74.5 million assertions covering:
pure push contention, concurrent push+steal, resize under
contention, burst bulk_push (OpenTimer scenario), near-empty
thrashing, mixed bulk+single push, deadlock detection,
executor integration, and shutdown correctness.
unittests/CMakeLists.txt
- Register test_mpmc_wsq target.
Design
------
Items are written to the array BEFORE _bottom is advanced. A stealer
that loads _bottom with acquire synchronizes with the pusher's
_bottom.store(release) and sees the item via the C++ release-sequence
rule. No claimed-but-not-written window exists, so no spin-wait and
no sentinel are needed in steal().
The spinlock is held for one array write plus one atomic store
(nanoseconds). It only contends during resize, which occurs O(log N)
times over the queue lifetime. Uncontended test_and_set on a warm
cache line costs ~3 ns on modern x86.
Why fully lock-free push fails with dynamic resize
--------------------------------------------------
Three approaches were implemented and stress-tested before this design:
1. CAS on _bottom + re-read _array in steal: pusher writes old array,
resize installs new array, stealer reads new array, item invisible.
2. Resize spins on null slots before copying: pusher re-reads _array,
writes to new array, resize spins on old array forever.
3. Stable-write loop (write, check _array, rewrite): a second resize
between the check and rewrite creates a third array with neither
write.
Root cause: resize and multi-producer push share the array pointer.
Any read-then-write through that pointer can be invalidated by a
concurrent resize. Mutual exclusion on the write path is structurally
required. The steal path has no such constraint and remains lock-free.
Test results
------------
143 tests, ~92 million assertions, zero warnings.
MSVC /W4 /WX /std:c++20 /O2 on Windows 11.
__yield() is MSVC-only. GCC/Clang on arm64 (macOS CI) requires
__asm__ volatile("yield"). Split the ARM64 branch:
_M_ARM64 -> __yield() (MSVC)
__aarch64__ -> asm volatile("yield") (GCC/Clang)
_mm_pause requires <immintrin.h> on GCC/Clang. MSVC gets it from <intrin.h> which was already guarded. Add the corresponding include for GCC/Clang x86/x86_64 builds (Linux CI).
0f60453 to
a09b018
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #770 (requested: make
UnboundedWSQsafe for concurrent push without external locking).Problem
Every
pushto an overflow buffer acquires astd::mutex, incurring afutexsyscall on Linux orSRWLockcontention on Windows, plus potential thread parking. The steal path is already lock-free via CAS on_top. Every task spill pays kernel overhead that every steal does not.Solution
Replace the external
std::mutexwith a per-queuestd::atomic_flagspinlock (_resize_lock) insideUnboundedWSQ. Remove the mutex fromBufferentirely.pushacquires the flag withtest_and_set(acquire)and_mm_pausespinning, writes the item to the array slot, advances_bottomwith areleasestore, then clears the flag. Total time under lock: one array slot write plus one atomic store, measured in nanoseconds on an uncontended cache line.resizeacquires the same flag before allocating and copying. Since both operations hold the same flag, resize can never observe a slot that a pusher has claimed but not yet written.steal,pop, andsteal_with_feedbackare the original Chase-Lev algorithm, completely unchanged.Key design insight
Items are written to the array before
_bottomis advanced. A stealer that loads_bottomwithacquiresynchronizes with the pusher's_bottom.store(release)via the C++ release-sequence rule. By the time any stealer sees the new_bottomvalue, the item is already visible. No spin-wait is needed. No null sentinel is needed. No slot clearing after consumption is needed.Why fully lock-free push is not viable with dynamic resize
Three approaches were implemented and stress-tested before reaching this design. Each fails due to a structural race between the array pointer and concurrent writes.
Attempt 1: CAS on
_bottom, re-read_arrayin steal.The pusher CAS-claims slot
b, then loads_arrayand writes to it. A resize can install a new array after the CAS but before the write. The resize copies null for slotb(nothing written yet) into the new array. The pusher writes to the old array. The stealer loads the new array and sees null at slotbforever.Attempt 2: Resize spins on null slots before copying.
Resize spin-waits for each null slot to become non-null before copying it. But after resize publishes the new array, the pusher re-reads
_arrayand now holds a pointer to the new array, writing its item there. Resize is spinning on the old array waiting for a write that went to the new one. Deadlock.Attempt 3: Stable-write loop.
After writing, the pusher checks if
_arraychanged and rewrites if so. A second resize can occur between the check and the rewrite, installing a third array that copied null from the second. The stealer reads the third array and sees null.Root cause: resize and multi-producer push share the array pointer. Any operation that reads the pointer and then writes based on that read can be invalidated by a resize between the read and the write. Mutual exclusion on the write path is structurally required. The steal path has no such constraint and remains lock-free.
Spinlock contention profile
The spinlock contends in exactly two situations:
The steal path never touches the spinlock.
Changes
taskflow/core/wsq.hppstd::atomic_flag _resize_lockmember.push(T): acquire spinlock, resize if needed, write item,_bottom.store(release), release.bulk_push(I, N): acquire spinlock, resize loop until N items fit, write N items,_bottom.store(release), release.steal(),pop(),steal_with_feedback(): unchanged from original Chase-Lev.Array::clear()and null-sentinel machinery.taskflow/core/executor.hppstd::mutex mutexfromBuffer.std::scoped_lockin_spilland_bulk_spill.unittests/test_mpmc_wsq.cpp(new)bulk_push, near-empty thrashing, mixed bulk and single push, deadlock detection, executor integration, and shutdown.Test results
All tests pass with zero warnings under MSVC
/W4 /WX /std:c++20 /O2.test_wsq(regression)test_work_stealing(regression)test_mpmc_wsq(new stress)The stress tests cover interleavings the existing regression suite cannot reach because those tests were written for single-producer semantics:
PurePushContention: 16 threads, 1 million items each, uniqueness verified after drain.ConcurrentPushSteal: 8 producers, 8 stealers, 10 million items, completeness and uniqueness verified.ResizeUnderContention: queue starts at capacity 8, 4 producers and 4 stealers force dozens of array doublings under concurrent access.BurstBulkPush: singlebulk_pushof 1 million items with 15 simultaneous stealers, 5 rounds. Models the OpenTimer workload.NearEmptyThrashing: 8 stealers race on a queue with one item at a time, 100,000 iterations.MixedBulkAndSinglePush: 4bulk_pushthreads and 4 single-pushthreads compete on the same_bottom, 5 million total items.ResizeLockNoDeadlock: 8 threads push without stealers consuming, forcing continuous resize. Timeout fires if the spinlock deadlocks.ExecutorIntegration:tf::Executorwith 10,000 root tasks each spawning 100 sub-tasks, 5 rounds.ShutdownDuringPushSteal: executor destroyed while tasks are being pushed and stolen. Verifies clean shutdown.