perf(wsq): remove UnboundedWSQ::pop() and relax steal() ordering;#796
perf(wsq): remove UnboundedWSQ::pop() and relax steal() ordering;#796wicyn wants to merge 2 commits into
Conversation
downgrade BoundedWSQ::steal() top load and pop() CAS to relaxed/acq_rel
UnboundedWSQ is used only as the executor's overflow region: the owner only
pushes and thieves only steal — the owner never pops. With pop() gone as dead
code, the cross-variable Dekker mutual-exclusion structure between top and
bottom disappears, and that structure was the sole reason steal() needed a
seq_cst fence and a seq_cst CAS. The downgrades follow from this:
UnboundedWSQ::steal()
- Remove the seq_cst fence: it existed only to pair with pop()'s fence in the
single total order S and arbitrate the "last element" race. With no pop(),
the fence has nothing to pair with.
- CAS success order seq_cst -> acq_rel: the only remaining contention is
thief-vs-thief on the single variable top. CAS RMW atomicity guarantees
exactly one thief wins, and release/acquire is sufficient to chain top's
modification order; no cross-variable total order is required.
- _top.load acquire -> relaxed: top increases monotonically, so relaxed can
at worst read a smaller, stale value, making the t<b check conservative
(one missed steal, never out of bounds); the CAS then re-validates against
the latest top and returns empty on failure. Slot-data visibility is
carried by _bottom.load(acquire) paired with push()'s release, independent
of how top is read.
BoundedWSQ::steal()
- Only _top.load acquire -> relaxed. BoundedWSQ still has pop(), so the Dekker
structure remains and both the seq_cst fence and the seq_cst CAS are kept
unchanged. Relaxing the top load is safe: the seq_cst fence immediately
following it is a full barrier that already prevents the top load from being
reordered with the subsequent bottom load and slot load — the LoadLoad
reordering that acquire was meant to block is already covered by the fence.
Slot visibility likewise depends on bottom's acquire chain, not on how top
is read.
BoundedWSQ::pop()
- CAS success order seq_cst -> acq_rel. The seq_cst fence in pop() is retained,
and so is steal()'s seq_cst fence and steal()'s seq_cst CAS. The argument
that pop()'s CAS need not enter S:
* The Dekker / SB exclusion that prevents fast-path double-take is carried
entirely by the two seq_cst fences plus steal()'s seq_cst CAS anchor.
pop()'s own side is cleanly SB-shaped (store bottom; seq_cst fence;
load top), so its fence does the SB duty on its own — pop()'s CAS is not
the anchor for either fence.
* pop()'s CAS executes only on the t==b (last-element) path, i.e. never on
the fast path. That path is single-variable contention: owner-CAS vs
thief-CAS both move top from t to t+1, and RMW atomicity guarantees
exactly one winner regardless of memory order. acq_rel provides the
acquire needed to observe the loser/winner state and the release to
publish the claim; entry into the global total order S is not required.
Net effect: pop() keeps its seq_cst fence (load-bearing) but drops one
seq_cst RMW on the cold last-element path.
Tests (wsq_test.cpp)
- UnboundedWSQ tests converted to steal-only; all UnboundedWSQ::pop() calls
removed. BoundedWSQ tests are untouched (BoundedWSQ still has pop()).
- UnboundedWSQ.Resize: drain loop pop() -> steal().
- UnboundedWSQ.Owner / ValueType: dropped the LIFO pop() verification; the
pre-existing steal() FIFO check now covers the single-thread path. Removed
the now-duplicated second push round.
- UnboundedWSQ.*Consumers / *.BulkPush: dropped the "master/bulk push and pop"
LIFO blocks; in the concurrent phase the owner only pushes and waits
(while(consumed != N) yield()) for thieves to drain, since a steal-only
owner cannot pop. `items` is now collected solely from `stolens`, and the
trailing pop()==nullptr assertion is removed (steal()==nullptr kept).
|
@wicyn thank you for the contribution! Yes, this is a good insight. However, I suggest we don't modify directly |
d5ddd24 to
af2908a
Compare
…tor overflow
OverflowWSQ
- Steal-only: owner has push()/bulk_push() but no pop(). The executor's
overflow region never pops from the owner end.
- With no pop(), the cross-variable Dekker mutual-exclusion between top and
bottom disappears, which is the sole reason a seq_cst fence + seq_cst CAS
were required in steal(). steal() is therefore lighter than UnboundedWSQ's:
* no seq_cst fence (it only existed to pair with pop()'s fence);
* CAS success order acq_rel instead of seq_cst — remaining contention is
thief-vs-thief on the single variable top, where RMW atomicity picks
the unique winner and release/acquire suffices to chain top's
modification order; no entry into the global total order S is needed;
* _top.load relaxed (top is monotonic; a stale-small read only makes the
t<b check conservative, never out of bounds; the CAS re-validates).
Slot-data visibility is carried by _bottom.load(acquire) paired with
push()'s release, independent of how top is read.
- Buffer lifetime identical to UnboundedWSQ: resize() parks the old Array in
_garbage, released only at destruction — a thief holding a stale _array
pointer still reads live, value-correct memory. No use-after-free.
UnboundedWSQ
- Untouched. pop() and the original seq_cst orderings are preserved for any
current or future use case that needs owner-side LIFO pop.
Tests
- UnboundedWSQ / BoundedWSQ tests untouched.
- Add dedicated OverflowWSQ tests.
Hi @tsung-wei-huang, I have applied the requested changes to the PR. Please let me know if you have any further feedback or if anything remains unclear. Thanks! |
UnboundedWSQ is used only as the executor's overflow region: the owner only pushes and thieves only steal — the owner never pops. With pop() gone as dead code, the cross-variable Dekker mutual-exclusion structure between top and bottom disappears, and that structure was the sole reason steal() needed a seq_cst fence and a seq_cst CAS. The downgrades follow from this:
UnboundedWSQ::steal()
BoundedWSQ::steal()
BoundedWSQ::pop()
Tests (wsq_test.cpp)
itemsis now collected solely fromstolens, and the trailing pop()==nullptr assertion is removed (steal()==nullptr kept).