Skip to content

perf(wsq): remove UnboundedWSQ::pop() and relax steal() ordering;#796

Open
wicyn wants to merge 2 commits into
taskflow:masterfrom
wicyn:master
Open

perf(wsq): remove UnboundedWSQ::pop() and relax steal() ordering;#796
wicyn wants to merge 2 commits into
taskflow:masterfrom
wicyn:master

Conversation

@wicyn

@wicyn wicyn commented Jun 15, 2026

Copy link
Copy Markdown
Contributor
       downgrade BoundedWSQ::steal() top load and pop() CAS to relaxed/acq_rel

UnboundedWSQ is used only as the executor's overflow region: the owner only pushes and thieves only steal — the owner never pops. With pop() gone as dead code, the cross-variable Dekker mutual-exclusion structure between top and bottom disappears, and that structure was the sole reason steal() needed a seq_cst fence and a seq_cst CAS. The downgrades follow from this:

UnboundedWSQ::steal()

  • Remove the seq_cst fence: it existed only to pair with pop()'s fence in the single total order S and arbitrate the "last element" race. With no pop(), the fence has nothing to pair with.
  • CAS success order seq_cst -> acq_rel: the only remaining contention is thief-vs-thief on the single variable top. CAS RMW atomicity guarantees exactly one thief wins, and release/acquire is sufficient to chain top's modification order; no cross-variable total order is required.
  • _top.load acquire -> relaxed: top increases monotonically, so relaxed can at worst read a smaller, stale value, making the t<b check conservative (one missed steal, never out of bounds); the CAS then re-validates against the latest top and returns empty on failure. Slot-data visibility is carried by _bottom.load(acquire) paired with push()'s release, independent of how top is read.

BoundedWSQ::steal()

  • Only _top.load acquire -> relaxed. BoundedWSQ still has pop(), so the Dekker structure remains and both the seq_cst fence and the seq_cst CAS are kept unchanged. Relaxing the top load is safe: the seq_cst fence immediately following it is a full barrier that already prevents the top load from being reordered with the subsequent bottom load and slot load — the LoadLoad reordering that acquire was meant to block is already covered by the fence. Slot visibility likewise depends on bottom's acquire chain, not on how top is read.

BoundedWSQ::pop()

  • CAS success order seq_cst -> acq_rel. The seq_cst fence in pop() is retained, and so is steal()'s seq_cst fence and steal()'s seq_cst CAS. The argument that pop()'s CAS need not enter S: * The Dekker / SB exclusion that prevents fast-path double-take is carried entirely by the two seq_cst fences plus steal()'s seq_cst CAS anchor. pop()'s own side is cleanly SB-shaped (store bottom; seq_cst fence; load top), so its fence does the SB duty on its own — pop()'s CAS is not the anchor for either fence. * pop()'s CAS executes only on the t==b (last-element) path, i.e. never on the fast path. That path is single-variable contention: owner-CAS vs thief-CAS both move top from t to t+1, and RMW atomicity guarantees exactly one winner regardless of memory order. acq_rel provides the acquire needed to observe the loser/winner state and the release to publish the claim; entry into the global total order S is not required. Net effect: pop() keeps its seq_cst fence (load-bearing) but drops one seq_cst RMW on the cold last-element path.

Tests (wsq_test.cpp)

  • UnboundedWSQ tests converted to steal-only; all UnboundedWSQ::pop() calls removed. BoundedWSQ tests are untouched (BoundedWSQ still has pop()).
  • UnboundedWSQ.Resize: drain loop pop() -> steal().
  • UnboundedWSQ.Owner / ValueType: dropped the LIFO pop() verification; the pre-existing steal() FIFO check now covers the single-thread path. Removed the now-duplicated second push round.
  • UnboundedWSQ.*Consumers / *.BulkPush: dropped the "master/bulk push and pop" LIFO blocks; in the concurrent phase the owner only pushes and waits (while(consumed != N) yield()) for thieves to drain, since a steal-only owner cannot pop. items is now collected solely from stolens, and the trailing pop()==nullptr assertion is removed (steal()==nullptr kept).

           downgrade BoundedWSQ::steal() top load and pop() CAS to relaxed/acq_rel

UnboundedWSQ is used only as the executor's overflow region: the owner only
pushes and thieves only steal — the owner never pops. With pop() gone as dead
code, the cross-variable Dekker mutual-exclusion structure between top and
bottom disappears, and that structure was the sole reason steal() needed a
seq_cst fence and a seq_cst CAS. The downgrades follow from this:

UnboundedWSQ::steal()
  - Remove the seq_cst fence: it existed only to pair with pop()'s fence in the
    single total order S and arbitrate the "last element" race. With no pop(),
    the fence has nothing to pair with.
  - CAS success order seq_cst -> acq_rel: the only remaining contention is
    thief-vs-thief on the single variable top. CAS RMW atomicity guarantees
    exactly one thief wins, and release/acquire is sufficient to chain top's
    modification order; no cross-variable total order is required.
  - _top.load acquire -> relaxed: top increases monotonically, so relaxed can
    at worst read a smaller, stale value, making the t<b check conservative
    (one missed steal, never out of bounds); the CAS then re-validates against
    the latest top and returns empty on failure. Slot-data visibility is
    carried by _bottom.load(acquire) paired with push()'s release, independent
    of how top is read.

BoundedWSQ::steal()
  - Only _top.load acquire -> relaxed. BoundedWSQ still has pop(), so the Dekker
    structure remains and both the seq_cst fence and the seq_cst CAS are kept
    unchanged. Relaxing the top load is safe: the seq_cst fence immediately
    following it is a full barrier that already prevents the top load from being
    reordered with the subsequent bottom load and slot load — the LoadLoad
    reordering that acquire was meant to block is already covered by the fence.
    Slot visibility likewise depends on bottom's acquire chain, not on how top
    is read.

BoundedWSQ::pop()
  - CAS success order seq_cst -> acq_rel. The seq_cst fence in pop() is retained,
    and so is steal()'s seq_cst fence and steal()'s seq_cst CAS. The argument
    that pop()'s CAS need not enter S:
      * The Dekker / SB exclusion that prevents fast-path double-take is carried
        entirely by the two seq_cst fences plus steal()'s seq_cst CAS anchor.
        pop()'s own side is cleanly SB-shaped (store bottom; seq_cst fence;
        load top), so its fence does the SB duty on its own — pop()'s CAS is not
        the anchor for either fence.
      * pop()'s CAS executes only on the t==b (last-element) path, i.e. never on
        the fast path. That path is single-variable contention: owner-CAS vs
        thief-CAS both move top from t to t+1, and RMW atomicity guarantees
        exactly one winner regardless of memory order. acq_rel provides the
        acquire needed to observe the loser/winner state and the release to
        publish the claim; entry into the global total order S is not required.
    Net effect: pop() keeps its seq_cst fence (load-bearing) but drops one
    seq_cst RMW on the cold last-element path.

Tests (wsq_test.cpp)
  - UnboundedWSQ tests converted to steal-only; all UnboundedWSQ::pop() calls
    removed. BoundedWSQ tests are untouched (BoundedWSQ still has pop()).
  - UnboundedWSQ.Resize: drain loop pop() -> steal().
  - UnboundedWSQ.Owner / ValueType: dropped the LIFO pop() verification; the
    pre-existing steal() FIFO check now covers the single-thread path. Removed
    the now-duplicated second push round.
  - UnboundedWSQ.*Consumers / *.BulkPush: dropped the "master/bulk push and pop"
    LIFO blocks; in the concurrent phase the owner only pushes and waits
    (while(consumed != N) yield()) for thieves to drain, since a steal-only
    owner cannot pop. `items` is now collected solely from `stolens`, and the
    trailing pop()==nullptr assertion is removed (steal()==nullptr kept).
@tsung-wei-huang

Copy link
Copy Markdown
Member

@wicyn thank you for the contribution! Yes, this is a good insight. However, I suggest we don't modify directly UnboundedWSQ but make this contribution a separate class specifically designed for Taskflow's use case, say OverflowWSQ. This is because UnboundedWSQ may have other use cases (maybe not now) that need pop. We don't even need to modify existing wsq tests either but add new for OverflowWSQ. What do you think?

@wicyn wicyn force-pushed the master branch 2 times, most recently from d5ddd24 to af2908a Compare June 16, 2026 03:09
…tor overflow

OverflowWSQ
  - Steal-only: owner has push()/bulk_push() but no pop(). The executor's
    overflow region never pops from the owner end.
  - With no pop(), the cross-variable Dekker mutual-exclusion between top and
    bottom disappears, which is the sole reason a seq_cst fence + seq_cst CAS
    were required in steal(). steal() is therefore lighter than UnboundedWSQ's:
      * no seq_cst fence (it only existed to pair with pop()'s fence);
      * CAS success order acq_rel instead of seq_cst — remaining contention is
        thief-vs-thief on the single variable top, where RMW atomicity picks
        the unique winner and release/acquire suffices to chain top's
        modification order; no entry into the global total order S is needed;
      * _top.load relaxed (top is monotonic; a stale-small read only makes the
        t<b check conservative, never out of bounds; the CAS re-validates).
    Slot-data visibility is carried by _bottom.load(acquire) paired with
    push()'s release, independent of how top is read.
  - Buffer lifetime identical to UnboundedWSQ: resize() parks the old Array in
    _garbage, released only at destruction — a thief holding a stale _array
    pointer still reads live, value-correct memory. No use-after-free.

UnboundedWSQ
  - Untouched. pop() and the original seq_cst orderings are preserved for any
    current or future use case that needs owner-side LIFO pop.

Tests
  - UnboundedWSQ / BoundedWSQ tests untouched.
  - Add dedicated OverflowWSQ tests.
@wicyn

wicyn commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

@wicyn thank you for the contribution! Yes, this is a good insight. However, I suggest we don't modify directly UnboundedWSQ but make this contribution a separate class specifically designed for Taskflow's use case, say OverflowWSQ. This is because UnboundedWSQ may have other use cases (maybe not now) that need pop. We don't even need to modify existing wsq tests either but add new for OverflowWSQ. What do you think?

Hi @tsung-wei-huang, I have applied the requested changes to the PR. Please let me know if you have any further feedback or if anything remains unclear. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants