runtime: steal tasks from the LIFO slot #7431

Merged
Darksonn merged 26 commits into master from eliza/lifo-steal on Apr 3, 2026

Conversation

@hawkw (Member) commented Jun 27, 2025

Motivation

Worker threads in the multi-threaded runtime include a per-worker LIFO
slot which stores the last task notified by another task running on that
worker. This allows the last-notified task to be polled first when the
currently running task completes, decreasing latency in message-passing
"ping-pong" scenarios.

However --- as described in #4941 --- there's an issue with this that
can cause severe problems in some situations: the task in the LIFO slot
cannot be stolen by a work-stealing worker thread. This means that if a
task notifies another task and then goes CPU bound for a long period of
time without yielding, the notified task will never be able to execute
until the task that notified it can yield. This can result in a very
severe latency bubble in some scenarios. See, for instance, #4323,
#6954, oxidecomputer/omicron#8334, etc.
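
As a minimal sketch (illustrative only, not the regression test added on this branch), the pathological pattern looks like this: one task wakes another task, placing it in the current worker's LIFO slot, and then goes CPU-bound without yielding.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use tokio::sync::oneshot;

#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
async fn main() {
    let (tx, rx) = oneshot::channel();
    let (done_tx, done_rx) = oneshot::channel();
    let stop = Arc::new(AtomicBool::new(false));
    let stop2 = stop.clone();

    // The task that will be woken into the LIFO slot.
    tokio::spawn(async move {
        rx.await.unwrap();
        done_tx.send(()).unwrap();
    });

    // Give the receiver a chance to register its waker first.
    tokio::task::yield_now().await;

    tokio::spawn(async move {
        // Waking another task from inside a running task places the
        // woken task in this worker's LIFO slot...
        tx.send(()).unwrap();
        // ...and then this CPU-bound loop never yields. Before this
        // branch, the LIFO task could not be stolen, so `done_rx`
        // below could wait indefinitely.
        while !stop2.load(Ordering::Relaxed) {
            std::hint::spin_loop();
        }
    });

    done_rx.await.unwrap();
    stop.store(true, Ordering::Relaxed); // let the spinner exit cleanly
    println!("the notified task was stolen and ran");
}
```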

As a workaround, PR #4936 added an unstable runtime::Builder option
to disable the LIFO slot. However, this is a less-than-ideal
solution, as it means that applications which disable the LIFO slot due
to occasional usage patterns that cause latency bubbles when it is
enabled cannot benefit from the potential latency improvements it
offers in other usage patterns. And, it's an unstable option which the
user has to discover. In most cases, people whose programs contain usage
patterns that are pathological with regards to the LIFO slot don't know
this ahead of time: the typical narrative is that you write code that
happens to follow such a pattern, discover an unexpected latency spike
or hang in production, and then learn how to disable the LIFO slot. It
would be much nicer if the task in the LIFO slot could participate in
work-stealing like every other task.
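
For reference, the workaround from #4936 looks like this; it is only available when building with the `tokio_unstable` cfg:

```rust
fn main() {
    // The opt-out added by #4936. This only compiles with
    // RUSTFLAGS="--cfg tokio_unstable".
    let rt = tokio::runtime::Builder::new_multi_thread()
        .disable_lifo_slot()
        .enable_all()
        .build()
        .unwrap();
    rt.block_on(async { /* application code */ });
}
```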

Solution

This branch makes tasks in the LIFO slot stealable.

Broadly, I've taken the following approach:

  1. Add a new AtomicNotified type that implements an atomically
    swappable cell storing a Notified task
    (1220253), and use it to represent
    the LIFO slot instead of an Option<Notified<Arc<Handle>>>. This
    way, other workers can take a task out of the LIFO slot while
    work-stealing.

  2. Move the LIFO slot out of the worker::Core struct and into the run
    queue's Inner type (75d8116),
    making it shared state between the Local side of the queue owned by
    the worker itself and the Steal side used by remote workers to
    steal from the queue.

    There's a bunch of additional code in worker::Core for managing
    whether to actually run a task from the LIFO slot or not. I opted
    not to move any of this code into the run queue itself, as it
    depends on other bits of internal worker state. Instead, we just
    expose to the worker separate APIs for pushing/popping to/from the
    main queue and for pushing/popping to/from the LIFO slot, resulting
    in a fairly small diff to the worker's run loop.

  3. Change the work-stealing code to also steal the LIFO task
    (730a581 and
    cb27dda). This actually turned out
    to be pretty straightforward: once we've stolen a chunk of tasks
    from the targeted worker's run queue, we now also grab whatever's in
    its LIFO slot as well. If we stole a LIFO task, it's returned from
    the steal_into method in lieu of the first task in the run queue, so
    that it gets to execute first, maintaining the latency improvement for
    recently-notified tasks. This also was simpler than trying to wedge it
    into the chunk of tasks to be pushed to the stealer's queue. (See the
    sketch just after this list.)
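
To make the shape of items 1 and 3 concrete, here is a hypothetical sketch of an atomically swappable slot. The names `Header` and `AtomicSlot` below are illustrative stand-ins; the real `AtomicNotified` converts a `Notified<Arc<Handle>>` to and from a raw task-header pointer.

```rust
use std::marker::PhantomData;
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// Hypothetical stand-in for tokio's internal `task::Header`.
struct Header;

/// The rough shape of the new cell: a nullable task pointer that can be
/// swapped atomically. `PhantomData<S>` retains the scheduler type
/// parameter that converting into a raw `*mut Header` would erase.
struct AtomicSlot<S> {
    ptr: AtomicPtr<Header>,
    _scheduler: PhantomData<S>,
}

impl<S> AtomicSlot<S> {
    fn new() -> Self {
        Self {
            ptr: AtomicPtr::new(ptr::null_mut()),
            _scheduler: PhantomData,
        }
    }

    /// The owning worker pushes by swapping a task in, getting back
    /// whatever the slot previously held (null if it was empty).
    fn swap(&self, task: *mut Header) -> *mut Header {
        self.ptr.swap(task, Ordering::AcqRel)
    }

    /// Emptying the slot is a swap with null. Because this is a single
    /// atomic operation, a remote stealer can safely race the owning
    /// worker for the LIFO task.
    fn take(&self) -> *mut Header {
        self.swap(ptr::null_mut())
    }
}

fn main() {
    // Demonstration only: in the real runtime these pointers come from
    // `Notified` tasks, not from `Box::into_raw`.
    let slot: AtomicSlot<()> = AtomicSlot::new();
    let task = Box::into_raw(Box::new(Header));
    assert!(slot.swap(task).is_null());
    let stolen = slot.take();
    assert_eq!(stolen, task);
    assert!(slot.take().is_null());
    unsafe { drop(Box::from_raw(stolen)) };
}
```

On the steal path, the stealer batch-steals from the run queue as before and then `take()`s this slot; if that returns a task, it is the one handed back from steal_into to run first.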

I've made the following test changes while working on this branch:

  • Added an integration test that reproduces a scenario where a task in
    the LIFO slot is blocked from running when a task notifies it and then
    blocks indefinitely. I've confirmed that this test fails on master
    and passes on this branch.
  • Added new Loom tests for work-stealing involving the LIFO slot.
    These are in addition to the existing work-stealing Loom tests, as
    tasks notified by I/O or timers are still woken to the normal run
    queue.
  • A small change to the
    rt_unstable_metrics::worker_local_queue_depth integration test,
    which was necessary as tasks in the LIFO slot now "count" towards the
    worker's queue depth. We now have to make sure the no-op task that's
    spawned has completed before spawning the tasks we actually attempt to
    count, as it seems to sometimes end up in the LIFO slot and sometimes
    not, causing the test to flake out. (The fix is sketched below.)
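
A sketch of the shape of that fix; `worker_local_queue_depth` is the real (unstable) metric name, while the surrounding code is schematic:

```rust
#[tokio::main]
async fn main() {
    // The no-op task may or may not land in the LIFO slot, which now
    // counts toward the worker's local queue depth. Awaiting its
    // completion before spawning the tasks the test measures removes
    // the nondeterminism.
    tokio::spawn(async {}).await.unwrap();
    // ...now spawn the tasks the test actually counts, then assert on
    // the unstable worker_local_queue_depth metric...
}
```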

Fixes #4941

hawkw added 3 commits June 27, 2025 14:15
This commit changes the LIFO slot on multi-threaded runtime workers from
being a mutable `Option<Notified<Arc<Handle>>>` to a new
`AtomicNotified` type. This
type implements a cell containing a nullable task pointer which can
be swapped atomically. It's analogous to `AtomicCell` but with the extra
`PhantomData` to remember the task's scheduler type parameter, which
would otherwise be erased by the conversion into a `*mut Header`
pointer.

This change is in preparation for a subsequent change to allow
work-stealing from the LIFO slot (see: #4941).
This way, it's accessible by the stealer. Leave all the LIFO *accounting*
(i.e. deciding whether we hit the LIFO slot or not) up to the worker.

Gotta figure out whether the load of lifo presence will race...ugh.
@github-actions github-actions Bot added R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt labels Jun 27, 2025
@hawkw hawkw removed the R-loom-current-thread Run loom current-thread tests on this PR label Jun 30, 2025
@github-actions github-actions Bot added the R-loom-current-thread Run loom current-thread tests on this PR label Jun 30, 2025
This commit adds a test ensuring that if a task is notified to the LIFO
slot by another task which then blocks the worker thread forever, the
LIFO task is eventually stolen by another worker. I've confirmed that
this test fails on the `master` branch, and passes after these changes.
hawkw and others added 5 commits June 30, 2025 12:53
This test spawns a task that sometimes ends up in the LIFO slot and
sometimes doesn't. This was previously fine as the LIFO slot didn't
count for `worker_local_queue_depth`, but now it does. Thus, we have to
make sure that task no longer exists before asserting about queue depth.
@hawkw hawkw changed the title [WIP] runtime: steal from the LIFO slot runtime: steal tasks from the LIFO slot Jul 1, 2025
@hawkw hawkw self-assigned this Jul 1, 2025
@hawkw hawkw added C-enhancement Category: A PR with an enhancement or bugfix. A-tokio Area: The main tokio crate M-runtime Module: tokio/runtime labels Jul 1, 2025
@hawkw hawkw marked this pull request as ready for review July 1, 2025 19:39
@hawkw hawkw requested review from ADD-SP, Darksonn and carllerche July 1, 2025 19:39
@hawkw (Member, Author) commented Jul 1, 2025

(shoutout to @mkeeter for nerd-sniping me into actually doing this)

Comment on lines +463 to +465
// If we also grabbed the task from the LIFO slot, include that in the
// steal count as well.
dst_stats.incr_steal_count(n as u16 + lifo.is_some() as u16);
Member


For reviewers: Rust defines the cast from bool to u16.


It's also possible to use u16::from(bool), if you care about cast_lossless cleanliness.
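
Both spellings compile to the same thing; a quick illustration:

```rust
fn main() {
    let n: u16 = 3;
    let lifo = Some(());
    // `bool as u16` is well-defined (false -> 0, true -> 1)...
    assert_eq!(n + lifo.is_some() as u16, 4);
    // ...and `u16::from` expresses the same widening without an `as`
    // cast, which keeps clippy's `cast_lossless` lint quiet.
    assert_eq!(n + u16::from(lifo.is_some()), 4);
}
```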

Comment thread tokio/src/runtime/scheduler/multi_thread/worker.rs Outdated
@Darksonn Darksonn merged commit eeb55c7 into master Apr 3, 2026
94 checks passed
@Darksonn Darksonn deleted the eliza/lifo-steal branch April 3, 2026 07:12
zeenix pushed a commit to z-galaxy/busd that referenced this pull request Apr 9, 2026
Apply tokio update from #315 to test whether tokio 1.51.0's LIFO slot
stealing change (tokio-rs/tokio#7431) is what was triggering the test
flakiness that the port-conflict fix addresses.

https://claude.ai/code/session_015hha9ShFQ1MdZrkQJMWTFH
zeenix pushed a commit to z-galaxy/busd that referenced this pull request Apr 9, 2026
Fix the multi_connect test timeout introduced by the tokio 1.51.1 bump.

Under tokio 1.51's LIFO slot stealing (tokio-rs/tokio#7431), the
socket-reader task zbus spawns inside `Connection::Builder::build()` can
start running on another worker *before* `build()` returns to busd.
If it reads a pipelined `Hello` and broadcasts it before busd has had a
chance to call `MessageStream::from(peer.conn())` (which is what
activates a receiver on the unfiltered broadcast channel), the message
is silently dropped — zbus's socket reader swallows
`TrySendError::Inactive` for the generic channel. The affected client
then hangs forever waiting for its `Hello` reply.

The fix lives in zbus: `connection::Builder::build_message_stream`
activates a receiver before the socket reader task is spawned, so no
messages can be lost in that race window. See z-galaxy/zbus#1760.

On the busd side:

- `Peer::new` now returns `(Peer, Stream)` built via
  `build_message_stream`, with `Connection::from(&stream)` used to grab
  the connection for the `Peer` struct.
- `Peer::new_us` takes a `MessageStream` instead of a `Connection`, so
  the self-dial peer gets the same race-free treatment.
- `Peers::{add, add_us}` destructure the pair and drop the now-dead
  `peer.stream()` call.
- `src/bus/mod.rs` builds the self-dial `peer_stream` via
  `build_message_stream` too.

A `[patch.crates-io]` entry pins zbus to the PR commit; it will be
removed once a zbus release containing `build_message_stream` is out.
zeenix pushed a commit to z-galaxy/busd that referenced this pull request Apr 10, 2026
Under tokio 1.51's LIFO slot stealing (tokio-rs/tokio#7431), the
socket-reader task zbus spawns inside `Connection::Builder::build()` can
start running on another worker before `build()` returns to busd. If it
reads a pipelined `Hello` and broadcasts it before busd has activated a
receiver via `MessageStream::from(peer.conn())`, the message is silently
dropped — the affected client hangs forever waiting for its reply.

Use the new `Builder::build_message_stream` (z-galaxy/zbus#1760) which
activates a receiver before the socket-reader task is spawned, closing
the race window entirely.

- `Peer::new` returns `(Peer, Stream)` built via `build_message_stream`,
  with `Connection::from(&stream)` to extract the connection.
- `Peer::new_us` takes a `MessageStream` instead of a `Connection`.
- `Peers::{add, add_us}` destructure the pair; the now-unused
  `Peer::stream()` accessor is removed.
- `bus::Bus::for_address` builds the self-dial peer stream the same way.
hawkw added a commit to oxidecomputer/oxide-tokio-rt that referenced this pull request Apr 14, 2026
This updates our minimum Tokio version to [1.52.0]. This allows us to
pick up two major fixes that will change our default runtime
configuration:

- tokio-rs/tokio#8010 (released in [1.52.0]) which fixes
  oxidecomputer/omicron#9619 when its builder option is enabled,
- tokio-rs/tokio#7431 (released in [1.51.0]), which allows tasks in the
  LIFO slot to participate in work-stealing.

Subsequent commits will actually update our runtime settings after
picking up these releases.

[1.52.0]: https://github.com/tokio-rs/tokio/releases/tag/tokio-1.52.0
[1.51.0]: https://github.com/tokio-rs/tokio/releases/tag/tokio-1.51.0
hawkw added a commit to oxidecomputer/oxide-tokio-rt that referenced this pull request Apr 14, 2026
Tokio PR tokio-rs/tokio#7431, released in v1.51.0, changes the
multi-threaded runtime to allow tasks in the LIFO slot to participate in
work-stealing. Therefore, it should no longer be necessary to disable
the LIFO slot optimization, as the pathology where a task placed in the
LIFO slot can become permanently or semi-permanently stuck while the
task that notified it runs for a long time without yielding can no
longer occur.
hawkw added a commit to oxidecomputer/omicron that referenced this pull request Apr 15, 2026
update `oxide-tokio-rt` to v0.1.4, `tokio` to v1.52.0

This branch updates our dependency on `oxide-tokio-rt` to pick up Tokio
v1.52.0 and the corresponding changes to the default runtime settings in
`oxide-tokio-rt`. In particular, this allows us to pick up two of my
upstream fixes in Tokio for a pair of issues that have been major thorns
in our side for some time:

* Tokio PR tokio-rs/tokio#7431, released in [Tokio v1.51.0], changes the
  multi-threaded runtime to allow tasks in the LIFO slot to participate
  in work-stealing. Therefore, it should no longer be necessary to
  disable the LIFO slot optimization, as the pathology described in
  #8334, where a task placed in the LIFO slot can become permanently or
  semi-permanently stuck while the task that notified them runs for a
  long time without yielding, can no longer occur. `oxide-tokio-rt`
  v0.1.4 removes the runtime configuration to disable the LIFO slot as
  the issue has been fixed upstream.

* Tokio PR tokio-rs/tokio#8010, released in [Tokio v1.52.0], which adds
  eager handoff for the I/O and time drivers in the multi-threaded
  runtime. This is currently an experimental feature, although it is
  your author's opinion that this is really a fix for incorrect runtime
  behavior.[^1] It changes worker threads in the multi-threaded runtime
  to wake another worker prior to polling tasks if that worker had
  previously been parked on the I/O driver or timer wheel. Eagerly
  handing off these resources should prevent pathologies such as #9619.
  `oxide-tokio-rt` v0.1.4 enables this behavior by default.

Fixes #8334
Fixes #9619

[Tokio v1.52.0]:
  https://github.com/tokio-rs/tokio/releases/tag/tokio-1.52.0
[Tokio v1.51.0]:
  https://github.com/tokio-rs/tokio/releases/tag/tokio-1.51.0

[^1]: Cue @leftwo's dictum that a "performance regression" from fixing
      incorrect behavior...isn't a performance regression.
hawkw added a commit to oxidecomputer/omicron that referenced this pull request Apr 16, 2026
Commit #10272 updated our dependency on `oxide-tokio-rt` to v0.1.4 and
our `tokio` dependency to v1.52.0. This allowed us to pick up two of my
fixes for Tokio issues that have been a thorn in our side for a long
time, tokio-rs/tokio#7431 and tokio-rs/tokio#8010, which fix #8334 and
#9619, respectively. The nature of these fixes is described in greater
detail in #10272.

Unfortunately, #10272 had to be reverted (in #10279), since @iliana
discovered an unrelated regression in Tokio v1.52.0, tokio-rs/tokio#8056
(our issue #10277). This regression caused `spawn_blocking` to
occasionally hang, and was introduced in
tokio-rs/tokio@1604bc3 (PR
tokio-rs/tokio#7757).

I've since reverted this change upstream (tokio-rs/tokio#8057), and
published a patch release ([v1.52.1]), which fixes the regression.
Therefore, it is now once again safe to update our Tokio dependency to
pick up the other fixes. This commit does that. I've also confirmed that
the issue described in #10277 is no longer present in Tokio v1.52.1, as
demonstrated by the fact `cargo nextest run -p omicron-sled-agent
--stress-count 100 -- --exact artifact_store::test::issue_7796` now
succeeds without hanging once again.

Fixes #10272
hawkw added a commit that referenced this pull request Apr 20, 2026
Currently, the `rt_threaded::lifo_stealable` test I added in #7431
spawns an additional task which sleeps on a 4ms timer in a loop. This
ensures that no worker remains permanently parked. This was added
because it was necessary to stop the LIFO slot deadlock from occurring
prior to changes in the logic for determining whether to notify another
worker, which is what @Darksonn was referring to in [this comment][1].
Removing the `churn()` task makes the test actually validate that
another worker is notified to steal the LIFO task, and that the changes
from #7431 will *always* prevent a LIFO slot deadlock, regardless of the
behavior of other tasks on the runtime. See also [this comment][2] for
further discussion.

[1]: #7431 (comment)
[2]: #8069 (comment)

Labels

A-tokio Area: The main tokio crate C-enhancement Category: A PR with an enhancement or bugfix. M-runtime Module: tokio/runtime R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rt: make the LIFO slot in the multi-threaded scheduler stealable

5 participants