runtime: steal tasks from the LIFO slot #7431
Merged
Conversation
This commit changes the LIFO slot on multi-threaded runtime workers from being a mutable `Option<Notified<Arc<Handle>>>` to a new `AtomicNotified` type. This type implements a cell containing a nullable task pointer which can be swapped atomically. It's analogous to `AtomicCell` but with the extra `PhantomData` to remember the task's scheduler type parameter, which would otherwise be erased by the conversion into a `*mut Header` pointer. This change is in preparation for a subsequent change to allow work-stealing from the LIFO slot (see: #4941).
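For illustration only, here's a minimal standalone sketch of the shape such a cell takes; the `Header` unit struct below is a stand-in for tokio's internal task header, and the real `AtomicNotified` converts to and from `Notified<S>` rather than handing out raw pointers:

```rust
use std::marker::PhantomData;
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Stand-in for tokio's internal task header type; not the real thing.
struct Header;

// Sketch of an atomically swappable, nullable task-pointer cell. The
// `PhantomData<S>` remembers the scheduler type parameter that would
// otherwise be erased by storing a raw `*mut Header`.
struct AtomicNotified<S> {
    ptr: AtomicPtr<Header>,
    _scheduler: PhantomData<S>,
}

impl<S> AtomicNotified<S> {
    fn new() -> Self {
        Self {
            ptr: AtomicPtr::new(ptr::null_mut()),
            _scheduler: PhantomData,
        }
    }

    /// Atomically replaces the stored pointer, returning the previous one
    /// (null means the slot was empty).
    fn swap(&self, task: *mut Header) -> *mut Header {
        self.ptr.swap(task, Ordering::AcqRel)
    }

    /// Atomically empties the slot, returning whatever was in it.
    fn take(&self) -> *mut Header {
        self.ptr.swap(ptr::null_mut(), Ordering::AcqRel)
    }
}

fn main() {
    let slot: AtomicNotified<()> = AtomicNotified::new();
    let task = Box::into_raw(Box::new(Header));
    assert!(slot.swap(task).is_null()); // the slot started out empty
    let taken = slot.take();
    assert_eq!(taken, task);
    assert!(slot.take().is_null()); // and it's empty again
    unsafe { drop(Box::from_raw(taken)) }; // free the toy allocation
}
```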
This way, it's accessible by the stealer. Leave all the LIFO *accounting* (i.e. deciding whether we hit the LIFO slot or not) up to the worker. Gotta figure out whether the load of LIFO presence will race...ugh.
This commit adds a test ensuring that if a task is notified to the LIFO slot by another task which then blocks the worker thread forever, the LIFO task is eventually stolen by another worker. I've confirmed that this test fails on the `master` branch, and passes after these changes.
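For context, a rough standalone illustration of the scenario that test covers (this is not the actual test from the PR, and the channel/timeout choices here are mine):

```rust
use std::time::Duration;
use tokio::sync::oneshot;

fn main() {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(2)
        .enable_all()
        .build()
        .unwrap();

    rt.block_on(async {
        let (tx, rx) = oneshot::channel();

        // The task we expect to land in the notifier's LIFO slot.
        let lifo_task = tokio::spawn(async move {
            rx.await.unwrap();
        });

        // Notify the other task, then block this worker thread for a long
        // time without yielding.
        tokio::spawn(async move {
            tx.send(()).unwrap();
            std::thread::sleep(Duration::from_secs(60));
        });

        // Without LIFO-slot stealing, `lifo_task` can be stuck behind the
        // blocked worker; with it, another worker steals and runs it.
        tokio::time::timeout(Duration::from_secs(5), lifo_task)
            .await
            .expect("the LIFO task should be stolen and complete")
            .unwrap();
    });

    // Don't wait for the deliberately-blocked worker on shutdown.
    rt.shutdown_background();
}
```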
This test spawns a task that sometimes ends up in the LIFO slot and sometimes doesn't. This was previously fine as the LIFO slot didn't count for `worker_local_queue_depth`, but now it does. Thus, we have to make sure that task no longer exists before asserting about queue depth.
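The fix boils down to an ordering change; a minimal sketch of the pattern (not the actual test, which also queries the unstable `worker_local_queue_depth` metric):

```rust
#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
async fn main() {
    // This no-op task may or may not land in a worker's LIFO slot. Await it
    // so it is definitely gone before spawning anything we want to count.
    tokio::spawn(async {}).await.unwrap();

    // Only now spawn the tasks whose queue depth the real test asserts on.
    let handles: Vec<_> = (0..4).map(|_| tokio::spawn(async {})).collect();
    for handle in handles {
        handle.await.unwrap();
    }
}
```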
Member
Author
(shoutout to @mkeeter for nerd-sniping me into actually doing this)
ADD-SP reviewed Jul 2, 2025
Comment on lines +463 to +465
// If we also grabbed the task from the LIFO slot, include that in the
// steal count as well.
dst_stats.incr_steal_count(n as u16 + lifo.is_some() as u16);
Member
It's also possible to use `u16::from(bool)`, if you care about `cast_lossless` cleanliness.
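A standalone illustration of the suggestion (the names here are just for the demo); both forms produce the same value, `u16::from` simply avoids the `as` cast that the `cast_lossless` lint flags:

```rust
fn main() {
    let lifo_stolen = true; // stand-in for `lifo.is_some()`
    let n: usize = 3;       // stand-in for the number of queue tasks stolen

    assert_eq!(u16::from(lifo_stolen), lifo_stolen as u16);
    let steal_count = n as u16 + u16::from(lifo_stolen);
    assert_eq!(steal_count, 4);
    println!("steal count including the LIFO task: {steal_count}");
}
```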
zeenix pushed a commit to z-galaxy/busd that referenced this pull request Apr 9, 2026
Apply tokio update from #315 to test whether tokio 1.51.0's LIFO slot stealing change (tokio-rs/tokio#7431) is what was triggering the test flakiness that the port-conflict fix addresses. https://claude.ai/code/session_015hha9ShFQ1MdZrkQJMWTFH
zeenix pushed a commit to z-galaxy/busd that referenced this pull request Apr 9, 2026
Fix the multi_conenct test timeout introduced by the tokio 1.51.1 bump.

Under tokio 1.51's LIFO slot stealing (tokio-rs/tokio#7431), the socket-reader task zbus spawns inside `Connection::Builder::build()` can start running on another worker *before* `build()` returns to busd. If it reads a pipelined `Hello` and broadcasts it before busd has had a chance to call `MessageStream::from(peer.conn())` (which is what activates a receiver on the unfiltered broadcast channel), the message is silently dropped — zbus's socket reader swallows `TrySendError::Inactive` for the generic channel. The affected client then hangs forever waiting for its `Hello` reply.

The fix lives in zbus: `connection::Builder::build_message_stream` activates a receiver before the socket reader task is spawned, so no messages can be lost in that race window. See z-galaxy/zbus#1760.

On the busd side:
- `Peer::new` now returns `(Peer, Stream)` built via `build_message_stream`, with `Connection::from(&stream)` used to grab the connection for the `Peer` struct.
- `Peer::new_us` takes a `MessageStream` instead of a `Connection`, so the self-dial peer gets the same race-free treatment.
- `Peers::{add, add_us}` destructure the pair and drop the now-dead `peer.stream()` call.
- `src/bus/mod.rs` builds the self-dial `peer_stream` via `build_message_stream` too.

A `[patch.crates-io]` entry pins zbus to the PR commit; it will be removed once a zbus release containing `build_message_stream` is out.
zeenix pushed a commit to z-galaxy/busd that referenced this pull request Apr 10, 2026
Under tokio 1.51's LIFO slot stealing (tokio-rs/tokio#7431), the socket-reader task zbus spawns inside `Connection::Builder::build()` can start running on another worker before `build()` returns to busd. If it reads a pipelined `Hello` and broadcasts it before busd has activated a receiver via `MessageStream::from(peer.conn())`, the message is silently dropped — the affected client hangs forever waiting for its reply.

Use the new `Builder::build_message_stream` (z-galaxy/zbus#1760) which activates a receiver before the socket-reader task is spawned, closing the race window entirely.

- `Peer::new` returns `(Peer, Stream)` built via `build_message_stream`, with `Connection::from(&stream)` to extract the connection.
- `Peer::new_us` takes a `MessageStream` instead of a `Connection`.
- `Peers::{add, add_us}` destructure the pair; the now-unused `Peer::stream()` accessor is removed.
- `bus::Bus::for_address` builds the self-dial peer stream the same way.
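As a loose analogy only (this uses a plain tokio `broadcast` channel, not zbus's internal machinery or the `TrySendError::Inactive` path described above), the race comes down to subscribing before the producer can send:

```rust
use tokio::sync::broadcast;

#[tokio::main]
async fn main() {
    // Keep one receiver alive so `send` never fails for lack of receivers.
    let (tx, _keepalive) = broadcast::channel::<&'static str>(8);

    // Racy ordering: the producer sends before the consumer subscribes,
    // so the late subscriber never sees this message.
    tx.send("Hello").unwrap();
    let mut late_rx = tx.subscribe();
    assert!(late_rx.try_recv().is_err());

    // Race-free ordering: subscribe *before* the producer can send.
    let mut early_rx = tx.subscribe();
    tx.send("Hello").unwrap();
    assert_eq!(early_rx.recv().await.unwrap(), "Hello");
}
```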
hawkw added a commit to oxidecomputer/oxide-tokio-rt that referenced this pull request Apr 14, 2026
This updates our minimum Tokio version to [1.52.0]. This allows us to pick up two major fixes that will change our default runtime configuration:
- tokio-rs/tokio#8010 (released in [1.52.0]), which fixes oxidecomputer/omicron#9619 when its builder option is enabled,
- tokio-rs/tokio#7431 (released in [1.51.0]), which allows tasks in the LIFO slot to participate in work-stealing.

Subsequent commits will actually update our runtime settings after picking up these releases.

[1.52.0]: https://github.com/tokio-rs/tokio/releases/tag/tokio-1.52.0
[1.51.0]: https://github.com/tokio-rs/tokio/releases/tag/tokio-1.51.0
hawkw added a commit to oxidecomputer/oxide-tokio-rt that referenced this pull request Apr 14, 2026
Tokio PR tokio-rs/tokio#7431, released in v1.51.0, changes the multi-threaded runtime to allow tasks in the LIFO slot to participate in work-stealing. Therefore, it should no longer be necessary to disable the LIFO slot optimization, as the pathology in which a task placed in the LIFO slot can become permanently or semi-permanently stuck while the task that notified it runs for a long time without yielding can no longer occur.
hawkw added a commit to oxidecomputer/omicron that referenced this pull request Apr 15, 2026
update `oxide-tokio-rt` to v0.1.4, `tokio` to v1.52.0

This branch updates our dependency on `oxide-tokio-rt` to pick up Tokio v1.52.0 and the corresponding changes to the default runtime settings in `oxide-tokio-rt`. In particular, this allows us to pick up two of my upstream fixes in Tokio for a pair of issues that have been major thorns in our side for some time:

* Tokio PR tokio-rs/tokio#7431, released in [Tokio v1.51.0], changes the multi-threaded runtime to allow tasks in the LIFO slot to participate in work-stealing. Therefore, it should no longer be necessary to disable the LIFO slot optimization, as the pathology described in #8334, where a task placed in the LIFO slot can become permanently or semi-permanently stuck while the task that notified it runs for a long time without yielding, can no longer occur. `oxide-tokio-rt` v0.1.4 removes the runtime configuration to disable the LIFO slot as the issue has been fixed upstream.
* Tokio PR tokio-rs/tokio#8010, released in [Tokio v1.52.0], which adds eager handoff for the I/O and time drivers in the multi-threaded runtime. This is currently an experimental feature, although it is your author's opinion that this is really a fix for incorrect runtime behavior. It changes worker threads in the multi-threaded runtime to wake another worker prior to polling tasks if that worker had previously been parked on the I/O driver or timer wheel. Eagerly handing off these resources should prevent pathologies such as #9619. `oxide-tokio-rt` v0.1.4 enables this behavior by default.

Fixes #8334
Fixes #9619

[Tokio v1.52.0]: https://github.com/tokio-rs/tokio/releases/tag/tokio-1.52.0
[Tokio v1.51.0]: https://github.com/tokio-rs/tokio/releases/tag/tokio-1.51.0
hawkw added a commit to oxidecomputer/omicron that referenced this pull request Apr 15, 2026
This branch updates our dependency on `oxide-tokio-rt` to pick up Tokio v1.52.0 and the corresponding changes to the default runtime settings in `oxide-tokio-rt`. In particular, this allows us to pick up two of my upstream fixes in Tokio for a pair of issues that have been major thorns in our side for some time:

* Tokio PR tokio-rs/tokio#7431, released in [Tokio v1.51.0], changes the multi-threaded runtime to allow tasks in the LIFO slot to participate in work-stealing. Therefore, it should no longer be necessary to disable the LIFO slot optimization, as the pathology described in #8334, where a task placed in the LIFO slot can become permanently or semi-permanently stuck while the task that notified it runs for a long time without yielding, can no longer occur. `oxide-tokio-rt` v0.1.4 removes the runtime configuration to disable the LIFO slot as the issue has been fixed upstream.
* Tokio PR tokio-rs/tokio#8010, released in [Tokio v1.52.0], which adds eager handoff for the I/O and time drivers in the multi-threaded runtime. This is currently an experimental feature, although it is your author's opinion that this is really a fix for incorrect runtime behavior.[^1] It changes worker threads in the multi-threaded runtime to wake another worker prior to polling tasks if that worker had previously been parked on the I/O driver or timer wheel. Eagerly handing off these resources should prevent pathologies such as #9619. `oxide-tokio-rt` v0.1.4 enables this behavior by default.

Fixes #8334
Fixes #9619

[Tokio v1.52.0]: https://github.com/tokio-rs/tokio/releases/tag/tokio-1.52.0
[Tokio v1.51.0]: https://github.com/tokio-rs/tokio/releases/tag/tokio-1.51.0

[^1]: Cue @leftwo's dictum that a "performance regression" from fixing incorrect behavior...isn't a performance regression.
hawkw added a commit to oxidecomputer/omicron that referenced this pull request Apr 16, 2026
Commit #10272 updated our dependency on `oxide-tokio-rt` to v0.1.4 and our `tokio` dependency to v1.52.0. This allowed us to pick up two of my fixes for Tokio issues that have been a thorn in our side for a long time, tokio-rs/tokio#7431 and tokio-rs/tokio#8010, which fix #8334 and #9619, respectively. The nature of these fixes is described in greater detail in #10272.

Unfortunately, #10272 had to be reverted (in #10279), since @iliana discovered an unrelated regression in Tokio v1.52.0, tokio-rs/tokio#8056 (our issue #10277). This regression caused `spawn_blocking` to occasionally hang, and was introduced in tokio-rs/tokio@1604bc3 (PR tokio-rs/tokio#7757). I've since reverted this change upstream (tokio-rs/tokio#8057), and published a patch release ([v1.52.1]), which fixes the regression.

Therefore, it is now once again safe to update our Tokio dependency to pick up the other fixes. This commit does that. I've also confirmed that the issue described in #10277 is no longer present in Tokio v1.52.1, as demonstrated by the fact `cargo nextest run -p omicron-sled-agent --stress-count 100 -- --exact artifact_store::test::issue_7796` now succeeds without hanging once again.

Fixes #10272
hawkw added a commit that referenced this pull request Apr 20, 2026
Currently, the `rt_threaded::lifo_stealable` test I added in #7431 spawns an additional task which sleeps on a 4ms timer in a loop. This ensures that no worker remains permanently parked. This was added because it was necessary to stop the LIFO slot deadlock from occurring prior to changes in the logic for determining whether to notify another worker, which is what @Darksonn was referring to in [this comment][1].

Removing the `churn()` task makes the test actually validate that another worker is notified to steal the LIFO task, and that the changes from #7431 will *always* prevent a LIFO slot deadlock, regardless of the behavior of other tasks on the runtime. See also [this comment][2] for further discussion.

[1]: #7431 (comment)
[2]: #8069 (comment)
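For reference, a guess at the shape of that churn helper (the actual test code may differ):

```rust
use std::time::Duration;

// A task that wakes every 4 ms so no worker stays permanently parked.
async fn churn() {
    loop {
        tokio::time::sleep(Duration::from_millis(4)).await;
    }
}

#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
async fn main() {
    // Spawned alongside the tasks under test; aborted once we're done.
    let churner = tokio::spawn(churn());
    tokio::time::sleep(Duration::from_millis(20)).await;
    churner.abort();
}
```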

Motivation
Worker threads in the multi-threaded runtime include a per-worker LIFO
slot which stores the last task notified by another task running on that
worker. This allows the last-notified task to be polled first when the
currently running task completes, decreasing latency in message-passing
"ping-pong" scenarios.
However --- as described in #4941 --- there's an issue with this that
can cause severe problems in some situations: the task in the LIFO slot
cannot be stolen by a work-stealing worker thread. This means that if a
task notifies another task and then goes CPU bound for a long period of
time without yielding, the notified task will never be able to execute
until the task that notified it can yield. This can result in a very
severe latency bubble in some scenarios. See, for instance, #4323,
#6954, oxidecomputer/omicron#8334, etc.
As a workaround, PR #4936 added an unstable `runtime::Builder` option to
disable the LIFO slot. However, this is a less-than-ideal
solution, as it means that applications which disable the LIFO slot due
to occasional usage patterns that cause latency bubbles when it is
enabled cannot benefit from the potential latency improvements it
offers in other usage patterns. And, it's an unstable option which the
user has to discover. In most cases, people whose programs contain usage
patterns that are pathological with regards to the LIFO slot don't know
this ahead of time: the typical narrative is that you write code that
happens to follow such a pattern, discover an unexpected latency spike
or hang in production, and then learn how to disable the LIFO slot. It
would be much nicer if the task in the LIFO slot could participate in
work-stealing like every other task.
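For reference, the unstable opt-out from #4936 is wired up roughly like this (my sketch; it only compiles with `RUSTFLAGS="--cfg tokio_unstable"`):

```rust
fn main() {
    // Workaround described above: build a runtime with the LIFO slot
    // optimization disabled. With this PR, this should no longer be needed
    // for the pathological cases discussed here.
    let rt = tokio::runtime::Builder::new_multi_thread()
        .disable_lifo_slot()
        .enable_all()
        .build()
        .unwrap();

    rt.block_on(async {
        // application code goes here
    });
}
```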
Solution
This branch makes tasks in the LIFO slot stealable.
Broadly, I've taken the following approach:
- Add a new `AtomicNotified` type that implements an atomically
  swappable cell storing a `Notified` task (1220253), and use it to
  represent the LIFO slot instead of an `Option<Notified<Arc<Handle>>>`.
  This way, other workers can take a task out of the LIFO slot while
  work-stealing.
- Move the LIFO slot out of the `worker::Core` struct and into the run
  queue's `Inner` type (75d8116), making it shared state between the
  `Local` side of the queue owned by the worker itself and the `Steal`
  side used by remote workers to steal from the queue.
  There's a bunch of additional code in `worker::Core` for managing
  whether to actually run a task from the LIFO slot or not. I opted
  not to move any of this code into the run queue itself, as it
  depends on other bits of internal worker state. Instead, we just
  expose to the worker separate APIs for pushing/popping to/from the
  main queue and for pushing/popping to/from the LIFO slot, resulting
  in a fairly small diff to the worker's run loop.
- Change the work-stealing code to also steal the LIFO task (730a581
  and cb27dda). This actually turned out to be pretty straightforward:
  once we've stolen a chunk of tasks from the targeted worker's run
  queue, we now also grab whatever's in its LIFO slot as well. If we
  stole a LIFO task, it's returned from the `steal_into` method in lieu
  of the first task in the run queue, so that it gets to execute first,
  maintaining the latency improvement for recently-notified tasks. This
  also was simpler than trying to wedge it into the chunk of tasks to
  be pushed to the stealer's queue; a simplified standalone sketch of
  the flow follows below.
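Here's a deliberately simplified standalone model of that flow; tokio's real implementation steals into a fixed-size ring with atomic head/tail indices and takes the LIFO task from the `AtomicNotified` cell, but the ordering idea is the same:

```rust
use std::collections::VecDeque;

struct Task(u32);

// Toy stand-ins for the victim worker's queue state.
struct Victim {
    run_queue: VecDeque<Task>,
    lifo_slot: Option<Task>,
}

fn steal_into(victim: &mut Victim, dst: &mut VecDeque<Task>) -> Option<Task> {
    // Steal roughly half of the victim's run queue into our own queue.
    let n = victim.run_queue.len() / 2;
    for _ in 0..n {
        if let Some(task) = victim.run_queue.pop_front() {
            dst.push_back(task);
        }
    }

    // New in this change: also take whatever is in the victim's LIFO slot.
    let lifo = victim.lifo_slot.take();

    // If we got a LIFO task, return it so it runs first on the stealer,
    // keeping the latency benefit for recently-notified tasks; otherwise
    // fall back to the first task stolen from the queue.
    lifo.or_else(|| dst.pop_front())
}

fn main() {
    let mut victim = Victim {
        run_queue: (0..4).map(Task).collect(),
        lifo_slot: Some(Task(99)),
    };
    let mut local = VecDeque::new();

    let next = steal_into(&mut victim, &mut local).expect("stole something");
    assert_eq!(next.0, 99); // the LIFO task gets to run first
    assert_eq!(local.len(), 2); // plus half of the victim's queue
}
```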
I've made the following test changes while working on this branch:
- Added a test for the case where the LIFO slot is blocked from running
  when a task notifies it and then blocks indefinitely. I've confirmed
  that this test fails on `master` and passes on this branch.
  These are in addition to the existing work-stealing Loom tests, as
  tasks notified by I/O or timers are still woken to the normal run
  queue.
- Adjusted the `rt_unstable_metrics::worker_local_queue_depth` integration
  test, which was necessary as tasks in the LIFO slot now "count" towards
  the worker's queue depth. We now have to make sure the no-op task that's
  spawned has completed before spawning the tasks we actually attempt to
  count, as it seems to sometimes end up in the LIFO slot and sometimes
  not, causing the test to flake out.
Fixes #4941