runtime: steal tasks from the LIFO slot #7431

Merged
Darksonn merged 26 commits into master from eliza/lifo-steal on Apr 3, 2026

Conversation

@hawkw (Member) commented Jun 27, 2025

Motivation

Worker threads in the multi-threaded runtime include a per-worker LIFO
slot which stores the last task notified by another task running on that
worker. This allows the last-notified task to be polled first when the
currently running task completes, decreasing latency in message-passing
"ping-pong" scenarios.

However --- as described in #4941 --- there's an issue with this that
can cause severe problems in some situations: the task in the LIFO slot
cannot be stolen by a work-stealing worker thread. This means that if a
task notifies another task and then goes CPU bound for a long period of
time without yielding, the notified task will never be able to execute
until the task that notified it can yield. This can result in a very
severe latency bubble in some scenarios. See, for instance, #4323,
#6954, oxidecomputer/omicron#8334, etc.
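
As a minimal sketch (illustrative only, not the regression test added on this branch), the pathological pattern looks like this: one task wakes another task, placing it in the current worker's LIFO slot, and then goes CPU-bound without yielding.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use tokio::sync::oneshot;

#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
async fn main() {
    let (tx, rx) = oneshot::channel();
    let (done_tx, done_rx) = oneshot::channel();
    let stop = Arc::new(AtomicBool::new(false));
    let stop2 = stop.clone();

    // The task that will be woken into the LIFO slot.
    tokio::spawn(async move {
        rx.await.unwrap();
        done_tx.send(()).unwrap();
    });

    // Give the receiver a chance to register its waker first.
    tokio::task::yield_now().await;

    tokio::spawn(async move {
        // Waking another task from inside a running task places the
        // woken task in this worker's LIFO slot...
        tx.send(()).unwrap();
        // ...and then this CPU-bound loop never yields. Before this
        // branch, the LIFO task could not be stolen, so `done_rx`
        // below could wait indefinitely.
        while !stop2.load(Ordering::Relaxed) {
            std::hint::spin_loop();
        }
    });

    done_rx.await.unwrap();
    stop.store(true, Ordering::Relaxed); // let the spinner exit cleanly
    println!("the notified task was stolen and ran");
}
```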

As a workaround, PR #4936 added an unstable runtime::Builder option
to disable the LIFO slot. However, this is a less-than-ideal
solution, as it means that applications which disable the LIFO slot due
to occasional usage patterns that cause latency bubbles when it is
enabled cannot benefit from the potential latency improvements it
offers in other usage patterns. And, it's an unstable option which the
user has to discover. In most cases, people whose programs contain usage
patterns that are pathological with regards to the LIFO slot don't know
this ahead of time: the typical narrative is that you write code that
happens to follow such a pattern, discover an unexpected latency spike
or hang in production, and then learn how to disable the LIFO slot. It
would be much nicer if the task in the LIFO slot could participate in
work-stealing like every other task.
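
For reference, the workaround from #4936 looks like this; it is only available when building with the `tokio_unstable` cfg:

```rust
fn main() {
    // The opt-out added by #4936. This only compiles with
    // RUSTFLAGS="--cfg tokio_unstable".
    let rt = tokio::runtime::Builder::new_multi_thread()
        .disable_lifo_slot()
        .enable_all()
        .build()
        .unwrap();
    rt.block_on(async { /* application code */ });
}
```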

Solution

This branch makes tasks in the LIFO slot stealable.

Broadly, I've taken the following approach:

  1. Add a new AtomicNotified type that implements an atomically
    swappable cell storing a Notified task
    (1220253), and use it to represent
    the LIFO slot instead of an Option<Notified<Arc<Handle>>>. This
    way, other workers can take a task out of the LIFO slot while
    work-stealing.

  2. Move the LIFO slot out of the worker::Core struct and into the run
    queue's Inner type (75d8116),
    making it shared state between the Local side of the queue owned by
    the worker itself and the Steal side used by remote workers to
    steal from the queue.

    There's a bunch of additional code in worker::Core for managing
    whether to actually run a task from the LIFO slot or not. I opted
    not to move any of this code into the run queue itself, as it
    depends on other bits of internal worker state. Instead, we just
    expose to the worker separate APIs for pushing/popping to/from the
    main queue and for pushing/popping to/from the LIFO slot, resulting
    in a fairly small diff to the worker's run loop.

  3. Change the work-stealing code to also steal the LIFO task
    (730a581 and
    cb27dda). This actually turned out
    to be pretty straightforward: once we've stolen a chunk of tasks
    from the targeted worker's run queue, we now also grab whatever's in
    its LIFO slot as well. If we stole a LIFO task, it's returned from
    the steal_into method in lieu of the first task in the run queue, so
    that it gets to execute first, maintaining the latency improvement for
    recently-notified tasks. This also was simpler than trying to wedge it
    into the chunk of tasks to be pushed to the stealer's queue. (See the
    sketch just after this list.)
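
To make the shape of items 1 and 3 concrete, here is a hypothetical sketch of an atomically swappable slot. The names `Header` and `AtomicSlot` below are illustrative stand-ins; the real `AtomicNotified` converts a `Notified<Arc<Handle>>` to and from a raw task-header pointer.

```rust
use std::marker::PhantomData;
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// Hypothetical stand-in for tokio's internal `task::Header`.
struct Header;

/// The rough shape of the new cell: a nullable task pointer that can be
/// swapped atomically. `PhantomData<S>` retains the scheduler type
/// parameter that converting into a raw `*mut Header` would erase.
struct AtomicSlot<S> {
    ptr: AtomicPtr<Header>,
    _scheduler: PhantomData<S>,
}

impl<S> AtomicSlot<S> {
    fn new() -> Self {
        Self {
            ptr: AtomicPtr::new(ptr::null_mut()),
            _scheduler: PhantomData,
        }
    }

    /// The owning worker pushes by swapping a task in, getting back
    /// whatever the slot previously held (null if it was empty).
    fn swap(&self, task: *mut Header) -> *mut Header {
        self.ptr.swap(task, Ordering::AcqRel)
    }

    /// Emptying the slot is a swap with null. Because this is a single
    /// atomic operation, a remote stealer can safely race the owning
    /// worker for the LIFO task.
    fn take(&self) -> *mut Header {
        self.swap(ptr::null_mut())
    }
}

fn main() {
    // Demonstration only: in the real runtime these pointers come from
    // `Notified` tasks, not from `Box::into_raw`.
    let slot: AtomicSlot<()> = AtomicSlot::new();
    let task = Box::into_raw(Box::new(Header));
    assert!(slot.swap(task).is_null());
    let stolen = slot.take();
    assert_eq!(stolen, task);
    assert!(slot.take().is_null());
    unsafe { drop(Box::from_raw(stolen)) };
}
```

On the steal path, the stealer batch-steals from the run queue as before and then `take()`s this slot; if that returns a task, it is the one handed back from steal_into to run first.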

I've made the following test changes while working on this branch:

  • Added an integration test that reproduces a scenario where a task in
    the LIFO slot is blocked from running when a task notifies it and then
    blocks indefinitely. I've confirmed that this test fails on master
    and passes on this branch.
  • Added new Loom tests for work-stealing involving the LIFO slot.
    These are in addition to the existing work-stealing Loom tests, as
    tasks notified by I/O or timers are still woken to the normal run
    queue.
  • A small change to the
    rt_unstable_metrics::worker_local_queue_depth integration test,
    which was necessary as tasks in the LIFO slot now "count" towards the
    worker's queue depth. We now have to make sure the no-op task that's
    spawned has completed before spawning the tasks we actually attempt to
    count, as it seems to sometimes end up in the LIFO slot and sometimes
    not, causing the test to flake out. (The fix is sketched below.)
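
A sketch of the shape of that fix; `worker_local_queue_depth` is the real (unstable) metric name, while the surrounding code is schematic:

```rust
#[tokio::main]
async fn main() {
    // The no-op task may or may not land in the LIFO slot, which now
    // counts toward the worker's local queue depth. Awaiting its
    // completion before spawning the tasks the test measures removes
    // the nondeterminism.
    tokio::spawn(async {}).await.unwrap();
    // ...now spawn the tasks the test actually counts, then assert on
    // the unstable worker_local_queue_depth metric...
}
```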

Fixes #4941

hawkw added 3 commits June 27, 2025 14:15
This commit changes the LIFO slot on multi-threaded runtime workers from
being a mutable `Option<Notified<Arc<Handle>>>` to a new
`AtomicNotified` type. This
type implements a cell containing a nullable task pointer which can
be swapped atomically. It's analogous to `AtomicCell` but with the extra
`PhantomData` to remember the task's scheduler type parameter, which
would otherwise be erased by the conversion into a `*mut Header`
pointer.

This change is in preparation for a subsequent change to allow
work-stealing from the LIFO slot (see: #4941).
This way, it's accessible by the stealer. Leave all the LIFO *accounting*
(i.e. deciding whether we hit the LIFO slot or not) up to the worker.

Gotta figure out whether the load of lifo presence will race...ugh.
@github-actions github-actions Bot added R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt labels Jun 27, 2025
@hawkw hawkw removed the R-loom-current-thread Run loom current-thread tests on this PR label Jun 30, 2025
@github-actions github-actions Bot added the R-loom-current-thread Run loom current-thread tests on this PR label Jun 30, 2025
This commit adds a test ensuring that if a task is notified to the LIFO
slot by another task which then blocks the worker thread forever, the
LIFO task is eventually stolen by another worker. I've confirmed that
this test fails on the `master` branch, and passes after these changes.
hawkw and others added 5 commits June 30, 2025 12:53
This test spawns a task that sometimes ends up in the LIFO slot and
sometimes doesn't. This was previously fine as the LIFO slot didn't
count for `worker_local_queue_depth`, but now it does. Thus, we have to
make sure that task no longer exists before asserting about queue depth.
@hawkw hawkw changed the title [WIP] runtime: steal from the LIFO slot runtime: steal tasks from the LIFO slot Jul 1, 2025
@hawkw hawkw self-assigned this Jul 1, 2025
@hawkw hawkw added C-enhancement Category: A PR with an enhancement or bugfix. A-tokio Area: The main tokio crate M-runtime Module: tokio/runtime labels Jul 1, 2025
@hawkw hawkw marked this pull request as ready for review July 1, 2025 19:39
@hawkw hawkw requested review from ADD-SP, Darksonn and carllerche July 1, 2025 19:39
@hawkw (Member, Author) commented Jul 1, 2025

(shoutout to @mkeeter for nerd-sniping me into actually doing this)

Comment on lines +463 to +465
// If we also grabbed the task from the LIFO slot, include that in the
// steal count as well.
dst_stats.incr_steal_count(n as u16 + lifo.is_some() as u16);
Member


For reviewers: Rust defines the cast from bool to u16.


It's also possible to use u16::from(bool), if you care about cast_lossless cleanliness.
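
Both spellings compile to the same thing; a quick illustration:

```rust
fn main() {
    let n: u16 = 3;
    let lifo = Some(());
    // `bool as u16` is well-defined (false -> 0, true -> 1)...
    assert_eq!(n + lifo.is_some() as u16, 4);
    // ...and `u16::from` expresses the same widening without an `as`
    // cast, which keeps clippy's `cast_lossless` lint quiet.
    assert_eq!(n + u16::from(lifo.is_some()), 4);
}
```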

Comment thread tokio/src/runtime/scheduler/multi_thread/worker.rs Outdated
@Darksonn Darksonn merged commit eeb55c7 into master Apr 3, 2026
94 checks passed
@Darksonn Darksonn deleted the eliza/lifo-steal branch April 3, 2026 07:12
zeenix pushed a commit to z-galaxy/busd that referenced this pull request Apr 9, 2026
Apply tokio update from #315 to test whether tokio 1.51.0's LIFO slot
stealing change (tokio-rs/tokio#7431) is what was triggering the test
flakiness that the port-conflict fix addresses.

https://claude.ai/code/session_015hha9ShFQ1MdZrkQJMWTFH
zeenix pushed a commit to z-galaxy/busd that referenced this pull request Apr 9, 2026
Fix the multi_connect test timeout introduced by the tokio 1.51.1 bump.

Under tokio 1.51's LIFO slot stealing (tokio-rs/tokio#7431), the
socket-reader task zbus spawns inside `Connection::Builder::build()` can
start running on another worker *before* `build()` returns to busd.
If it reads a pipelined `Hello` and broadcasts it before busd has had a
chance to call `MessageStream::from(peer.conn())` (which is what
activates a receiver on the unfiltered broadcast channel), the message
is silently dropped — zbus's socket reader swallows
`TrySendError::Inactive` for the generic channel. The affected client
then hangs forever waiting for its `Hello` reply.

The fix lives in zbus: `connection::Builder::build_message_stream`
activates a receiver before the socket reader task is spawned, so no
messages can be lost in that race window. See z-galaxy/zbus#1760.

On the busd side:

- `Peer::new` now returns `(Peer, Stream)` built via
  `build_message_stream`, with `Connection::from(&stream)` used to grab
  the connection for the `Peer` struct.
- `Peer::new_us` takes a `MessageStream` instead of a `Connection`, so
  the self-dial peer gets the same race-free treatment.
- `Peers::{add, add_us}` destructure the pair and drop the now-dead
  `peer.stream()` call.
- `src/bus/mod.rs` builds the self-dial `peer_stream` via
  `build_message_stream` too.

A `[patch.crates-io]` entry pins zbus to the PR commit; it will be
removed once a zbus release containing `build_message_stream` is out.
zeenix pushed a commit to z-galaxy/busd that referenced this pull request Apr 10, 2026
Under tokio 1.51's LIFO slot stealing (tokio-rs/tokio#7431), the
socket-reader task zbus spawns inside `Connection::Builder::build()` can
start running on another worker before `build()` returns to busd. If it
reads a pipelined `Hello` and broadcasts it before busd has activated a
receiver via `MessageStream::from(peer.conn())`, the message is silently
dropped — the affected client hangs forever waiting for its reply.

Use the new `Builder::build_message_stream` (z-galaxy/zbus#1760) which
activates a receiver before the socket-reader task is spawned, closing
the race window entirely.

- `Peer::new` returns `(Peer, Stream)` built via `build_message_stream`,
  with `Connection::from(&stream)` to extract the connection.
- `Peer::new_us` takes a `MessageStream` instead of a `Connection`.
- `Peers::{add, add_us}` destructure the pair; the now-unused
  `Peer::stream()` accessor is removed.
- `bus::Bus::for_address` builds the self-dial peer stream the same way.
hawkw added a commit to oxidecomputer/oxide-tokio-rt that referenced this pull request Apr 14, 2026
This updates our minimum Tokio version to [1.52.0]. This allows us to
pick up two major fixes that will change our default runtime
configuration:

- tokio-rs/tokio#8010 (released in [1.52.0]) which fixes
  oxidecomputer/omicron#9619 when its builder option is enabled,
- tokio-rs/tokio#7431 (released in [1.51.0]), which allows tasks in the
  LIFO slot to participate in work-stealing.

Subsequent commits will actually update our runtime settings after
picking up these releases.

[1.52.0]: https://github.com/tokio-rs/tokio/releases/tag/tokio-1.52.0
[1.51.0]: https://github.com/tokio-rs/tokio/releases/tag/tokio-1.51.0
hawkw added a commit to oxidecomputer/oxide-tokio-rt that referenced this pull request Apr 14, 2026
Tokio PR tokio-rs/tokio#7431, released in v1.51.0, changes the
multi-threaded runtime to allow tasks in the LIFO slot to participate in
work-stealing. Therefore, it should no longer be necessary to disable
the LIFO slot optimization, as the pathology where a task placed in the
LIFO slot can become permanently or semi-permanently stuck while the
task that notified it runs for a long time without yielding can no
longer occur.
hawkw added a commit to oxidecomputer/omicron that referenced this pull request Apr 15, 2026
update `oxide-tokio-rt` to v0.1.4, `tokio` to v1.52.0

This branch updates our dependency on `oxide-tokio-rt` to pick up Tokio
v1.52.0 and the corresponding changes to the default runtime settings in
`oxide-tokio-rt`. In particular, this allows us to pick up two of my
upstream fixes in Tokio for a pair of issues that have been major thorns
in our side for some time:

* Tokio PR tokio-rs/tokio#7431, released in [Tokio v1.51.0], changes the
  multi-threaded runtime to allow tasks in the LIFO slot to participate
  in work-stealing. Therefore, it should no longer be necessary to
  disable the LIFO slot optimization, as the pathology described in
  #8334, where a task placed in the LIFO slot can become permanently or
  semi-permanently stuck while the task that notified them runs for a
  long time without yielding, can no longer occur. `oxide-tokio-rt`
  v0.1.4 removes the runtime configuration to disable the LIFO slot as
  the issue has been fixed upstream.

* Tokio PR tokio-rs/tokio#8010, released in [Tokio v1.52.0], which adds
  eager handoff for the I/O and time drivers in the multi-threaded
  runtime. This is currently an experimental feature, although it is
  your author's opinion that this is really a fix for incorrect runtime
  behavior.[^1] It changes worker threads in the multi-threaded runtime
  to wake another worker prior to polling tasks if that worker had
  previously been parked on the I/O driver or timer wheel. Eagerly
  handing off these resources should prevent pathologies such as #9619.
  `oxide-tokio-rt` v0.1.4 enables this behavior by default.

Fixes #8334
Fixes #9619

[Tokio v1.52.0]:
  https://github.com/tokio-rs/tokio/releases/tag/tokio-1.52.0
[Tokio v1.51.0]:
  https://github.com/tokio-rs/tokio/releases/tag/tokio-1.51.0

[^1]: Cue @leftwo's dictum that a "performance regression" from fixing
      incorrect behavior...isn't a performance regression.
hawkw added a commit to oxidecomputer/omicron that referenced this pull request Apr 16, 2026
Commit #10272 updated our dependency on `oxide-tokio-rt` to v0.1.4 and
our `tokio` dependency to v1.52.0. This allowed us to pick up two of my
fixes for Tokio issues that have been a thorn in our side for a long
time, tokio-rs/tokio#7431 and tokio-rs/tokio#8010, which fix #8334 and
#9619, respectively. The nature of these fixes is described in greater
detail in #10272.

Unfortunately, #10272 had to be reverted (in #10279), since @iliana
discovered an unrelated regression in Tokio v1.52.0, tokio-rs/tokio#8056
(our issue #10277). This regression caused `spawn_blocking` to
occasionally hang, and was introduced in
tokio-rs/tokio@1604bc3 (PR
tokio-rs/tokio#7757).

I've since reverted this change upstream (tokio-rs/tokio#8057), and
published a patch release ([v1.52.1]), which fixes the regression.
Therefore, it is now once again safe to update our Tokio dependency to
pick up the other fixes. This commit does that. I've also confirmed that
the issue described in #10277 is no longer present in Tokio v1.52.1, as
demonstrated by the fact `cargo nextest run -p omicron-sled-agent
--stress-count 100 -- --exact artifact_store::test::issue_7796` now
succeeds without hanging once again.

Fixes #10272
hawkw added a commit that referenced this pull request Apr 20, 2026
Currently, the `rt_threaded::lifo_stealable` test I added in #7431
spawns an additional task which sleeps on a 4ms timer in a loop. This
ensures that no worker remains permanently parked. This was added
because it was necessary to stop the LIFO slot deadlock from occurring
prior to changes in the logic for determining whether to notify another
worker, which is what @Darksonn was referring to in [this comment][1].
Removing the `churn()` task makes the test actually validate that
another worker is notified to steal the LIFO task, and that the changes
from #7431 will *always* prevent a LIFO slot deadlock, regardless of the
behavior of other tasks on the runtime. See also [this comment][2] for
further discussion.

[1]: #7431 (comment)
[2]: #8069 (comment)

Labels

A-tokio Area: The main tokio crate C-enhancement Category: A PR with an enhancement or bugfix. M-runtime Module: tokio/runtime R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rt: make the LIFO slot in the multi-threaded scheduler stealable

5 participants