Reduce sharing in struct client between main and IO threads#15006

Open
abokhalill wants to merge 1 commit into redis:unstable from abokhalill:fix/client-struct-false-sharing

Conversation


@abokhalill abokhalill commented Apr 6, 2026

Cache line 0 of struct client mixes main-thread-hot fields (flags, conn)
with IO-thread-hot fields (tid, io_flags). Under threaded I/O, every command
dispatch invalidates the IO thread's L1d copy of the line, and vice versa.

This patch reorders fields so that the IO-thread fields start at cache line 1.
It is a pure field reorder; no logic changes.

Microbenchmark (500K samples, pinned to separate physical cores, AMD Zen 4):

|               | Shared line | Isolated | Delta |
|---------------|-------------|----------|-------|
| Mean TSC/iter | 103.6       | 88.9     | −14%  |
| p99 TSC/iter  | 141         | 94       | −33%  |

Welch's t = 5.92, p = 0.0001.

Builds clean and passes redis-benchmark --threads 4 against a server running with --io-threads 4.

Found via static analysis of struct layout relative to cache line boundaries.


Note

Medium Risk
Reorders fields in the hot-path client struct to change memory layout, which can subtly affect alignment/ABI assumptions and any code relying on field offsets, despite no logic changes.

Overview
Reorders struct client fields in server.h to separate main-thread hot fields (id/flags/conn/resp/db metadata) from IO-thread hot fields (tid/running_tid/io_flags/read_error) onto different cache lines, reducing false sharing under threaded I/O.

Adds an explicit padding int (_pad0) and cache-line grouping comments to keep alignment stable after the reorder.

Reviewed by Cursor Bugbot for commit 649cc36.

Reorder fields in struct client so that IO thread hot fields (tid,
running_tid, io_flags, read_error) start at cache line 1 instead of
sharing cache line 0 with main thread hot fields.

Under threaded I/O, the main thread writes flags on every command
dispatch while an IO thread reads/writes io_flags on every I/O event.
With both on cache line 0, each access triggers a cross-core L1d
invalidation.

Before: 103.6 TSC/iter mean, 141 TSC p99
After:  88.9 TSC/iter mean, 94 TSC p99
Delta: -14.2% mean, -33.3% tail latency per atomic store
Welch t-test: t=5.92, p=0.0001

CLAassistant commented Apr 6, 2026

CLA assistant check
All committers have signed the CLA.


jit-ci Bot commented Apr 6, 2026

Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset.

In case there are security findings, they will be communicated to you as a comment inside the PR.

Hope you’ll enjoy using Jit.


@ShooterIT
Member

Is there any performance improvement?

@fcostaoliveira
Collaborator

@ShooterIT we've just kicked off the following Redis benchmarks on bare-metal AWS runners (3 datapoints each, both x86 m7i and ARM m8g) to verify the gains under high-concurrency io-threads workloads:

  • memtier_benchmark-1Mkeys-string-setget2000c-1KiB-pipeline-16 (2000 conns, pipeline-16)
  • memtier_benchmark-1Mkeys-string-setget2000c-1KiB-pipeline-10 (2000 conns, pipeline-10)
  • memtier_benchmark-3Mkeys-string-mixed-20-80-with-512B-values-pipeline-10-5200_conns (5200 conns, mixed)
  • memtier_benchmark-3Mkeys-string-get-with-1KiB-values-pipeline-10-2000_conns (2000 conns, read-only)

All four use the oss-standalone-12/16-io-threads topologies — these should stress the main↔IO-thread false-sharing window the most.

One caveat worth flagging: x86 Sapphire Rapids has 64-byte cache lines while ARM Neoverse V2 (m8g) has 128-byte cache lines. The _pad0 separation may not be sufficient on ARM if the surrounding fields still land within the same 128-byte line — so the gains might hold on x86 but be flat (or need extra padding) on ARM. We'll see what the data shows.

Will report numbers once the runs complete (~30-45 min).

@fcostaoliveira
Collaborator

fcostaoliveira commented Apr 10, 2026

CE Performance Automation : step 2 of 2 (benchmark) RUNNING...

This comment was automatically generated given a benchmark was triggered.

Started benchmark suite at 2026-04-10 13:34:18.218282 and took 602.649904 seconds up until now.
Status: [####################------------------------------------------------------------] 25.0% completed.

In total will run 4 benchmarks.
- 3 pending.
- 1 completed:
- 1 successful.
- 0 failed.
You can check the status in detail via the Grafana link.

@fcostaoliveira
Collaborator

CE Performance Automation : step 1 of 2 (build) STARTING...

This comment was automatically generated given a benchmark was triggered.
Started building at 2026-04-10 14:54:53.659481
You can check each build/benchmark progress in grafana:

  • git hash: ae95526
  • git branch: unstable
  • commit date and time: 2026-04-09 17:58:37+03:00
  • commit summary: RED-183356: Automate tarball creation (RED-183356: Automate tarball creation #14911)
  • test filters:
    • command priority lower limit: 0
    • command priority upper limit: 100000
    • test name regex: memtier_benchmark-1Mkeys-string-setget2000c-1KiB-pipeline-16|memtier_benchmark-1Mkeys-string-setget2000c-1KiB-pipeline-10|memtier_benchmark-3Mkeys-string-mixed-20-80-with-512B-values-pipeline-10-5200_conns|memtier_benchmark-3Mkeys-string-get-with-1KiB-values-pipeline-10-2000_conns
    • command group regex: .*

@fcostaoliveira
Collaborator

fcostaoliveira commented Apr 10, 2026

CE Performance Automation : step 1 of 2 (build) DONE.

This comment was automatically generated given a benchmark was triggered.
Started building at 2026-04-11 19:56:20.515020 and took 58 seconds.
You can check each build/benchmark progress in grafana:

  • git hash: 0d85627
  • git branch: unstable
  • commit date and time: 2026-04-10 23:25:56+08:00
  • commit summary: Use no_value dict type for stream_idmp_keys to explicitly mark it as a key-only set (Use no_value dict type for stream_idmp_keys to explicitly mark it as a key-only set #14987)
  • test filters:
    • command priority lower limit: 0
    • command priority upper limit: 100000
    • test name regex: memtier_benchmark-1Mkeys-string-setget2000c-1KiB-pipeline-16|memtier_benchmark-1Mkeys-string-setget2000c-1KiB-pipeline-10|memtier_benchmark-3Mkeys-string-get-with-1KiB-values-pipeline-10-2000_conns
    • command group regex: .*

You can check the comparison in detail via the Grafana link.

@abokhalill
Author

@fcostaoliveira did the bare-metal benchmark results come back? I see the automation completed on Apr 10-11, but I didn't see the final numbers posted. Happy to address the ARM 128-byte cache-line concern with additional padding if needed.
