Reduce sharing in struct client between main and IO threads#15006
Reduce sharing in struct client between main and IO threads#15006abokhalill wants to merge 1 commit intoredis:unstablefrom
Conversation
Reorder fields in struct client so that IO thread hot fields (tid, running_tid, io_flags, read_error) start at cache line 1 instead of sharing cache line 0 with main thread hot fields. Under threaded I/O, the main thread writes flags on every command dispatch while the IO thread reads/writes io_flags on every I/O event. With both on cache line 0, each access triggers a cross core L1d invalidation. Before: 103.6 TSC/iter mean, 141 TSC p99 After: 88.9 TSC/iter mean, 94 TSC p99 Delta: -14.2% mean, -33.3% tail latency per atomic store Welch t-test: t=5.92, p=0.0001
|
Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset. In case there are security findings, they will be communicated to you as a comment inside the PR. Hope you’ll enjoy using Jit. Questions? Comments? Want to learn more? Get in touch with us. |
|
is there any performance improvement? |
|
@ShooterIT we've just kicked off the following Redis benchmarks on bare-metal AWS runners (3 datapoints each, both x86 m7i and ARM m8g) to verify the gains under high-concurrency io-threads workloads:
All four use the One caveat worth flagging: x86 Sapphire Rapids has 64-byte cache lines while ARM Neoverse V2 (m8g) has 128-byte cache lines. The Will report numbers once the runs complete (~30-45 min). |
CE Performance Automation : step 2 of 2 (benchmark) RUNNING...This comment was automatically generated given a benchmark was triggered. Started benchmark suite at 2026-04-10 13:34:18.218282 and took 602.649904 seconds up until now. In total will run 4 benchmarks. |
CE Performance Automation : step 1 of 2 (build) STARTING...This comment was automatically generated given a benchmark was triggered.
|
CE Performance Automation : step 1 of 2 (build) DONE.This comment was automatically generated given a benchmark was triggered.
You can check a comparison in detail via the grafana link |
|
@fcostaoliveira did the bare metal benchmark results come back? I see the automation completed on Apr 10-11 but didn't see the final numbers posted. I'm happy to address the ARM 128 byte cl concern with additional padding if needed. |
Cache line 0 of
struct clientmixes main thread hot fields (flags,conn)with IO-thread-hot fields (
tid,io_flags). Under threaded I/O, every commanddispatch invalidates the IO thread's L1d copy of the line and vice versa.
This patch reorders fields so IO-thread fields start at cache line 1. Pure field
reorder.
Microbenchmark (500K samples, pinned to separate physical cores, AMD Zen 4):
Welch's t = 5.92, p = 0.0001.
Builds clean, passes
redis-benchmark --threads 4with--io-threads 4.Found via static analysis of struct layout relative to cache line boundaries.
Note
Medium Risk
Reorders fields in the hot-path
clientstruct to change memory layout, which can subtly affect alignment/ABI assumptions and any code relying on field offsets, despite no logic changes.Overview
Reorders
struct clientfields inserver.hto separate main-thread hot fields (id/flags/conn/resp/db metadata) from IO-thread hot fields (tid/running_tid/io_flags/read_error) onto different cache lines, reducing false sharing under threaded I/O.Adds an explicit padding int (
_pad0) and cache-line grouping comments to keep alignment stable after the reorder.Reviewed by Cursor Bugbot for commit 649cc36. Bugbot is set up for automated code reviews on this repo. Configure here.