Tear down connection on write failure to prevent queue desync by Pranish-Pantha · Pull Request #3092 · StackExchange/StackExchange.Redis

Pranish-Pantha · 2026-05-26T23:22:34Z

Summary

Message.WriteTo now re-throws after calling Fail(), so PhysicalBridge's outer write path can act on the failure instead of seeing a phantom success.
PhysicalBridge.HandleWriteException takes the PhysicalConnection and calls RecordConnectionFailed(ConnectionFailureType.InternalFailure, ...), tearing the connection down and draining the response queue with failures.
All callers of HandleWriteException (sync, async, backlog, post-flush completion) thread the connection through.

Fixes the underlying cause behind the symptoms reported in #2883, #2804, #2919, #2913 — without depending on HighIntegrity mode (which catches the desync after the fact rather than preventing it).

See accompanying issue for the full root-cause writeup.
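
For readers skimming the description, the first Summary item (fail the message, then re-throw) can be illustrated with a small toy model. This is a Python sketch, not the library's C# code: `ToyMessage`, `CommandError`, and `out_of_memory` are hypothetical names invented for illustration, and `CommandError` only stands in for the role RedisCommandException plays in the catch filter.

```python
class CommandError(Exception):
    """Hypothetical stand-in for RedisCommandException."""


class ToyMessage:
    """Hypothetical model of a message; not the library's Message class."""

    def __init__(self, write_impl):
        self._write_impl = write_impl  # callable that serializes onto the wire
        self.failure = None            # set when this message's awaiter is faulted

    def fail(self, exc):
        self.failure = exc

    def write_to(self, wire):
        try:
            self._write_impl(wire)
        except CommandError:
            raise              # excluded from the filter: surfaces untouched
        except Exception as exc:
            self.fail(exc)     # fault this message's own awaiter
            raise              # and still tell the caller the write failed


def out_of_memory(wire):
    raise MemoryError("simulated failure during serialization")


message = ToyMessage(out_of_memory)
try:
    message.write_to([])
    caller_saw_failure = False
except MemoryError:
    caller_saw_failure = True
```

In this sketch the caller observes the failure and the message is also marked failed. Removing the final `raise` would model the old behavior, where the caller sees nothing wrong.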

Why

If WriteImpl throws partway through a frame (OOM during serialization is the most-reported trigger, but any exception out of WriteImpl does it), the old WriteTo swallowed the exception. The bridge then saw WriteResult.Success, kept the message at the head of the in-flight queue, and left the socket connected. Any subsequent reply from the server matched against the wrong message.
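
The off-by-one matching described above can be reproduced with a toy FIFO model. This is a Python sketch under stated assumptions, not the library's code: `simulate` is a hypothetical helper, the slot is assumed to be reserved in the in-flight queue before the write, and the failed write is simplified to "nothing reaches the server" (the real failure can also leave a partial frame on the wire).

```python
from collections import deque


def simulate(commands, failing):
    """Toy FIFO model: returns {message: reply it was matched with}."""
    in_flight = deque()  # messages awaiting replies, oldest first
    wire = []            # commands the toy server actually received
    for cmd in commands:
        in_flight.append(cmd)  # assumption: slot reserved before the write
        if cmd == failing:
            # The write threw and the exception was swallowed:
            # the server never sees this command, but its slot stays queued.
            continue
        wire.append(cmd)

    matched = {}
    for cmd in wire:  # the server replies in the order commands arrived
        matched[in_flight.popleft()] = f"reply-to-{cmd}"
    return matched


healthy = simulate(["GET a", "GET b", "GET c"], failing=None)
desynced = simulate(["GET a", "GET b", "GET c"], failing="GET a")
```

In the `desynced` run the slot for `GET a` consumes the reply to `GET b`, `GET b` consumes the reply to `GET c`, and `GET c` never gets a reply. Every message behind the failed write is shifted by one.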

Fail() signals the broken message's awaiter, but it does nothing about the wire state or the sibling messages whose ordering has now been corrupted. Tearing down the physical connection completes all in-flight awaiters with RedisConnectionException(InternalFailure) and forces a clean reconnect.
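
The teardown behavior can be sketched the same way. This is a toy Python model, not the library's implementation: `ToyConnection`, `ConnectionFailure`, and `record_connection_failed` are hypothetical names that only mirror the roles of PhysicalConnection, RedisConnectionException(InternalFailure), and RecordConnectionFailed.

```python
from collections import deque


class ConnectionFailure(Exception):
    """Hypothetical stand-in for RedisConnectionException(InternalFailure)."""


class ToyConnection:
    """Hypothetical model of a connection with a FIFO in-flight queue."""

    def __init__(self):
        self.in_flight = deque()  # messages awaiting replies, oldest first
        self.connected = True
        self.outcomes = {}        # message name -> exception it completed with

    def send(self, name, write):
        self.in_flight.append(name)
        try:
            write()
        except Exception as exc:
            # A failed write means the wire state is unknown: tear down.
            self.record_connection_failed(exc)

    def record_connection_failed(self, cause):
        self.connected = False
        # Drain the queue: every in-flight message completes with a failure
        # instead of waiting for a reply that may belong to someone else.
        while self.in_flight:
            name = self.in_flight.popleft()
            self.outcomes[name] = ConnectionFailure(str(cause))


def failing_write():
    raise MemoryError("simulated failure during serialization")


conn = ToyConnection()
conn.send("GET a", lambda: None)  # written successfully, awaiting a reply
conn.send("GET b", failing_write)  # throws partway through the write
```

After the second `send`, both `GET a` and `GET b` have completed with a `ConnectionFailure` and the queue is empty, so no later reply can be matched against a stale slot.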

Test plan

New WriteTo_PropagatesWriteImplException — fails on main, passes here. Asserts WriteTo rethrows non-RedisCommandException from WriteImpl.
New WriteTo_DoesNotWrapRedisCommandException — RedisCommandException continues to surface unwrapped (it's excluded from the catch filter).
New WriteFailure_TearsDownPhysicalConnection — end-to-end: a Message whose WriteImpl throws faults the awaiter with RedisConnectionException(InternalFailure) and raises a ConnectionFailed event with InternalFailure.
Verified the new tests fail on the unfixed source and pass with the fix applied (stash/unstash cycle).
dotnet build clean for src/StackExchange.Redis/StackExchange.Redis.csproj and tests/StackExchange.Redis.Tests/StackExchange.Redis.Tests.csproj on net10.0.

Risk

The change widens the conditions under which a connection is torn down — specifically, an in-process exception out of WriteImpl now triggers a reconnect where it previously left the connection up. That's the intended behavior (the connection was already in an undefined state), but it may surface as additional ConnectionFailed events for users who were silently hitting Fail() paths. In every case observed in the issues above, that reconnect is what users wanted to happen.

Related

Issue: Write-side queue desync: a throw inside Message.WriteTo leaves the response queue out-of-sync with the wire #3091
Existing reports: Under low-memory conditions, StackExchange.Redis can return incorrect results to simple StringGetAsync calls #2883, Unexpected response when container is near memory limit #2804, Receiving other thread HGETALL request result on high memory pressure and multiple threads #2919, Receiving rendered redis blob instead of string content on high memory pressure #2913
Adjacent: Sentinel connection error when using the highIntegrity = true configuration #2992, Error when enabling highIntegrity configuration with Redis Sentinel #2993 (rough edges around HighIntegrity itself, which is the current workaround)

Propagate exceptions out of Message.WriteTo so PhysicalBridge's outer write path can record a connection failure, and have HandleWriteException kill the PhysicalConnection via RecordConnectionFailed.

Without this, a write that throws partway through a frame (e.g. an OOM during serialization) leaves bytes on the wire while the response queue still considers the slot healthy. The next reply from the server then matches against the wrong in-flight message. This is the symptom seen in StackExchange#2883, StackExchange#2804, and StackExchange#2919, where commands return values intended for a different caller.

HighIntegrity mode mitigates the symptom by detecting the desync after the fact via per-message echo checksums; this change addresses the underlying cause for the write-side variant.

Adds three tests:

- WriteTo must rethrow non-RedisCommandException out of WriteImpl, so the outer bridge catch can act on it.
- RedisCommandException continues to surface unwrapped (it carries its own meaning and is excluded from the WriteTo catch filter).
- End-to-end: a Message whose WriteImpl throws faults the awaiter with a RedisConnectionException(InternalFailure) AND raises a ConnectionFailed event, proving the physical connection was torn down.

Pranish-Pantha · 2026-05-27T03:33:06Z

Are some tests flaky and need rerun of workflow to pass?

mgravell · 2026-05-27T06:59:01Z

Yes, the CI can be flaky, which is not ideal. There is a plan to help with that, but it needs a lot of effort. I'll take a look.

mgravell

LGTM

mgravell · 2026-05-27T15:20:29Z

This looks like a great find, thanks; merged

mgravell approved these changes May 27, 2026

mgravell merged commit 73cac33 into StackExchange:main May 27, 2026
7 of 9 checks passed

mgravell merged 1 commit into StackExchange:main from Pranish-Pantha:fix/write-failure-queue-desync
