Skip to content

Tear down connection on write failure to prevent queue desync#3092

Merged
mgravell merged 1 commit into
StackExchange:mainfrom
Pranish-Pantha:fix/write-failure-queue-desync
May 27, 2026
Merged

Tear down connection on write failure to prevent queue desync#3092
mgravell merged 1 commit into
StackExchange:mainfrom
Pranish-Pantha:fix/write-failure-queue-desync

Conversation

@Pranish-Pantha
Copy link
Copy Markdown
Contributor

Summary

  • Message.WriteTo now re-throws after calling Fail(), so PhysicalBridge's outer write path can act on the failure instead of seeing a phantom success.
  • PhysicalBridge.HandleWriteException takes the PhysicalConnection and calls RecordConnectionFailed(ConnectionFailureType.InternalFailure, ...), tearing the connection down and draining the response queue with failures.
  • All callers of HandleWriteException (sync, async, backlog, post-flush completion) thread the connection through.

Fixes the underlying cause behind the symptoms reported in #2883, #2804, #2919, #2913 — without depending on HighIntegrity mode (which catches the desync after the fact rather than preventing it).

See accompanying issue for the full root-cause writeup.

Why

If WriteImpl throws partway through a frame (OOM during serialization is the most-reported trigger, but any exception out of WriteImpl does it), the old WriteTo swallowed the exception. The bridge then saw WriteResult.Success, kept the message at the head of the in-flight queue, and left the socket connected. Any subsequent reply from the server matched against the wrong message.

Fail() signals the broken message's awaiter, but it does nothing about the wire state or the sibling messages whose ordering has now been corrupted. Tearing down the physical connection completes all in-flight awaiters with RedisConnectionException(InternalFailure) and forces a clean reconnect.

Test plan

  • New WriteTo_PropagatesWriteImplException — fails on main, passes here. Asserts WriteTo rethrows non-RedisCommandException from WriteImpl.
  • New WriteTo_DoesNotWrapRedisCommandExceptionRedisCommandException continues to surface unwrapped (it's excluded from the catch filter).
  • New WriteFailure_TearsDownPhysicalConnection — end-to-end: a Message whose WriteImpl throws faults the awaiter with RedisConnectionException(InternalFailure) and raises a ConnectionFailed event with InternalFailure.
  • Verified the new tests fail on the unfixed source and pass with the fix applied (stash/unstash cycle).
  • dotnet build clean for src/StackExchange.Redis/StackExchange.Redis.csproj and tests/StackExchange.Redis.Tests/StackExchange.Redis.Tests.csproj on net10.0.

Risk

The change widens the conditions under which a connection is torn down — specifically, an in-process exception out of WriteImpl now triggers a reconnect where it previously left the connection up. That's the intended behavior (the connection was already in an undefined state), but it may surface as additional ConnectionFailed events for users who were silently hitting Fail() paths. In every case observed in the issues above, that reconnect is what users wanted to happen.

Related

Propagate exceptions out of Message.WriteTo so PhysicalBridge's outer
write path can record a connection failure, and have HandleWriteException
kill the PhysicalConnection via RecordConnectionFailed.

Without this, a write that throws partway through a frame (e.g. an OOM
during serialization) leaves bytes on the wire while the response queue
still considers the slot healthy. The next reply from the server then
matches against the wrong in-flight message — the symptom seen in StackExchange#2883,
StackExchange#2804, and StackExchange#2919, where commands return values intended for a different
caller. HighIntegrity mode mitigates the symptom by detecting the desync
after the fact via per-message echo checksums; this change addresses the
underlying cause for the write-side variant.

Adds three tests:
 - WriteTo must rethrow non-RedisCommandException out of WriteImpl, so
   the outer bridge catch can act on it.
 - RedisCommandException continues to surface unwrapped (it carries its
   own meaning and is excluded from the WriteTo catch filter).
 - End-to-end: a Message whose WriteImpl throws faults the awaiter with
   a RedisConnectionException(InternalFailure) AND raises a
   ConnectionFailed event, proving the physical connection was torn down.
@Pranish-Pantha
Copy link
Copy Markdown
Contributor Author

Are some tests flaky and need rerun of workflow to pass?

@mgravell
Copy link
Copy Markdown
Collaborator

Yes, the CI can be flakey, which is not ideal. There is a plan to help thatz but it needs a lot of effort. I'll take a look.

Copy link
Copy Markdown
Collaborator

@mgravell mgravell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mgravell mgravell merged commit 73cac33 into StackExchange:main May 27, 2026
7 of 9 checks passed
@mgravell
Copy link
Copy Markdown
Collaborator

This looks like a great find, thanks; merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants