Tear down connection on write failure to prevent queue desync#3092
Merged
mgravell merged 1 commit intoMay 27, 2026
Merged
Conversation
Propagate exceptions out of Message.WriteTo so PhysicalBridge's outer write path can record a connection failure, and have HandleWriteException kill the PhysicalConnection via RecordConnectionFailed. Without this, a write that throws partway through a frame (e.g. an OOM during serialization) leaves bytes on the wire while the response queue still considers the slot healthy. The next reply from the server then matches against the wrong in-flight message — the symptom seen in StackExchange#2883, StackExchange#2804, and StackExchange#2919, where commands return values intended for a different caller. HighIntegrity mode mitigates the symptom by detecting the desync after the fact via per-message echo checksums; this change addresses the underlying cause for the write-side variant. Adds three tests: - WriteTo must rethrow non-RedisCommandException out of WriteImpl, so the outer bridge catch can act on it. - RedisCommandException continues to surface unwrapped (it carries its own meaning and is excluded from the WriteTo catch filter). - End-to-end: a Message whose WriteImpl throws faults the awaiter with a RedisConnectionException(InternalFailure) AND raises a ConnectionFailed event, proving the physical connection was torn down.
Contributor
Author
|
Are some tests flaky and need rerun of workflow to pass? |
Collaborator
|
Yes, the CI can be flakey, which is not ideal. There is a plan to help thatz but it needs a lot of effort. I'll take a look. |
Collaborator
|
This looks like a great find, thanks; merged |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Message.WriteTonow re-throws after callingFail(), soPhysicalBridge's outer write path can act on the failure instead of seeing a phantom success.PhysicalBridge.HandleWriteExceptiontakes thePhysicalConnectionand callsRecordConnectionFailed(ConnectionFailureType.InternalFailure, ...), tearing the connection down and draining the response queue with failures.HandleWriteException(sync, async, backlog, post-flush completion) thread the connection through.Fixes the underlying cause behind the symptoms reported in #2883, #2804, #2919, #2913 — without depending on
HighIntegritymode (which catches the desync after the fact rather than preventing it).See accompanying issue for the full root-cause writeup.
Why
If
WriteImplthrows partway through a frame (OOM during serialization is the most-reported trigger, but any exception out ofWriteImpldoes it), the oldWriteToswallowed the exception. The bridge then sawWriteResult.Success, kept the message at the head of the in-flight queue, and left the socket connected. Any subsequent reply from the server matched against the wrong message.Fail()signals the broken message's awaiter, but it does nothing about the wire state or the sibling messages whose ordering has now been corrupted. Tearing down the physical connection completes all in-flight awaiters withRedisConnectionException(InternalFailure)and forces a clean reconnect.Test plan
WriteTo_PropagatesWriteImplException— fails onmain, passes here. AssertsWriteTorethrows non-RedisCommandExceptionfromWriteImpl.WriteTo_DoesNotWrapRedisCommandException—RedisCommandExceptioncontinues to surface unwrapped (it's excluded from the catch filter).WriteFailure_TearsDownPhysicalConnection— end-to-end: aMessagewhoseWriteImplthrows faults the awaiter withRedisConnectionException(InternalFailure)and raises aConnectionFailedevent withInternalFailure.dotnet buildclean forsrc/StackExchange.Redis/StackExchange.Redis.csprojandtests/StackExchange.Redis.Tests/StackExchange.Redis.Tests.csprojonnet10.0.Risk
The change widens the conditions under which a connection is torn down — specifically, an in-process exception out of
WriteImplnow triggers a reconnect where it previously left the connection up. That's the intended behavior (the connection was already in an undefined state), but it may surface as additionalConnectionFailedevents for users who were silently hittingFail()paths. In every case observed in the issues above, that reconnect is what users wanted to happen.Related
Message.WriteToleaves the response queue out-of-sync with the wire #3091HighIntegrityitself, which is the current workaround)