Fix flaky TestServer_RedundantUpdateSuppression. #8839
Pranjali-2501 merged 3 commits into grpc:master from TestServer_RedundantUpdateSuppression

Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #8839      +/-   ##
==========================================
- Coverage   83.22%   83.16%   -0.07%
==========================================
  Files         417      414       -3
  Lines       32920    32776     -144
==========================================
- Hits        27397    27257     -140
+ Misses       4108     4079      -29
- Partials     1415     1440      +25
@Pranjali-2501 have you verified that the flakiness is indeed caused by the proto comparison and that the changes in this PR fix the flakiness? If so, can you mention the verification performed in the PR description?
No, I wasn't able to reproduce the exact flake locally or on forge. However, I have verified that this change maintains correctness for valid updates and doesn't introduce any regressions in the existing test suite. Since I'm not sure whether the flake is caused by this, we can monitor the flaky-test tracker on GitHub for some time after this PR gets merged.
I was able to repro the failure in 2*10^5 runs on forge. Shared the command offline. Can you use debug logs to verify that the serialized bytes are different but the comparison using proto.Equal still reports the xDS resources as equal? From the logs, it seems to me like a race in the channel state propagation in TestServer_RedundantUpdateSuppression.
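(For context: proto.Equal compares messages field by field, while the wire encoding is not canonical, so two semantically equal messages can marshal to different bytes. A minimal self-contained sketch of that distinction, not from this PR, assuming only the standard google.golang.org/protobuf packages:)

```go
package main

import (
	"bytes"
	"fmt"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/structpb"
)

func main() {
	// Two semantically identical messages that contain a map field.
	fields := map[string]interface{}{"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
	m1, err := structpb.NewStruct(fields)
	if err != nil {
		panic(err)
	}
	m2, err := structpb.NewStruct(fields)
	if err != nil {
		panic(err)
	}

	// proto.Equal compares the messages field by field: always true here.
	fmt.Println("proto.Equal:", proto.Equal(m1, m2))

	// The wire encoding is not canonical (map entry order is unspecified
	// unless deterministic marshaling is requested), so byte comparison of
	// the two encodings may differ from run to run.
	b1, _ := proto.Marshal(m1)
	b2, _ := proto.Marshal(m2)
	fmt.Println("bytes.Equal:", bytes.Equal(b1, b2))
}
```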
I have updated the PR to remove the server-side proto.Equal changes, as they turned out to be unnecessary. The root cause of the flake was a race condition in the test itself: it was checking for connectivity state changes before the client connection had fully stabilized. I added a testutils.AwaitState call so the test waits for the channel to reach READY before the state-monitoring goroutine starts.
arjan-bal left a comment
LGTM, nice find! Please address the remaining comment before merging.
```go
testutils.AwaitState(ctx, t, cc, connectivity.Ready)
errCh := make(chan error, 1)
go func() {
	prev := connectivity.Ready // We know we are READY since we just did an RPC.
```
We should remove this trailing comment as it is technically incorrect. Once an LB policy produces a READY picker, two serial events occur:
- The picker updates (unblocking queued RPCs).
- The channel state updates.
This test appears to hit a race condition where it inspects the state between these two steps, observing a successful RPC while the channel state is still CONNECTING.
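(A hedged illustration of that window, as a fragment rather than the actual test code: it assumes an already-dialed *grpc.ClientConn cc and a test-service stub client, both illustrative names.)

```go
// A unary RPC can succeed as soon as the READY picker is installed.
if _, err := client.EmptyCall(ctx, &testpb.Empty{}); err != nil {
	t.Fatalf("EmptyCall() failed: %v", err)
}

// The channel's reported state is updated in a separate, later step, so
// this read may still observe CONNECTING for a brief window even though
// the RPC above succeeded.
if state := cc.GetState(); state != connectivity.Ready {
	t.Logf("channel state is %v immediately after a successful RPC", state)
}
```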
Removed the trailing comment.
Fixes #7713

The `TestServer_RedundantUpdateSuppression` test was flaky because it began monitoring ClientConn state transitions before the connection had fully stabilized in the expected Ready state. This created a race condition where the monitoring loop could capture initial setup transitions (e.g., Connecting) and falsely report them as unexpected failures caused by the redundant update.

Successfully ran the test on forge 1 million times without any flake.

RELEASE NOTES: N/A
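(For reference, a minimal sketch of the fixed monitoring pattern, not the literal test code: it assumes an already-dialed *grpc.ClientConn cc, a test context ctx with a short deadline, and grpc-go's internal testutils package for AwaitState.)

```go
// Wait until the channel itself reports READY before starting to watch
// for unexpected transitions. This closes the gap between "a READY picker
// is installed" and "the channel state is READY".
testutils.AwaitState(ctx, t, cc, connectivity.Ready)

errCh := make(chan error, 1)
go func() {
	prev := connectivity.Ready
	for {
		// WaitForStateChange blocks until the state leaves prev or ctx
		// expires; it returns false on ctx expiry.
		if !cc.WaitForStateChange(ctx, prev) {
			errCh <- nil // no unexpected transition observed
			return
		}
		prev = cc.GetState()
		if prev != connectivity.Ready {
			errCh <- fmt.Errorf("unexpected state transition to %v", prev)
			return
		}
	}
}()

// ... push the redundant xDS update here ...

if err := <-errCh; err != nil {
	t.Fatal(err)
}
```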