
Fix flaky TestServer_RedundantUpdateSuppression. #8839

Merged
Pranjali-2501 merged 3 commits into grpc:master from Pranjali-2501:issue_7713 on Jan 19, 2026

Conversation

Pranjali-2501 (Contributor) commented Jan 15, 2026

Fixes #7713

The TestServer_RedundantUpdateSuppression test was flaky because it began monitoring ClientConn state transitions before the connection had fully stabilized in the expected Ready state.

This created a race condition where the monitoring loop could capture initial setup transitions (e.g., Connecting) and falsely report them as unexpected failures caused by the redundant update.

Successfully ran the test on forge 1 million times without a single flake.

RELEASE NOTES: N/A

codecov bot commented Jan 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.16%. Comparing base (629ef39) to head (f288aaf).
⚠️ Report is 10 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8839      +/-   ##
==========================================
- Coverage   83.22%   83.16%   -0.07%     
==========================================
  Files         417      414       -3     
  Lines       32920    32776     -144     
==========================================
- Hits        27397    27257     -140     
+ Misses       4108     4079      -29     
- Partials     1415     1440      +25     

see 28 files with indirect coverage changes


arjan-bal (Contributor) commented Jan 16, 2026

@Pranjali-2501 have you verified that the flakiness is indeed caused by the proto comparison and that the changes in this PR fix the flakiness? If so, can you mention the verification performed in the PR description?

@arjan-bal arjan-bal assigned Pranjali-2501 and unassigned arjan-bal Jan 16, 2026
Pranjali-2501 (Contributor, Author) commented Jan 16, 2026

> @Pranjali-2501 have you verified that the flakiness is indeed caused by the proto comparison and that the changes in this PR fix the flakiness? If so, can you mention the verification performed in the PR description?

No, I wasn't able to reproduce the exact flake, either locally or on forge.
However, the test failure logs suggest the issue arises when the client receives a redundant update, which can happen because the server compares serialized resources with bytes.Equal.

Why bytes.Equal can cause the flake: proto.Marshal is not deterministic. In particular, Go randomizes map iteration order at runtime, so identical resources can marshal to different byte sequences in different runs.
When this happens, bytes.Equal returns false for semantically identical data. Switching to proto.Equal fixes this by performing a semantic comparison rather than a byte-for-byte one.

I have verified that this change maintains correctness for valid updates and doesn't introduce any regressions in the existing test suite.

Since I'm not certain the flake is caused by this, we can monitor the flaky test on GitHub for some time after this PR gets merged.

arjan-bal (Contributor) commented Jan 16, 2026

I was able to repro the failure in 2*10^5 runs on forge. I've shared the command offline. Can you use debug logs to verify that the serialized bytes are different but the comparison using proto.Equal returns true?

From the logs, it seems to me like a race in the channel state propagation.

@arjan-bal arjan-bal assigned Pranjali-2501 and unassigned arjan-bal Jan 16, 2026
Pranjali-2501 changed the title from "xds: use proto.Equal for xDS resources equality check" to "Fix flaky TestServer_RedundantUpdateSuppression" on Jan 16, 2026
Pranjali-2501 (Contributor, Author) commented:

> I was able to repro the failure in 2*10^5 runs on forge. Shared the command offline. Can you use debug logs to verify that the serialized bytes are different but the comparison using proto.Equal returns true?
>
> From the logs, it seems to me like a race in the channel state propagation.

I have updated the PR to remove the server-side proto.Equal changes, as they turned out to be unnecessary.

The root cause of the flake was a race condition in the test itself: it checked for connectivity state changes before the client connection had fully stabilized. I added testutils.AwaitState(ctx, t, cc, connectivity.Ready) so that monitoring starts only after the connection is Ready. The test now passes reliably without modifying the server logic.

arjan-bal (Contributor) left a comment


LGTM, nice find! Please address the remaining comment before merging.

Comment thread: xds/server_resource_ext_test.go (Outdated)

    testutils.AwaitState(ctx, t, cc, connectivity.Ready)
    errCh := make(chan error, 1)
    go func() {
        prev := connectivity.Ready // We know we are READY since we just did an RPC.
Contributor

We should remove this trailing comment as it is technically incorrect. Once an LB policy produces a READY picker, two serial events occur:

  1. The picker updates (unblocking queued RPCs).
  2. The channel state updates.

This test appears to hit a race condition where it inspects the state between these two steps—observing a successful RPC while the channel state is still CONNECTING.
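The two serial steps described above can be modeled in a few lines. channel and goReady below are hypothetical stand-ins, not gRPC APIs; the point is that an observer who inspects state between step 1 and step 2 sees a successful RPC alongside a stale CONNECTING state.

```go
package main

import "fmt"

// channel models the two serial publications described above: the picker
// (which gates RPCs) is updated first, the externally visible state second.
type channel struct {
	pickerReady bool
	state       string // "CONNECTING" or "READY"
}

func (c *channel) goReady(observe func(*channel)) {
	c.pickerReady = true // step 1: queued RPCs can now succeed
	observe(c)           // a test inspecting the channel at this instant...
	c.state = "READY"    // step 2: state becomes visible as READY
}

func main() {
	c := &channel{state: "CONNECTING"}
	c.goReady(func(c *channel) {
		// ...sees a successful RPC but a stale CONNECTING state.
		fmt.Printf("RPC possible: %v, observed state: %s\n", c.pickerReady, c.state)
	})
	// prints: RPC possible: true, observed state: CONNECTING
}
```

This is why "we just did an RPC, so the state is READY" does not hold, and why the test instead waits for the READY state explicitly.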

Contributor Author

Removed the trailing comment.

@arjan-bal arjan-bal assigned Pranjali-2501 and unassigned arjan-bal Jan 19, 2026
Pranjali-2501 merged commit bd4444a into grpc:master on Jan 19, 2026
14 checks passed
mbissa pushed a commit to mbissa/grpc-go that referenced this pull request Feb 16, 2026

Successfully merging this pull request may close these issues:

Flaky Test: 4/100K: Test/Server_RedundantUpdateSuppression