Skip to content

Fix gen_write_load error on MOVED/ASK during atomic-slot-migration tests#15016

Merged
sundb merged 23 commits intoredis:unstablefrom
vitahlin:vtest
Apr 15, 2026
Merged

Fix gen_write_load error on MOVED/ASK during atomic-slot-migration tests#15016
sundb merged 23 commits intoredis:unstablefrom
vitahlin:vtest

Conversation

@vitahlin
Copy link
Copy Markdown
Contributor

@vitahlin vitahlin commented Apr 8, 2026

Problem

The gen_write_load helper can exit with an unhandled MOVED reply during atomic-slot-migration, specifically in Migration will be successful after fail points are cleared, which leaks Tcl stack traces into test output even though the test itself passes.

CleanShot 2026-04-08 at 23 20 35@2x

Changes

  1. Handle MOVED/ASK responses gracefully during pipeline reply reading — exit cleanly instead of crashing to stderr.
  2. Also fix the reply count tracking so the final drain loop only reads the actual number of pending replies.

Test

After commit :
CleanShot 2026-04-08 at 23 24 04@2x


Note

Low Risk
Changes are limited to Tcl test helpers and a few cluster tests; the main risk is altering load-generator behavior and masking unexpected redirect errors outside migration scenarios.

Overview
Prevents the tests/helpers/gen_write_load.tcl load generator from crashing when a cluster slot migration triggers MOVED/ASK replies while draining pipelined SET responses.

Adds an ignore_error_reply flag plumbed through start_write_load and updates atomic-slot-migration.tcl to enable it for migration scenarios; also fixes reply-drain accounting by resetting the 500-reply batch counter so the final drain reads only outstanding replies.

Reviewed by Cursor Bugbot for commit 57b6f6d. Bugbot is set up for automated code reviews on this repo. Configure here.

@augmentcode
Copy link
Copy Markdown

augmentcode Bot commented Apr 8, 2026

🤖 Augment PR Summary

Summary: Makes gen_write_load exit cleanly when pipelined reads encounter MOVED/ASK during atomic slot migration.
Details: Adds redirection-aware error handling around $r read and fixes pending-reply counting so the final drain only reads outstanding replies.

🤖 Was this summary useful? React with 👍 or 👎

Copy link
Copy Markdown

@augmentcode augmentcode Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 2 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

Comment thread tests/helpers/gen_write_load.tcl Outdated
if {[string match {MOVED*} $err] || [string match {ASK*} $err]} {
exit 0
}
error $err
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests/helpers/gen_write_load.tcl:57 — Re-throwing with error $err after catch likely discards the original errorInfo/errorCode from $r read, which can make diagnosing non-MOVED failures harder. Consider preserving the original error context when propagating unexpected errors.

Severity: low

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Comment thread tests/helpers/gen_write_load.tcl Outdated
Comment thread tests/helpers/gen_write_load.tcl Outdated
# MOVED/ASK means the slot migrated to another node,
# continuing to write is pointless, exit gracefully.
if {[string match {MOVED*} $err] || [string match {ASK*} $err]} {
exit 0
Copy link
Copy Markdown
Collaborator

@sundb sundb Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't skip instead of exist? if exit directly, wouldn't ASM tests starting gen_write_load always cause it to exit?

CC @tezc

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tcl runs this proc in a separate process, so I guess it is okay to exit here. It shouldn't fail the caller test. Also, if it gets -MOVED once, it will keep getting it even if we retry.

My only concern is that caller won't be aware of this behavior and he may not realize that the load process was killed (in ASM tests or any other tests). Maybe -MOVED is something not expected in the test...

As @vitahlin mentioned, looks like there is only one test that calls gen_write_load and triggers ASM: Migration will be successful after fail points are cleared

Maybe we should just add catch block for that specific test, similar to this:

set load_handle0 [start_write_load "127.0.0.1" $port 100 $key 0 5]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your suggestion is both precise and elegant. I'll add a specific catch block for that test and use cluster_load as the condition

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vitahlin I meant adding catch block in that specific test. Perhaps we can add following lines here: link

# Throws -MOVED error once asm is completed, catch block will ignore it.
catch {
    set load_handle [start_write_load "127.0.0.1" [get_port 1] 100 $slot0_key]
}

I feel like this is a simpler solution than adding a parameter to start_write_load. Because we need to handle -MOVED case in a few tests only, so perhaps we can do that explicitly in the test just by covering start_write_load with a catch block? wydt?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tezc this is a subprocess, not sure catch can catch it.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'm not a TCL expert. If that is the case, these catch blocks don't make any sense as well 🤦‍♂️ I'll check these later on.

@vitahlin @sundb sorry for the trouble, I was mistaken about how the subprocesses work in TCL.

Rethinking this again, now I feel closer to the first version. We can just skip/ignore -MOVED reply implicitly without a parameter. (No need to call exit 0 as well. Killing the load process after moving slots could be caller's responsibility). Leaving the final decision to you guys.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback.

I’ve re-checked the tests and I completely agree—it’s much safer to keep the load process lifecycle under the caller’s control. And it exist in line:

stop_write_load $load_handle

I also see your point about gen_write_load. Given that start_write_load is a core utility for many non-cluster tests, I’m concerned that applying global MOVED/ASK tolerance might inadvertently affect other test suites.

To stay on the safe side, I’ve decided to keep the default behavior strict and only enable the redirection tolerance within the specific ASM migration test path. This way, we get the fix we need for this case without introducing potential side effects elsewhere.

I am happy to revisit the generalization later. However, if you believe the implicit handling is still the better approach despite my concerns, I’m more than happy to follow your lead on this.

Comment thread tests/unit/cluster/atomic-slot-migration.tcl Outdated
@vitahlin
Copy link
Copy Markdown
Contributor Author

In Simple slot migration with write load, the previous catch around start_write_load could not handle MOVED from the background process, so enabling cluster_load=1` keeps write load stable during slot redirects.

Comment thread tests/helpers/gen_write_load.tcl Outdated
Comment thread tests/helpers/gen_write_load.tcl Outdated
vitahlin and others added 2 commits April 14, 2026 09:40
Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 4 potential issues.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 28474a7. Configure here.

Comment thread tests/helpers/gen_write_load.tcl Outdated
Comment thread tests/helpers/gen_write_load.tcl Outdated
Comment thread tests/helpers/gen_write_load.tcl Outdated
Comment thread tests/helpers/gen_write_load.tcl Outdated
error $err
}
}
set count 0
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is missed by #14946?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#14946 didn't handle the count value, the exception handling problem is an existing legacy issue

Comment thread tests/helpers/gen_write_load.tcl Outdated
Comment on lines +87 to +90
set ignore_error_reply 0
if {[llength $argv] > 7} {
set ignore_error_reply [lindex $argv 7]
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the $argv[7] always have a default value?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@sundb
Copy link
Copy Markdown
Collaborator

sundb commented Apr 14, 2026

@vitahlin pls run a fully CI again, thx.

@vitahlin
Copy link
Copy Markdown
Contributor Author

The latest commit use tcl opts to preserve original errorInfo, keeps the error info like before:
CleanShot 2026-04-14 at 15 36 02@2x

If not use opts, the error info like that:
CleanShot 2026-04-14 at 15 41 30@2x

@vitahlin
Copy link
Copy Markdown
Contributor Author

vitahlin commented Apr 14, 2026

Fully daily CI: https://github.com/vitahlin/redis/actions/runs/24393004886

The test-ubuntu-latest job failed with the following error, but it seems unrelated to this PR

!!! WARNING The following tests failed:

*** [err]: HOTKEYS detection with biased key access, sample ratio = 1000 in tests/unit/hotkeys.tcl
Expected '5' to be more than '5' (context: type eval line 58 cmd {assert_morethan $res 5} proc ::test)
Cleanup: may take some time... OK
Error: Process completed with exit code 1.

@sundb sundb changed the title Test fix gen_write_load crash on MOVED during atomic-slot-migration tests Fix gen_write_load error on MOVED/ASK during atomic-slot-migration tests Apr 15, 2026
@sundb sundb merged commit 3cd4642 into redis:unstable Apr 15, 2026
25 of 26 checks passed
@vitahlin vitahlin deleted the vtest branch April 15, 2026 01:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants