Fix gen_write_load error on MOVED/ASK during atomic-slot-migration tests#15016
Fix gen_write_load error on MOVED/ASK during atomic-slot-migration tests#15016sundb merged 23 commits intoredis:unstablefrom
Conversation
🤖 Augment PR SummarySummary: Makes 🤖 Was this summary useful? React with 👍 or 👎 |
| if {[string match {MOVED*} $err] || [string match {ASK*} $err]} { | ||
| exit 0 | ||
| } | ||
| error $err |
There was a problem hiding this comment.
tests/helpers/gen_write_load.tcl:57 — Re-throwing with error $err after catch likely discards the original errorInfo/errorCode from $r read, which can make diagnosing non-MOVED failures harder. Consider preserving the original error context when propagating unexpected errors.
Severity: low
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| # MOVED/ASK means the slot migrated to another node, | ||
| # continuing to write is pointless, exit gracefully. | ||
| if {[string match {MOVED*} $err] || [string match {ASK*} $err]} { | ||
| exit 0 |
There was a problem hiding this comment.
shouldn't skip instead of exist? if exit directly, wouldn't ASM tests starting gen_write_load always cause it to exit?
CC @tezc
There was a problem hiding this comment.
tcl runs this proc in a separate process, so I guess it is okay to exit here. It shouldn't fail the caller test. Also, if it gets -MOVED once, it will keep getting it even if we retry.
My only concern is that caller won't be aware of this behavior and he may not realize that the load process was killed (in ASM tests or any other tests). Maybe -MOVED is something not expected in the test...
As @vitahlin mentioned, looks like there is only one test that calls gen_write_load and triggers ASM: Migration will be successful after fail points are cleared
Maybe we should just add catch block for that specific test, similar to this:
There was a problem hiding this comment.
Your suggestion is both precise and elegant. I'll add a specific catch block for that test and use cluster_load as the condition
There was a problem hiding this comment.
@vitahlin I meant adding catch block in that specific test. Perhaps we can add following lines here: link
# Throws -MOVED error once asm is completed, catch block will ignore it.
catch {
set load_handle [start_write_load "127.0.0.1" [get_port 1] 100 $slot0_key]
}
I feel like this is a simpler solution than adding a parameter to start_write_load. Because we need to handle -MOVED case in a few tests only, so perhaps we can do that explicitly in the test just by covering start_write_load with a catch block? wydt?
There was a problem hiding this comment.
@tezc this is a subprocess, not sure catch can catch it.
There was a problem hiding this comment.
Okay, I'm not a TCL expert. If that is the case, these catch blocks don't make any sense as well 🤦♂️ I'll check these later on.
@vitahlin @sundb sorry for the trouble, I was mistaken about how the subprocesses work in TCL.
Rethinking this again, now I feel closer to the first version. We can just skip/ignore -MOVED reply implicitly without a parameter. (No need to call exit 0 as well. Killing the load process after moving slots could be caller's responsibility). Leaving the final decision to you guys.
There was a problem hiding this comment.
Thanks for the feedback.
I’ve re-checked the tests and I completely agree—it’s much safer to keep the load process lifecycle under the caller’s control. And it exist in line:
I also see your point about gen_write_load. Given that start_write_load is a core utility for many non-cluster tests, I’m concerned that applying global MOVED/ASK tolerance might inadvertently affect other test suites.
To stay on the safe side, I’ve decided to keep the default behavior strict and only enable the redirection tolerance within the specific ASM migration test path. This way, we get the fix we need for this case without introducing potential side effects elsewhere.
I am happy to revisit the generalization later. However, if you believe the implicit handling is still the better approach despite my concerns, I’m more than happy to follow your lead on this.
|
In |
Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 4 potential issues.
Reviewed by Cursor Bugbot for commit 28474a7. Configure here.
| error $err | ||
| } | ||
| } | ||
| set count 0 |
There was a problem hiding this comment.
#14946 didn't handle the count value, the exception handling problem is an existing legacy issue
| set ignore_error_reply 0 | ||
| if {[llength $argv] > 7} { | ||
| set ignore_error_reply [lindex $argv 7] | ||
| } |
There was a problem hiding this comment.
does the $argv[7] always have a default value?
|
@vitahlin pls run a fully CI again, thx. |
|
Fully daily CI: https://github.com/vitahlin/redis/actions/runs/24393004886 The !!! WARNING The following tests failed:
*** [err]: HOTKEYS detection with biased key access, sample ratio = 1000 in tests/unit/hotkeys.tcl
Expected '5' to be more than '5' (context: type eval line 58 cmd {assert_morethan $res 5} proc ::test)
Cleanup: may take some time... OK
Error: Process completed with exit code 1. |



Problem
The
gen_write_loadhelper can exit with an unhandledMOVEDreply duringatomic-slot-migration, specifically inMigration will be successful after fail points are cleared, which leaks Tcl stack traces into test output even though the test itself passes.Changes
Test
After commit :

Note
Low Risk
Changes are limited to Tcl test helpers and a few cluster tests; the main risk is altering load-generator behavior and masking unexpected redirect errors outside migration scenarios.
Overview
Prevents the
tests/helpers/gen_write_load.tclload generator from crashing when a cluster slot migration triggersMOVED/ASKreplies while draining pipelinedSETresponses.Adds an
ignore_error_replyflag plumbed throughstart_write_loadand updatesatomic-slot-migration.tclto enable it for migration scenarios; also fixes reply-drain accounting by resetting the 500-reply batch counter so the final drain reads only outstanding replies.Reviewed by Cursor Bugbot for commit 57b6f6d. Bugbot is set up for automated code reviews on this repo. Configure here.