Fix gen_write_load error on MOVED/ASK during atomic-slot-migration tests by vitahlin · Pull Request #15016 · redis/redis

vitahlin · 2026-04-08T15:24:52Z

Problem

The gen_write_load helper can exit with an unhandled MOVED reply during the atomic-slot-migration tests, specifically in "Migration will be successful after fail points are cleared". This leaks Tcl stack traces into the test output even though the test itself passes.

Changes

Handle MOVED/ASK responses gracefully during pipeline reply reading: exit cleanly instead of crashing to stderr.
Also fix the reply count tracking so the final drain loop only reads the actual number of pending replies.
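
The reply-count fix can be illustrated with a minimal sketch. It assumes the helper pipelines writes and drains replies in fixed-size batches; `fake_write`, `fake_read`, and `run_load` are hypothetical stand-ins, not code from gen_write_load.tcl.

```tcl
# Hypothetical stand-in for a pipelined connection: it only tracks
# how many replies are still outstanding.
set pending 0
proc fake_write {} { incr ::pending }
proc fake_read {} {
    if {$::pending <= 0} { error "read with no pending reply" }
    incr ::pending -1
}

# Send $total pipelined writes, draining replies in batches of $batch.
proc run_load {total batch} {
    set count 0
    for {set i 0} {$i < $total} {incr i} {
        fake_write
        incr count
        if {$count == $batch} {
            for {set j 0} {$j < $batch} {incr j} { fake_read }
            # Reset the batch counter; without this the final drain
            # would try to read replies that were already consumed.
            set count 0
        }
    }
    # Final drain: read only the replies that are still outstanding.
    for {set j 0} {$j < $count} {incr j} { fake_read }
    return $::pending
}
```

If the `set count 0` line is dropped, the inner batch read never fires again after the first batch and the final drain asks for more replies than one batch cycle left outstanding, which is the kind of accounting drift the change is meant to rule out.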

Test

After commit :

Note

Low Risk
Changes are limited to Tcl test helpers and a few cluster tests; the main risk is altering load-generator behavior and masking unexpected redirect errors outside migration scenarios.

Overview
Prevents the tests/helpers/gen_write_load.tcl load generator from crashing when a cluster slot migration triggers MOVED/ASK replies while draining pipelined SET responses.

Adds an ignore_error_reply flag plumbed through start_write_load and updates atomic-slot-migration.tcl to enable it for migration scenarios; also fixes reply-drain accounting by resetting the 500-reply batch counter so the final drain reads only outstanding replies.
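
A rough sketch of opt-in redirect tolerance follows. It assumes the flag only tolerates MOVED/ASK replies and leaves every other error fatal; `is_redirect` and `read_reply` are hypothetical names used for illustration and do not come from the PR.

```tcl
# Return 1 when $err looks like a cluster redirection reply.
proc is_redirect {err} {
    expr {[string match {MOVED*} $err] || [string match {ASK*} $err]}
}

# Hypothetical wrapper: evaluate $script in the caller's scope. Swallow
# redirection errors only when $ignore_error_reply is true; rethrow
# everything else with its original error options.
proc read_reply {script ignore_error_reply} {
    if {[catch {uplevel 1 $script} err opts]} {
        if {$ignore_error_reply && [is_redirect $err]} {
            return ""
        }
        return -options $opts $err
    }
    return $err
}
```

With the flag at 0 the wrapper behaves like a plain read, so callers outside migration scenarios would still see redirect errors.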

Reviewed by Cursor Bugbot for commit 57b6f6d.

augmentcode · 2026-04-08T15:27:13Z

🤖 Augment PR Summary

Summary: Makes gen_write_load exit cleanly when pipelined reads encounter MOVED/ASK during atomic slot migration.
Details: Adds redirection-aware error handling around $r read and fixes pending-reply counting so the final drain only reads outstanding replies.

augmentcode

Review completed. 2 suggestions posted.

augmentcode · 2026-04-08T15:27:14Z

+                    if {[string match {MOVED*} $err] || [string match {ASK*} $err]} {
+                        exit 0
+                    }
+                    error $err


tests/helpers/gen_write_load.tcl:57 — Re-throwing with error $err after catch likely discards the original errorInfo/errorCode from $r read, which can make diagnosing non-MOVED failures harder. Consider preserving the original error context when propagating unexpected errors.

Severity: low
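
The difference between the two ways of rethrowing can be sketched as below. `failing_read` is a hypothetical stand-in for a `$r read` call that fails with an unexpected error; the `catch ... err opts` plus `return -options` idiom is standard Tcl (8.5 and later).

```tcl
# Hypothetical stand-in for "$r read" failing with an unexpected error.
proc failing_read {} {
    error "ERR something unexpected"
}

# Rethrow with plain "error": the stack trace restarts at this call site.
proc rethrow_plain {} {
    if {[catch {failing_read} err]} {
        error $err
    }
}

# Rethrow with the options dictionary captured by catch: the original
# -errorinfo and -errorcode travel with the error.
proc rethrow_with_opts {} {
    if {[catch {failing_read} err opts]} {
        return -options $opts $err
    }
}

catch {rethrow_plain} msg1 o1
catch {rethrow_with_opts} msg2 o2
```

Comparing `[dict get $o1 -errorinfo]` with `[dict get $o2 -errorinfo]` shows that only the second trace still points into `failing_read`.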

sundb · 2026-04-09T00:39:03Z

+                    # MOVED/ASK means the slot migrated to another node,
+                    # continuing to write is pointless, exit gracefully.
+                    if {[string match {MOVED*} $err] || [string match {ASK*} $err]} {
+                        exit 0


Shouldn't this skip instead of exit? If it exits directly, wouldn't every ASM test that starts gen_write_load cause it to exit?

CC @tezc

Tcl runs this proc in a separate process, so I guess it is okay to exit here. It shouldn't fail the caller test. Also, once it gets -MOVED, it will keep getting it even if we retry.

My only concern is that the caller won't be aware of this behavior and may not realize that the load process was killed (in ASM tests or any other tests). Maybe -MOVED is not something the test expects...

As @vitahlin mentioned, it looks like there is only one test that calls gen_write_load and triggers ASM: "Migration will be successful after fail points are cleared".

Maybe we should just add a catch block for that specific test, similar to this:

redis/tests/unit/cluster/atomic-slot-migration.tcl

Line 586 in e97fe24

set load_handle0 [start_write_load "127.0.0.1" $port 100 $key 0 5]

Your suggestion is both precise and elegant. I'll add a specific catch block for that test and use cluster_load as the condition.

@vitahlin I meant adding a catch block in that specific test. Perhaps we can add the following lines there:

# Throws -MOVED error once asm is completed, catch block will ignore it.
catch {
    set load_handle [start_write_load "127.0.0.1" [get_port 1] 100 $slot0_key]
}

I feel like this is a simpler solution than adding a parameter to start_write_load. We only need to handle the -MOVED case in a few tests, so perhaps we can do that explicitly in the test by wrapping start_write_load in a catch block? wdyt?

@tezc this is a subprocess, not sure catch can catch it.

Okay, I'm not a Tcl expert. If that is the case, these catch blocks don't make any sense either 🤦‍♂️ I'll check these later on.

@vitahlin @sundb sorry for the trouble, I was mistaken about how subprocesses work in Tcl.

Rethinking this, I now feel closer to the first version. We can just skip/ignore the -MOVED reply implicitly, without a parameter. (No need to call exit 0 either. Killing the load process after moving slots could be the caller's responsibility.) Leaving the final decision to you guys.

Thanks for the feedback.

I’ve re-checked the tests and I completely agree: it’s much safer to keep the load process lifecycle under the caller’s control. The caller already stops it at this line:

redis/tests/unit/cluster/atomic-slot-migration.tcl

Line 817 in e1d35ac

stop_write_load $load_handle

I also see your point about gen_write_load. Given that start_write_load is a core utility for many non-cluster tests, I’m concerned that applying global MOVED/ASK tolerance might inadvertently affect other test suites.

To stay on the safe side, I’ve decided to keep the default behavior strict and only enable the redirection tolerance within the specific ASM migration test path. This way, we get the fix we need for this case without introducing potential side effects elsewhere.

I am happy to revisit the generalization later. However, if you believe the implicit handling is still the better approach despite my concerns, I’m more than happy to follow your lead on this.

vitahlin · 2026-04-13T09:01:24Z

In "Simple slot migration with write load", the previous catch around start_write_load could not handle MOVED from the background process, so enabling cluster_load=1 keeps the write load stable during slot redirects.

Co-authored-by: debing.sun <debing.sun@redis.com>

cursor

Cursor Bugbot has reviewed your changes and found 4 potential issues.

Reviewed by Cursor Bugbot for commit 28474a7.

sundb · 2026-04-14T02:13:58Z

+                    error $err
+                }
            }
+            set count 0


Was this missed by #14946?

#14946 didn't handle the count value. The exception-handling problem is an existing legacy issue.

sundb · 2026-04-14T06:05:33Z

+set ignore_error_reply 0
+if {[llength $argv] > 7} {
+    set ignore_error_reply [lindex $argv 7]
+}


Does $argv[7] always have a default value?
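
As a small illustration of the optional-argument pattern in the diff above: `lindex` returns an empty string for an out-of-range index, so an explicit length check is what supplies a usable default. `optional_arg` is a hypothetical helper, not part of the PR.

```tcl
# Hypothetical helper mirroring the optional-argument pattern: return the
# element at $index, or $default when the caller did not pass it.
proc optional_arg {argv index default} {
    if {[llength $argv] > $index} {
        return [lindex $argv $index]
    }
    return $default
}

# Seven positional arguments only: the eighth falls back to the default.
set ignore_error_reply [optional_arg {host port 100 key 0 5 0} 7 0]
```

Under this sketch, callers that never pass the eighth argument keep the strict behavior.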

sundb · 2026-04-14T06:30:27Z

@vitahlin please run the full CI again, thanks.

vitahlin · 2026-04-14T07:43:11Z

The latest commit uses Tcl opts to preserve the original errorInfo, which keeps the error info as it was before:

Without opts, the error info looks like this:

vitahlin · 2026-04-14T16:00:26Z

Full daily CI: https://github.com/vitahlin/redis/actions/runs/24393004886

The test-ubuntu-latest job failed with the following error, but it seems unrelated to this PR:

!!! WARNING The following tests failed:

*** [err]: HOTKEYS detection with biased key access, sample ratio = 1000 in tests/unit/hotkeys.tcl
Expected '5' to be more than '5' (context: type eval line 58 cmd {assert_morethan $res 5} proc ::test)
Cleanup: may take some time... OK
Error: Process completed with exit code 1.

vitahlin added 6 commits April 8, 2026 22:50

fix --only

b76024e

fx

91d07aa

Revert "fx"

0cb0da6

This reverts commit 91d07aa.

fx

eabb4af

Revert "fix --only"

4c732fb

This reverts commit b76024e.

Merge branch 'unstable' into vtest

0cd2801

augmentcode Bot reviewed Apr 8, 2026

sundb reviewed Apr 9, 2026

vitahlin added 5 commits April 10, 2026 17:34

Reapply "fix --only"

c65c9b9

This reverts commit 4c732fb.

Revert "fx"

621c18c

This reverts commit eabb4af.

fix remain count

376faa2

test add cluster_load for proc gen_write_load

772af3d

Revert "Reapply "fix --only""

c56d4f0

This reverts commit c65c9b9.

cursor Bot reviewed Apr 11, 2026

Comment thread tests/unit/cluster/atomic-slot-migration.tcl Outdated

vitahlin added 5 commits April 12, 2026 00:57

fix cluster_load

2bedbc5

Reapply "Reapply "fix --only""

7ae28b2

This reverts commit c56d4f0.

do not exit in gen_write_load

bf22e8c

Revert "Reapply "Reapply "fix --only"""

b1e4cc1

This reverts commit 7ae28b2.

use cluster_load for migration write traffic

e7f5b76

sundb reviewed Apr 14, 2026

Comment thread tests/helpers/gen_write_load.tcl Outdated

Comment thread tests/helpers/gen_write_load.tcl Outdated

vitahlin and others added 2 commits April 14, 2026 09:40

Update tests/helpers/gen_write_load.tcl

5fbc4c5

Co-authored-by: debing.sun <debing.sun@redis.com>

Update tests/helpers/gen_write_load.tcl

28474a7

Co-authored-by: debing.sun <debing.sun@redis.com>

cursor Bot reviewed Apr 14, 2026

Comment thread tests/helpers/gen_write_load.tcl Outdated

Comment thread tests/helpers/gen_write_load.tcl Outdated

Comment thread tests/helpers/gen_write_load.tcl Outdated

Comment thread tests/helpers/gen_write_load.tcl Outdated

vitahlin added 2 commits April 14, 2026 09:51

fix tcl and comment

d08e8a5

Fix tcl

46c4581

sundb reviewed Apr 14, 2026

fx

1b9bce3

fx

599448e

use tcl opts to preserve original errorInfo

57b6f6d

sundb approved these changes Apr 15, 2026

sundb changed the title ~~Test fix gen_write_load crash on MOVED during atomic-slot-migration tests~~ Fix gen_write_load error on MOVED/ASK during atomic-slot-migration tests Apr 15, 2026

sundb merged commit 3cd4642 into redis:unstable Apr 15, 2026
25 of 26 checks passed

vitahlin deleted the vtest branch April 15, 2026 01:32

3 participants