Bug: RedisNodeNotFoundException in cluster mode due to incorrect slot calculation #7026

@ngyngcphu

Description

Redis version

6.2.7

Redisson version

4.3.0

Redisson configuration

public class RedissonConfig {

  @Value("${redis.host}")
  private String redisHost;

  @Value("${redis.port}")
  private int redisPort;

  @Value("${redis.password:}")
  private String redisPassword;

  @Bean
  public RedissonClient redissonClient() {
    Config config = new Config();

    if (!redisPassword.isEmpty()) {
      config.setPassword(redisPassword);
    }

    config.useClusterServers()
        .addNodeAddress("redis://" + redisHost + ":" + redisPort)
        .setScanInterval(2000)
        .setConnectTimeout(10000)
        .setTimeout(15000)
        .setRetryAttempts(3)
        .setReadMode(ReadMode.MASTER);

    return Redisson.create(config);
  }
}

What is the Expected behavior?

When Redisson is used with Redis Cluster, all operations should work correctly:

// RRemoteService - should successfully call remote methods
RRemoteService service = redisson.getRemoteService("my-service");
HelloService client = service.get(HelloService.class);
String result = client.sayHello("World");  // Should return "Hello World"

// RExecutorService - should successfully submit tasks
RExecutorService executor = redisson.getExecutorService("my-executor");
executor.submit(() -> System.out.println("Task executed"));  // Should submit without error
executor.shutdown();  // Should shutdown gracefully

Expected: Operations complete successfully with correct slot routing in cluster mode.

What is the Actual behavior?

RRemoteService and RExecutorService operations fail with RedisNodeNotFoundException in Redis Cluster mode:

Error: RedisNodeNotFoundException: Node: NodeSource [slot=5570, addr=redis://172.16.23.74:6379, redisClient=null, redirect=MOVED, entry=null] hasn't been discovered yet. Increase 'retryAttempts' setting. Last cluster nodes topology: f7a99453e22ea56853f11207d34b75d7bae808c8 10.50.***.***:31638@16379 slave f0dcbeadb242d167ea5c3d194e6bfb9e6a6377c9 0 1775444275000 2450 connected
09356c63096644328aa44fe89d29c66b7c2f4f4f 10.50.***.***:30397@16379 master - 0 1775444276336 2441 connected 0-5460
143cf62dad9ba03f221932c2b6afb985f27d7770 10.50.***.***:32213@16379 slave a696e298fd0ce310d35b00f1a23fb775428f7012 0 1775444274000 2451 connected
a696e298fd0ce310d35b00f1a23fb775428f7012 10.50.***.***:30246@16379 master - 0 1775444275327 2451 connected 5461-10922
c4fd20ba43567200c841b4a1eb8157d012206120 10.50.***.***:30553@16379 slave 09356c63096644328aa44fe89d29c66b7c2f4f4f 0 1775444273000 2441 connected
f0dcbeadb242d167ea5c3d194e6bfb9e6a6377c9 10.50.***.***:30375@16379 myself,master - 0 1775444274000 2450 connected 10923-16383

I ran into this issue after deploying my Spring Boot apps to K8s with Redis Cluster. The cluster has 6 Redis nodes (3 masters, 3 slaves) exposed via NodePort. As soon as I tried to use RRemoteService or RExecutorService, every call failed with RedisNodeNotFoundException.

No failover was happening. Normally this exception occurs during cluster rebalancing and clears after scanInterval. Here it happened 100% of the time for specific operations: the app kept calculating the wrong slot and connecting to the wrong node.

I noticed that the slot number in the error message did not match the slot of my service name. I verified this with a simple test:

// My service name
String serviceName = "test-service";
String requestQueue = "{test-service:com.example.redisson.service.HelloService}";

// Calculate slots
int slot1 = calcSlot(serviceName);        // 288
int slot2 = calcSlot(requestQueue);       // 5570
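
The calcSlot helper is not shown above. A minimal sketch of what such a helper could look like follows, assuming the standard Redis Cluster rule: CRC16 (XMODEM variant) of the key modulo 16384, hashing only the hash tag when the key contains a non-empty {...} section. The class and method names (SlotSketch, hashPart, crc16, slot) are made up for illustration and are not Redisson APIs.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: not Redisson code, names are hypothetical.
class SlotSketch {

    // Returns the part of the key that Redis Cluster hashes:
    // the text between the first '{' and the first '}' after it,
    // or the whole key if there is no such non-empty section.
    static String hashPart(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) { // at least one character between the braces
                return key.substring(open + 1, close);
            }
        }
        return key;
    }

    // Bitwise CRC16/XMODEM: polynomial 0x1021, initial value 0.
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = (crc & 0x8000) != 0 ? (crc << 1) ^ 0x1021 : crc << 1;
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    // Slot number in the range 0-16383.
    static int slot(String key) {
        return crc16(hashPart(key).getBytes(StandardCharsets.UTF_8)) % 16384;
    }

    public static void main(String[] args) {
        // The issue reports 288 and 5570 for these two keys.
        System.out.println(slot("test-service"));
        System.out.println(slot("{test-service:com.example.redisson.service.HelloService}"));
    }
}
```

Because only the hash tag is hashed, "{test-service}" and "test-service" land on the same slot, while a key whose braces enclose a longer string is hashed on that longer string.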

Redisson calculated the slot from the service name, but the Lua script running inside Redis used the full request queue name. The two keys have different hash tags and therefore map to different slots. The Java code sends the command to slot 288 (Node A), but Redis reports that the key belongs to slot 5570 (Node B). The operation never succeeds and always ends in RedisNodeNotFoundException.

Root Cause: In Redis Cluster, every key is hashed to one of 16384 slots (0-16383). Redisson uses hash tags such as {serviceName} to keep related keys on the same slot. The bug is that the key used for routing and the keys used by the script carry different hash tags:

// RedissonRemoteService.java - addAsync()
@Override
protected CompletableFuture<Boolean> addAsync(String requestQueueName, RemoteServiceRequest request,
                                              RemotePromise<Object> result) {
    // BUG: Uses 'name' for slot calculation
    RFuture<Boolean> future = commandExecutor.evalWriteNoRetryAsync(name, LongCodec.INSTANCE, RedisCommands.EVAL_BOOLEAN,
        // Lua script uses KEYS[1] = requestQueueName, KEYS[2] = requestQueueName + ":tasks"
        "redis.call('hset', KEYS[2], ARGV[1], ARGV[2]);"
      + "redis.call('rpush', KEYS[1], ARGV[1]); "
      + "return 1;",
        Arrays.asList(requestQueueName, requestQueueName + ":tasks"),  // KEYS[1], KEYS[2]
        request.getId(), encode(request));
    ...
}

Java calculates slot from: name = {test-service} → Slot 288
Lua script uses: requestQueueName = {test-service:com.example.redisson.service.HelloService} → Slot 5570
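
To make the consequence concrete, here is a small sketch that maps a slot to its owning master using the slot ranges from the topology dump in the error message. Node IDs are abbreviated to their first 8 characters, and the class and method names are hypothetical, chosen only for this example.

```java
import java.util.TreeMap;

// Illustrative sketch only: slot ranges copied from the topology in the error message.
class SlotOwnerSketch {

    // Key = first slot of the range, value = abbreviated master node ID.
    // The three ranges are contiguous and cover 0-16383.
    private static final TreeMap<Integer, String> MASTER_BY_RANGE_START = new TreeMap<>();

    static {
        MASTER_BY_RANGE_START.put(0, "09356c63");      // 0-5460
        MASTER_BY_RANGE_START.put(5461, "a696e298");   // 5461-10922
        MASTER_BY_RANGE_START.put(10923, "f0dcbead");  // 10923-16383
    }

    // Returns the master that owns the given slot.
    static String ownerOf(int slot) {
        if (slot < 0 || slot > 16383) {
            throw new IllegalArgumentException("slot out of range: " + slot);
        }
        return MASTER_BY_RANGE_START.floorEntry(slot).getValue();
    }

    public static void main(String[] args) {
        System.out.println("slot 288  -> " + ownerOf(288));
        System.out.println("slot 5570 -> " + ownerOf(5570));
    }
}
```

With this topology, slot 288 and slot 5570 belong to different masters, which is consistent with the MOVED redirect in the error.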

Affected Operations: I found this same bug pattern in 10 different places:

RRemoteService (4 bugs)

| # | File | Line | What It Affects |
|---|------|------|-----------------|
| 1 | AsyncRemoteProxy.java | 342 | ACK polling - can't receive responses |
| 2 | RedissonRemoteService.java | 85 | addAsync - can't send requests |
| 3 | RedissonRemoteService.java | 98 | removeAsync - can't cancel requests |
| 4 | RedissonRemoteService.java | 318 | subscribe - server can't send ACK |

RExecutorService (6 bugs)

| # | File | Line | What It Affects |
|---|------|------|-----------------|
| 5 | RedissonExecutorService.java | 478 | shutdown - can't shutdown executor |
| 6 | ScheduledTasksService.java | 97 | schedule - can't schedule tasks |
| 7 | ScheduledTasksService.java | 106 | cancel - can't cancel scheduled tasks |
| 8 | TasksService.java | 130 | submit - can't submit tasks |
| 9 | TasksService.java | 164 | cancel - can't cancel tasks |
| 10 | RedissonExecutorService.java | 1149 | getTaskCount - can't get stats |

All 10 bugs have the same root cause: the Java code routes the command by the slot of name, while the Lua script operates on keys with a different hash tag.
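
One way to spot this pattern at a call site is to compare the hash tag of the routing key with the hash tags of the keys passed to the script. The following is a rough sketch of such a check, not something that exists in Redisson; the names HashTagCheckSketch, hashPart and sharesHashTag are invented for this example. Matching hash tags guarantee matching slots, while differing tags almost always mean differing slots.

```java
import java.util.List;

// Illustrative sketch only: a hypothetical sanity check, not a Redisson API.
class HashTagCheckSketch {

    // Text between the first '{' and the first '}' after it,
    // or the whole key if there is no non-empty {...} section.
    static String hashPart(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) {
                return key.substring(open + 1, close);
            }
        }
        return key;
    }

    // True only if every script key hashes on the same string as the routing key.
    static boolean sharesHashTag(String routingKey, List<String> scriptKeys) {
        String expected = hashPart(routingKey);
        for (String key : scriptKeys) {
            if (!hashPart(key).equals(expected)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String queue = "{test-service:com.example.redisson.service.HelloService}";
        List<String> keys = List.of(queue, queue + ":tasks");
        // Routing by the service name, as in the addAsync() snippet above.
        System.out.println(sharesHashTag("{test-service}", keys));
        // Routing by the request queue name itself.
        System.out.println(sharesHashTag(queue, keys));
    }
}
```

In this sketch, routing by the request queue name passes the check, while routing by the service name does not.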

Reproduction: I created a reproduction test at https://github.com/ngyngcphu/redisson-slot-bug that triggers all 10 bugs. Each endpoint demonstrates a different slot calculation bug with detailed analysis of what slot was used vs what slot should have been used. Example:

curl "http://localhost:8080/bug/remote/bug1/poll-response?name=World"

This request fails with the same RedisNodeNotFoundException (slot=5570, redirect=MOVED) and the same cluster topology dump shown under "What is the Actual behavior?" above.

Additional information

  1. Reproduce bug: https://github.com/ngyngcphu/redisson-slot-bug
