JedisCluster throws "No reachable node in cluster" with zero suppressed exceptions - race condition in cluster cache

### **Environment**
- Jedis Version: 5.1.0
- Java Version: OpenJDK 17
- Redis Version: 8.0.0 with RediSearch

### **Problem Description**
Under concurrent load, **JedisCluster** intermittently throws:
```
redis.clients.jedis.exceptions.JedisClusterOperationException: No reachable node in cluster.
    at redis.clients.jedis.providers.ClusterConnectionProvider.getConnection(ClusterConnectionProvider.java:135)
```

_Critical observation_: The exception has **zero suppressed exceptions** (`e.getSuppressed().length == 0`), which suggests that no connection attempt was made.

### **Evidence**
The cluster is healthy and reachable at startup and when checked with redis-cli.

1. Error Has Zero Suppressed Exceptions
We added logging to capture suppressed exceptions:
```
catch (Exception e) {
    log.error("Failed: {}", e.getMessage());
    for (Throwable t : e.getSuppressed()) {
        log.error("  Cause: {}", t.getMessage());   // Never printed
    }
    throw e;
}
```

Output:
`ERROR | Failed: No reachable node in cluster.`
No "Cause:" line is ever printed, confirming e.getSuppressed() returns an empty array.

2. Verified Network Connectivity
Ran redis-cli -c from application host → 100% success. Network is healthy. All nodes reachable.

3. Verified JedisCluster Discovers All Nodes
At startup, JedisCluster successfully connects to all nodes and executes commands. The issue only occurs under concurrent load.

4. redis-cli Succeeds 100% of the Time
5. JedisCluster Fails ~50% of the Time
The same 50 concurrent requests succeed every time through redis-cli but fail roughly half the time through JedisCluster.

### **Steps to Reproduce**

1. Set up a Redis Cluster (3+ masters with replicas)
2. Configure JedisCluster with moderate pool size (maxTotal=10)
3. Send 50+ concurrent requests
4. Observe intermittent "No reachable node in cluster" errors
5. Check that `e.getSuppressed()` returns an empty array
6. Compare with `redis-cli -c` running the same concurrent test (100% success)

### **Root Cause Analysis** 
Looking at File **_src/main/java/redis/clients/jedis/providers/ClusterConnectionProvider.java_**:
```
// Abridged from decompiled output; variable names may differ from the actual source.
public Connection getConnection() {
    List<ConnectionPool> pools = this.cache.getShuffledNodesPool();  // ← Returns EMPTY
    JedisException suppressed = null;
    Iterator<ConnectionPool> it = pools.iterator();

    while (it.hasNext()) {  // Never executes if pools is empty
        ConnectionPool pool = it.next();
        Connection jedis = null;
        try {
            jedis = pool.getResource();
            if (jedis != null) {
                jedis.ping();
                return jedis;
            }
        } catch (JedisException ex) {
            if (suppressed == null) {
                suppressed = ex;  // Only the first failure is remembered
            }
        }
    }

    // suppressed is still null because the loop never ran
    JedisClusterOperationException noReachableNode =
        new JedisClusterOperationException("No reachable node in cluster.");
    if (suppressed != null) {  // False - nothing to add
        noReachableNode.addSuppressed(suppressed);
    }
    throw noReachableNode;
}
```

### **Why getShuffledNodesPool() Returns Empty**
The cluster node cache appears to have a race condition during refresh:

```
Timeline:
─────────────────────────────────────
Thread A (cache refresh):     cache.clear() → [fetch nodes] → cache.populate()
                                  ↑
Thread B (request):           getShuffledNodesPool() → EMPTY → throw error
─────────────────────────────────────
```

During the brief window when the cache is cleared but not yet repopulated:

1. getShuffledNodesPool() returns an empty list
2. The while loop never executes
3. `suppressed` remains `null`
4. Exception is thrown with no suppressed exceptions
5. Error message says "No reachable node" even though no connection was attempted
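
The suspected window can be illustrated with a toy cache. This is only a sketch of the hypothesis above, not the Jedis implementation: `ToyNodeCache` and `RefreshWindowDemo` are hypothetical names, strings stand in for `ConnectionPool`, and the interleaving is forced in a single thread so the result is deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy stand-in for the cluster node cache (assumption: refresh = clear, then populate).
class ToyNodeCache {
    private final Map<String, String> nodes = new ConcurrentHashMap<>();

    void clear() {
        nodes.clear();
    }

    void populate(List<String> discovered) {
        for (String node : discovered) {
            nodes.put(node, "pool-" + node);
        }
    }

    List<String> getShuffledNodesPool() {
        List<String> pools = new ArrayList<>(nodes.values());
        Collections.shuffle(pools);
        return pools;
    }
}

class RefreshWindowDemo {
    // Returns how many pools a reader sees before, during, and after a two-step refresh.
    static int[] observedPoolCounts() {
        ToyNodeCache cache = new ToyNodeCache();
        List<String> discovered = Arrays.asList("10.0.0.1:7000", "10.0.0.2:7000", "10.0.0.3:7000");

        cache.populate(discovered);
        int before = cache.getShuffledNodesPool().size();

        cache.clear();                                     // refresh step 1
        int during = cache.getShuffledNodesPool().size();  // a request arriving here sees nothing
        cache.populate(discovered);                        // refresh step 2

        int after = cache.getShuffledNodesPool().size();
        return new int[] {before, during, after};
    }
}
```

A request that reads the cache between the two refresh steps gets an empty list, which is exactly the state in which `getConnection()` would throw without recording a suppressed exception.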

### **Suggested Fix**

### Option 1: Atomic cache refresh (copy-on-write)

```
// Build an entirely new cache while the old cache still serves requests
Map<String, ConnectionPool> newNodes = discoverClusterNodes();

// Single reference swap - atomic.
// nodesCache must be volatile (or an AtomicReference) so readers see the fully built map.
this.nodesCache = newNodes;

// No window where the cache is empty
```
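
To make the idea concrete, here is a minimal, self-contained sketch of a copy-on-write cache. It is an illustration under stated assumptions, not a patch against Jedis: `CopyOnWriteNodeCache` is a hypothetical class, strings stand in for `ConnectionPool`, and keeping the previous snapshot when discovery returns nothing is a design choice of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical copy-on-write node cache; String stands in for ConnectionPool.
class CopyOnWriteNodeCache {
    // volatile: a reader sees either the old map or the new, fully built one.
    private volatile Map<String, String> nodes = Collections.emptyMap();

    // Builds the replacement map off to the side, then publishes it with one write.
    void refresh(List<String> discoveredNodes) {
        Map<String, String> next = new LinkedHashMap<>();
        for (String node : discoveredNodes) {
            next.put(node, "pool-" + node);
        }
        if (next.isEmpty()) {
            return; // assumption: keep serving the previous snapshot if discovery found nothing
        }
        nodes = Collections.unmodifiableMap(next);
    }

    List<String> getShuffledNodesPool() {
        // Reads the field once, so the list is built from a single snapshot.
        List<String> pools = new ArrayList<>(nodes.values());
        Collections.shuffle(pools);
        return pools;
    }
}
```

A real implementation would also need to close the pools of nodes that dropped out of the new map, and to reuse the existing pools of nodes that stayed, which this sketch leaves out.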

### Option 2: Retry if pool list is empty
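```
List<ConnectionPool> pools = this.cache.getShuffledNodesPool();
if (pools.isEmpty()) {
    try {
        Thread.sleep(50);  // Brief wait for cache refresh
    } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();  // Preserve interrupt status
    }
    pools = this.cache.getShuffledNodesPool();
}
```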
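
A single fixed sleep may be too short or needlessly long, so a bounded polling loop is one possible variant. The helper below is a sketch only: `EmptyPoolRetry` and `awaitNonEmpty` are hypothetical names, and the `Supplier` stands in for `this.cache::getShuffledNodesPool`.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper: polls a source until it yields a non-empty list or attempts run out.
class EmptyPoolRetry {
    static <T> List<T> awaitNonEmpty(Supplier<List<T>> source, int maxAttempts, long sleepMillis) {
        List<T> pools = source.get();
        for (int attempt = 1; attempt < maxAttempts && pools.isEmpty(); attempt++) {
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // preserve interrupt status and stop waiting
                break;
            }
            pools = source.get();
        }
        return pools; // may still be empty; the caller decides how to report that
    }
}
```

Because the result can still be empty after the last attempt, the caller would still throw, ideally with a message that distinguishes an empty cache from unreachable nodes.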

### **Impact**

- Applications using JedisCluster experience intermittent failures under concurrent load
- Error message is misleading ("no reachable node" suggests network/cluster issues when the real cause is internal cache state)
- Debugging is difficult because suppressed exceptions are empty

### **Workaround**
Currently requires application-level retry logic.
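
A minimal sketch of such a retry wrapper follows. `RetryingCaller` and its methods are hypothetical names introduced for illustration; the sketch uses only the JDK and matches the symptom by message text and by the absence of suppressed exceptions, which is an assumption based on the observations in this report.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical application-level retry helper; not part of Jedis.
class RetryingCaller {
    static <T> T callWithRetry(Supplier<T> operation,
                               Predicate<RuntimeException> isRetryable,
                               int maxAttempts,
                               long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                if (!isRetryable.test(e)) {
                    throw e; // unrelated failures propagate immediately
                }
                last = e;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis * attempt); // linear back-off between attempts
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw last; // every attempt failed with a retryable error
    }

    // Matches the symptom in this report: the message plus zero suppressed exceptions.
    static boolean looksLikeEmptyCache(RuntimeException e) {
        return "No reachable node in cluster.".equals(e.getMessage())
            && e.getSuppressed().length == 0;
    }
}
```

With an existing `JedisCluster` instance named `cluster`, a call might look like `RetryingCaller.callWithRetry(() -> cluster.get("key"), RetryingCaller::looksLikeEmptyCache, 3, 20)`. Errors that carry suppressed exceptions are not retried, since those indicate that real connection attempts failed.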

### **Questions**

1. Is the cluster cache refresh designed to be atomic?
2. Is there a known race condition in ClusterNodesCache?
3. Would you accept a PR implementing copy-on-write for the cache refresh?

