Description
Environment
- Jedis Version: 5.1.0
- Java Version: OpenJDK 17
- Redis Version: 8.0.0 with RediSearch
Problem Description
Under concurrent load, JedisCluster intermittently throws:
redis.clients.jedis.exceptions.JedisClusterOperationException: No reachable node in cluster.
at redis.clients.jedis.providers.ClusterConnectionProvider.getConnection(ClusterConnectionProvider.java:135)
Critical observation: The exception has zero suppressed exceptions (e.getSuppressed().length == 0), indicating no connection attempt was made.
Evidence
Cluster Is Healthy and Reachable at Startup and When Checked with the CLI
- Error Has Zero Suppressed Exceptions
We added logging to capture suppressed exceptions:
catch (Exception e) {
    log.error("Failed: {}", e.getMessage());
    for (Throwable t : e.getSuppressed()) {
        log.error(" Cause: {}", t.getMessage()); // Never printed
    }
    throw e;
}
Output:
ERROR | Failed: No reachable node in cluster.
No "Cause:" line is ever printed, confirming e.getSuppressed() returns an empty array.
- Verified Network Connectivity
Ran redis-cli -c from the application host → 100% success. The network is healthy and all nodes are reachable.
- Verified JedisCluster Discovers All Nodes
At startup, JedisCluster successfully connects to all nodes and executes commands. The issue only occurs under concurrent load.
- redis-cli Works 100%
- JedisCluster Fails ~50%
The same 50 concurrent requests succeed 100% of the time through redis-cli but only about 50% of the time through JedisCluster.
Steps to Reproduce
- Set up a Redis Cluster (3+ masters with replicas)
- Configure JedisCluster with moderate pool size (maxTotal=10)
- Send 50+ concurrent requests
- Observe intermittent "No reachable node in cluster" errors
- Check that e.getSuppressed() returns an empty array
- Compare with redis-cli -c running the same concurrent test (100% success)
Root Cause Analysis
Looking at src/main/java/redis/clients/jedis/providers/ClusterConnectionProvider.java:
public Connection getConnection() {
    List<ConnectionPool> pools = this.cache.getShuffledNodesPool(); // ← Returns EMPTY
    JedisException suppressed = null;
    for (ConnectionPool pool : pools) { // Never executes if pools is empty
        Connection jedis = null;
        try {
            jedis = pool.getResource();
            if (jedis != null) {
                jedis.ping();
                return jedis;
            }
        } catch (JedisException ex) {
            if (suppressed == null) {
                suppressed = ex;
            }
        }
    }
    // suppressed is still null because loop never ran
    JedisClusterOperationException noReachableNode =
        new JedisClusterOperationException("No reachable node in cluster.");
    if (suppressed != null) { // False - nothing to add
        noReachableNode.addSuppressed(suppressed);
    }
    throw noReachableNode;
}
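To make the distinction concrete, here is a minimal stand-alone model of that loop. It is not Jedis code: `SuppressedDemo`, `pickFirstWorking`, and `suppressedCount` are hypothetical names, and each "pool" is just a `Supplier<String>`. It only illustrates that an empty pool list and a list of failing pools produce the same message but a different suppressed count.

```java
import java.util.List;
import java.util.function.Supplier;

// Stand-in model of the selection loop above; not Jedis code.
final class SuppressedDemo {

    /** Tries each "pool" in order, remembers the first failure, throws if none works. */
    static String pickFirstWorking(List<Supplier<String>> pools) {
        RuntimeException firstFailure = null;
        for (Supplier<String> pool : pools) {
            try {
                return pool.get();
            } catch (RuntimeException ex) {
                if (firstFailure == null) {
                    firstFailure = ex; // only the first failure is kept
                }
            }
        }
        IllegalStateException noNode = new IllegalStateException("No reachable node in cluster.");
        if (firstFailure != null) {
            noNode.addSuppressed(firstFailure);
        }
        throw noNode;
    }

    /** Returns how many suppressed exceptions the failure carries, or -1 on success. */
    static int suppressedCount(List<Supplier<String>> pools) {
        try {
            pickFirstWorking(pools);
            return -1;
        } catch (IllegalStateException e) {
            return e.getSuppressed().length;
        }
    }

    public static void main(String[] args) {
        Supplier<String> broken = () -> { throw new RuntimeException("connect timed out"); };
        System.out.println(suppressedCount(List.of()));               // prints 0: loop never ran
        System.out.println(suppressedCount(List.of(broken, broken))); // prints 1: first failure attached
    }
}
```

In this model, a count of 0 can only come from the loop body never running, which is why the empty suppressed array in our logs points at an empty pool list rather than at failed connection attempts.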
Why getShuffledNodesPool() Returns Empty
The cluster node cache appears to have a race condition during refresh:
Timeline:
─────────────────────────────────────
Thread A (cache refresh):  cache.clear() → [fetch nodes] → cache.populate()
                                                ↑
Thread B (request):        getShuffledNodesPool() → EMPTY → throw error
─────────────────────────────────────
During the brief window when the cache is cleared but not yet repopulated:
- getShuffledNodesPool() returns an empty list
- The while loop never executes
- suppressed remains null
- Exception is thrown with no suppressed exceptions
- Error message says "No reachable node" even though no connection was attempted
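The suspected window can be reproduced deterministically in a toy model. The sketch below is an assumption about the shape of the problem, not the actual Jedis refresh code: `ClearThenPopulateCache` is a hypothetical class, and two latches force a reader to run exactly between the clear and the repopulate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for a node cache that refreshes by clear-then-populate.
final class ClearThenPopulateCache {
    private final Map<String, String> nodes = new ConcurrentHashMap<>();

    ClearThenPopulateCache() {
        populate();
    }

    private void populate() {
        nodes.put("10.0.0.1:6379", "pool-1");
        nodes.put("10.0.0.2:6379", "pool-2");
    }

    List<String> getNodes() {
        return new ArrayList<>(nodes.values());
    }

    /** Clears, signals that the map is empty, waits for the reader, then repopulates. */
    void refresh(CountDownLatch cleared, CountDownLatch readerDone) {
        nodes.clear();
        cleared.countDown();          // the window opens here
        try {
            readerDone.await();       // stands in for the slow "fetch nodes" step
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        populate();                   // the window closes here
    }

    /** Forces a reader into the refresh window and returns how many nodes it saw. */
    static int nodesSeenDuringRefresh() {
        ClearThenPopulateCache cache = new ClearThenPopulateCache();
        CountDownLatch cleared = new CountDownLatch(1);
        CountDownLatch readerDone = new CountDownLatch(1);
        new Thread(() -> cache.refresh(cleared, readerDone)).start();
        try {
            cleared.await();
            return cache.getNodes().size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while demonstrating the window", e);
        } finally {
            readerDone.countDown();   // let the refresher finish
        }
    }

    public static void main(String[] args) {
        System.out.println(nodesSeenDuringRefresh()); // prints 0
    }
}
```

Each map operation here is individually thread-safe, yet the reader still observes an empty cache, because the refresh as a whole is not atomic. Under real load the window would be hit only occasionally, which would match the intermittent failures.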
Suggested Fix
Option 1: Atomic cache refresh (copy-on-write)
// Build an entirely new cache while the old cache still serves requests
Map<String, ConnectionPool> newNodes = discoverClusterNodes();
// Single reference swap - atomic (nodesCache must be volatile so readers see the new map)
this.nodesCache = newNodes;
// No window where the cache is empty
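A slightly fuller sketch of Option 1, using only the standard library. `SnapshotNodeCache` and its methods are hypothetical names, pools are represented as plain strings, and skipping an empty discovery result is an extra design choice of this sketch rather than something the current code is known to do.

```java
import java.util.Map;

// Hypothetical sketch of a copy-on-write node cache; names are illustrative, not Jedis internals.
final class SnapshotNodeCache {
    // volatile: a reader sees either the old map or the new one, never a half-built state
    private volatile Map<String, String> nodes;

    SnapshotNodeCache(Map<String, String> initial) {
        this.nodes = Map.copyOf(initial);
    }

    /** Returns the current immutable view; safe to iterate without locking. */
    Map<String, String> snapshot() {
        return nodes;
    }

    /** Builds the replacement off to the side, then publishes it with a single write. */
    void refresh(Map<String, String> discovered) {
        if (discovered.isEmpty()) {
            return; // assumption: keep serving the previous view rather than publish an empty one
        }
        nodes = Map.copyOf(discovered);
    }
}
```

A request that grabbed `snapshot()` before a refresh keeps iterating the old map, and a request that arrives afterwards gets the new one. At no point does a reader observe an empty map unless one was passed to the constructor.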
Option 2: Retry if pool list is empty
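List<ConnectionPool> pools = this.cache.getShuffledNodesPool();
if (pools.isEmpty()) {
    try {
        Thread.sleep(50); // Brief wait for cache refresh
    } catch (InterruptedException ie) {
        Thread.currentThread().interrupt(); // Preserve the interrupt status
    }
    pools = this.cache.getShuffledNodesPool();
}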
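If Option 2 is preferred, a bounded poll may be safer than a single fixed sleep. The helper below is a generic sketch with hypothetical names (`EmptyListRetry`, `awaitNonEmpty`); the attempt count and pause are arbitrary placeholders, not tuned values.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper: re-reads a list a bounded number of times while it is empty.
final class EmptyListRetry {

    /**
     * Calls source up to maxAttempts times, pausing between calls,
     * and returns the first non-empty result or the last (empty) one.
     */
    static <T> List<T> awaitNonEmpty(Supplier<List<T>> source, int maxAttempts, long pauseMillis) {
        List<T> result = source.get();
        for (int attempt = 1; attempt < maxAttempts && result.isEmpty(); attempt++) {
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
                return result;
            }
            result = source.get();
        }
        return result;
    }
}
```

Inside getConnection() this could look like `awaitNonEmpty(() -> cache.getShuffledNodesPool(), 3, 20)`. It still only masks the window rather than closing it, so Option 1 looks like the cleaner fix.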
Impact
- Applications using JedisCluster experience intermittent failures under concurrent load
- Error message is misleading ("no reachable node" suggests network/cluster issues when the real cause is internal cache state)
- Debugging is difficult because suppressed exceptions are empty
Workaround
The only workaround we have found is application-level retry logic.
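For reference, the retry wrapper is roughly the following. This is a generic sketch, not our production code: `RetryingCall` and `callWithRetry` are hypothetical names, and the linear backoff is an arbitrary choice.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical application-level retry wrapper; not part of Jedis.
final class RetryingCall {

    /** Runs call, retrying on exceptions accepted by retryable, up to maxAttempts times. */
    static <T> T callWithRetry(Supplier<T> call, Predicate<RuntimeException> retryable,
                               int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException ex) {
                if (!retryable.test(ex)) {
                    throw ex; // unrelated failure: do not hide it behind retries
                }
                last = ex;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis * attempt); // linear backoff
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break; // give up early and rethrow the last failure
                }
            }
        }
        throw last;
    }
}
```

In application code the predicate can be narrowed to the signature seen in this issue, for example `ex -> ex instanceof JedisClusterOperationException && ex.getSuppressed().length == 0`, so that genuine connectivity failures are not retried blindly.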
Questions
- Is the cluster cache refresh designed to be atomic?
- Is there a known race condition in ClusterNodesCache?
- Would you accept a PR implementing copy-on-write for the cache refresh?