JedisCluster throws "No reachable node in cluster" with zero suppressed exceptions - race condition in cluster cache #4388

@aditya-baldwa

Environment

  • Jedis Version: 5.1.0
  • Java Version: OpenJDK 17
  • Redis Version: 8.0.0 with RediSearch

Problem Description

Under concurrent load, JedisCluster intermittently throws:

redis.clients.jedis.exceptions.JedisClusterOperationException: No reachable node in cluster.
    at redis.clients.jedis.providers.ClusterConnectionProvider.getConnection(ClusterConnectionProvider.java:135)

Critical observation: The exception has zero suppressed exceptions (e.getSuppressed().length == 0), indicating no connection attempt was made.

Evidence

The cluster is healthy and reachable at startup, and also when checked with redis-cli.

  1. Error Has Zero Suppressed Exceptions
    We added logging to capture suppressed exceptions:
catch (Exception e) {
    log.error("Failed: {}", e.getMessage());
    for (Throwable t : e.getSuppressed()) {
        log.error("  Cause: {}", t.getMessage());   // Never printed
    }
    throw e;
}

Output:
ERROR | Failed: No reachable node in cluster.
No "Cause:" line is ever printed, confirming e.getSuppressed() returns an empty array.

  2. Verified Network Connectivity
    Ran redis-cli -c from the application host → 100% success. The network is healthy and all nodes are reachable.

  3. Verified JedisCluster Discovers All Nodes
    At startup, JedisCluster successfully connects to all nodes and executes commands. The issue only occurs under concurrent load.

  4. redis-cli Works 100% of the Time

  5. JedisCluster Fails ~50% of the Time
    The same 50 concurrent requests have a different success rate when sent through JedisCluster.

Steps to Reproduce

  1. Set up a Redis Cluster (3+ masters with replicas)
  2. Configure JedisCluster with moderate pool size (maxTotal=10)
  3. Send 50+ concurrent requests
  4. Observe intermittent "No reachable node in cluster" errors
  5. Check that e.getSuppressed() returns empty array
  6. Compare with redis-cli -c running same concurrent test (100% success)
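
For anyone trying to reproduce this, the concurrent part of the steps above could be driven by something like the harness below. This is only a sketch: `ConcurrentRepro` and `countFailures` are hypothetical names, not part of Jedis, and the command under test is passed in as a plain `Callable` so the harness itself has no Redis dependency.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical test harness; none of these names come from Jedis.
class ConcurrentRepro {

    /** Runs `requests` copies of `command` simultaneously and returns how many threw. */
    static int countFailures(int requests, Callable<?> command) {
        ExecutorService executor = Executors.newFixedThreadPool(requests);
        CountDownLatch startGate = new CountDownLatch(1);
        AtomicInteger failures = new AtomicInteger();

        for (int i = 0; i < requests; i++) {
            executor.execute(() -> {
                try {
                    startGate.await();          // hold every worker until all are queued
                    command.call();
                } catch (Exception e) {
                    failures.incrementAndGet();
                    // Log the suppressed count explicitly so an empty array is visible.
                    System.err.println("Failed: " + e.getMessage()
                            + " (suppressed=" + e.getSuppressed().length + ")");
                }
            });
        }

        startGate.countDown();                  // release all workers at once
        executor.shutdown();
        try {
            executor.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return failures.get();
    }
}
```

Assuming an existing `JedisCluster` instance named `jedisCluster`, a call such as `ConcurrentRepro.countFailures(50, () -> jedisCluster.get("some-key"))` would correspond to steps 3 to 5. Logging the suppressed count directly avoids relying on the absence of a "Cause:" line.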

Root Cause Analysis

Looking at src/main/java/redis/clients/jedis/providers/ClusterConnectionProvider.java:

public Connection getConnection() {
    List<ConnectionPool> pools = this.cache.getShuffledNodesPool();  // ← Returns EMPTY
    JedisException suppressed = null;

    Iterator<ConnectionPool> it = pools.iterator();
    while (it.hasNext()) {  // Never executes if pools is empty
        ConnectionPool pool = it.next();
        Connection jedis = null;
        try {
            jedis = pool.getResource();
            if (jedis != null) {
                jedis.ping();
                return jedis;
            }
        } catch (JedisException ex) {
            if (suppressed == null) {
                suppressed = ex;  // only the first failure is remembered
            }
        }
    }

    // suppressed is still null because the loop never ran
    JedisClusterOperationException noReachableNode =
        new JedisClusterOperationException("No reachable node in cluster.");
    if (suppressed != null) {  // False - nothing to add
        noReachableNode.addSuppressed(suppressed);
    }
    throw noReachableNode;
}

Why getShuffledNodesPool() Returns Empty

The cluster node cache appears to have a race condition during refresh:

Timeline:
─────────────────────────────────────
Thread A (cache refresh):     cache.clear() → [fetch nodes] → cache.populate()
                                  ↑
Thread B (request):           getShuffledNodesPool() → EMPTY → throw error
─────────────────────────────────────

During the brief window when the cache is cleared but not yet repopulated:

  1. getShuffledNodesPool() returns an empty list
  2. The while loop never executes
  3. The suppressed variable remains null
  4. Exception is thrown with no suppressed exceptions
  5. Error message says "No reachable node" even though no connection was attempted
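
The sequence above can be checked in isolation with a small stand-in. The sketch below uses only JDK types: `EmptyPoolListDemo` and `tryAll` are hypothetical names, and each `Supplier` merely plays the role of a pool. It mirrors the control flow quoted from `getConnection()` rather than calling Jedis, so it illustrates the reasoning but does not prove that the real cache is ever empty.

```java
import java.util.List;
import java.util.function.Supplier;

// Stand-in types only; this mirrors the quoted control flow, not the real Jedis classes.
class EmptyPoolListDemo {

    /** Returns null if some "pool" yields a connection, otherwise the exception that would be thrown. */
    static RuntimeException tryAll(List<Supplier<String>> pools) {
        RuntimeException suppressed = null;
        for (Supplier<String> pool : pools) {
            try {
                pool.get();
                return null;                      // a connection was obtained
            } catch (RuntimeException ex) {
                if (suppressed == null) {
                    suppressed = ex;              // remember the first failure only
                }
            }
        }
        RuntimeException noReachableNode =
                new RuntimeException("No reachable node in cluster.");
        if (suppressed != null) {
            noReachableNode.addSuppressed(suppressed);
        }
        return noReachableNode;
    }
}
```

With an empty list the returned exception carries no suppressed exceptions, while a list containing one failing pool yields exactly one. That difference is what makes the empty `getSuppressed()` array a useful signal here.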

Suggested Fix

Option 1: Atomic cache refresh (copy-on-write)

// Build entirely new cache while old cache still serves requests
Map<String, ConnectionPool> newNodes = discoverClusterNodes();

// Single reference swap - atomic
this.nodesCache = newNodes;

// No window where cache is empty
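
To make Option 1 more concrete, here is a minimal self-contained sketch of the reference-swap idea. It is not the existing Jedis cache: `CopyOnWriteNodeCache` and `refresh` are hypothetical names, the pool type is left generic, and lifecycle concerns such as closing pools for nodes that have left the cluster are deliberately ignored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a copy-on-write node cache; P stands in for a pool type.
class CopyOnWriteNodeCache<P> {

    // Readers always see either the old snapshot or the new one, never a half-built map.
    private volatile Map<String, P> nodes = Collections.emptyMap();

    /** Builds a fresh immutable snapshot off to the side, then publishes it with one write. */
    void refresh(Map<String, P> discovered) {
        Map<String, P> snapshot = Collections.unmodifiableMap(new HashMap<>(discovered));
        nodes = snapshot;   // single volatile write; there is no clear() step
    }

    /** Returns a shuffled copy of whichever snapshot is current at the time of the call. */
    List<P> getShuffledNodesPool() {
        List<P> pools = new ArrayList<>(nodes.values());
        Collections.shuffle(pools);
        return pools;
    }
}
```

Because the snapshot is immutable and published through a single volatile field, a reader that races with a refresh gets the previous node set instead of an empty one. The list is only empty before the first refresh has completed.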

Option 2: Retry if pool list is empty

List<ConnectionPool> pools = this.cache.getShuffledNodesPool();
if (pools.isEmpty()) {
    Thread.sleep(50);  // Brief wait for cache refresh
    pools = this.cache.getShuffledNodesPool();
}
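
As written, Option 2 would not compile inside `getConnection()` because `Thread.sleep` throws the checked `InterruptedException`, and a single retry may not be enough. A slightly more defensive variant is sketched below, with the pool list supplied through a `Supplier` so the example stands alone. `EmptyListRetry`, `awaitNonEmpty`, and the attempt and pause parameters are all hypothetical.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper; in Jedis the supplier would wrap the cache's getShuffledNodesPool().
class EmptyListRetry {

    /** Fetches the list up to maxAttempts times, pausing between attempts while it is empty. */
    static <P> List<P> awaitNonEmpty(Supplier<List<P>> source, int maxAttempts, long pauseMillis) {
        List<P> pools = source.get();
        for (int attempt = 1; attempt < maxAttempts && pools.isEmpty(); attempt++) {
            try {
                Thread.sleep(pauseMillis);          // brief wait for the refresh to finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag and stop waiting
                break;
            }
            pools = source.get();
        }
        return pools;                               // may still be empty; the caller decides what to do
    }
}
```

This only narrows the window rather than closing it, so it is better seen as a mitigation than as a replacement for an atomic refresh.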

Impact

  • Applications using JedisCluster experience intermittent failures under concurrent load
  • Error message is misleading ("no reachable node" suggests network/cluster issues when the real cause is internal cache state)
  • Debugging is difficult because suppressed exceptions are empty

Workaround

The only workaround found so far is application-level retry logic around each JedisCluster call.
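
For reference, the kind of application-level retry meant here could look roughly like the sketch below. `RetryOnTransientFailure` and its parameters are hypothetical, and the decision about which exceptions are worth retrying is left to a caller-supplied predicate so the helper does not depend on Jedis classes.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical application-side helper; not part of Jedis.
class RetryOnTransientFailure {

    /** Runs command up to maxAttempts times, retrying only exceptions accepted by retryable. */
    static <T> T call(Supplier<T> command, int maxAttempts, long backoffMillis,
                      Predicate<RuntimeException> retryable) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return command.get();
            } catch (RuntimeException e) {
                if (!retryable.test(e)) {
                    throw e;                        // not the transient case; fail fast
                }
                last = e;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;                          // stop retrying once interrupted
                }
            }
        }
        throw last;                                 // every attempt failed with a retryable error
    }
}
```

A predicate such as `e -> e instanceof JedisClusterOperationException && e.getSuppressed().length == 0` would limit retries to the exact symptom described in this issue, so that genuine connectivity failures still surface immediately.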

Questions

  1. Is the cluster cache refresh designed to be atomic?
  2. Is there a known race condition in ClusterNodesCache?
  3. Would you accept a PR implementing copy-on-write for the cache refresh?

Metadata

Labels: waiting-for-feedback (We need additional information before we can continue)
