Lettuce writing to closed channels, results in persistent failure #2530

Closed
greeaida opened this issue Oct 19, 2023 · 6 comments
Labels: status: feedback-reminder, status: waiting-for-feedback

greeaida commented Oct 19, 2023

Bug Report

Preface: I expect this is a client configuration issue, but I have been unable to find similar issues, documented scenarios, or relevant networking details.

This is using cluster mode with ElastiCache.

Current Behavior

When a connection becomes unhealthy (e.g. after a failover or network disruption), there is no cache activity until a new node becomes primary. Once all nodes in the shard have been exhausted as primaries, cache activity within the shard halts indefinitely.

Stack trace

Note that this continues indefinitely. Healthy periodic refresh, cluster topology, and close-stale-connections activity is still evident at the scheduled cadence.

// Healthy access
[DEBUG]  (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] write(ctx, ClusterCommand [command=AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5], promise)
[DEBUG]  (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandEncoder: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379] writing command ClusterCommand [command=AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5]
[DEBUG]  (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] Received: 1341 bytes, 1 commands in the stack
[DEBUG]  (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] Stack contains: 1 commands

// Channel is closed - this was time of network failure
[DEBUG]  (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] channelInactive()
[DEBUG]  (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.DefaultEndpoint: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, epid=0x9] deactivating endpoint handler
[DEBUG]  (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] channelInactive() done
[DEBUG]  (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] channelUnregistered()

// Failure behavior with auto-reconnect disabled (io.lettuce.core.RedisException: Currently not connected. Commands are rejected)
[DEBUG] (<Server Thread>-144) io.lettuce.core.cluster.PooledClusterConnectionProvider: getConnection(READ, 16140)
[DEBUG] (<Server Thread>-144) io.lettuce.core.protocol.DefaultEndpoint: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, epid=0xc] write() done
[DEBUG] (<Server Thread>-141) io.lettuce.core.RedisChannelHandler: dispatching command AsyncCommand [type=EVALSHA, output=IntegerOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]

// Failure behavior with auto-reconnect enabled (io.lettuce.core.RedisCommandTimeoutException: Command timed out after 1 second(s))
[DEBUG] (<Server Thread>-287) io.lettuce.core.RedisChannelHandler: dispatching command AsyncCommand [type=EVALSHA, output=IntegerOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
[DEBUG] (<Server Thread>-287) io.lettuce.core.cluster.PooledClusterConnectionProvider: getConnection(READ, 7732)
[DEBUG] (<Server Thread>-287) io.lettuce.core.protocol.DefaultEndpoint: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, epid=0x49] writeToDisconnectedBuffer() buffering (disconnected) command ClusterCommand [command=AsyncCommand [type=EVALSHA, output=IntegerOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5]
[DEBUG] (<Server Thread>-287) io.lettuce.core.protocol.DefaultEndpoint: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, epid=0x49] write() done

Read/write activity never reaches the socket.

Input Code

clientResources = DefaultClientResources.builder()
    .reconnectDelay(Delay.fullJitter(
        MIN_RECONNECT_DELAY,
        MAX_RECONNECT_DELAY,
        FULL_JITTER_BASE,
        TimeUnit.MILLISECONDS))
    .build();

RedisClusterClient clusterClient = RedisClusterClient.create(clientResources, <RedisURIBuilder>.build());

ClusterTopologyRefreshOptions clusterTopologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
    .enableAllAdaptiveRefreshTriggers()
    .enablePeriodicRefresh(<1 minute>)
    .closeStaleConnections(true)
    .dynamicRefreshSources(true)
    .build();

SocketOptions socketOptions = SocketOptions.builder()
    .connectTimeout(<100 ms>)
    .keepAlive(false)
    .tcpNoDelay(true)
    .build();

clusterClient.setOptions(ClusterClientOptions.builder()
    .autoReconnect(<true/false>)
    .topologyRefreshOptions(clusterTopologyRefreshOptions)
    .socketOptions(socketOptions)
    .build());

StatefulRedisClusterConnection<byte[], byte[]> connByteCodecSync = clusterClient.connect(ByteArrayCodec.INSTANCE);
connByteCodecSync.setReadFrom(ReadFrom.NEAREST);
connByteCodecSync.setTimeout(<1 second>);

StatefulRedisClusterConnection<byte[], byte[]> connByteCodecAsync = clusterClient.connect(ByteArrayCodec.INSTANCE);
connByteCodecAsync.setReadFrom(ReadFrom.MASTER);
connByteCodecAsync.setTimeout(<1 second>);

StatefulRedisClusterConnection<String, String> connStringCodec = clusterClient.connect(StringCodec.UTF8);
connStringCodec.setReadFrom(ReadFrom.MASTER);
connStringCodec.setTimeout(<1 second>);

It appears .sync() is then used on these connections for each access as the primitive for our cache interaction.
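
For illustration, a minimal sketch of that access pattern under the configuration above (the key and value literals here are hypothetical, not taken from our code):

RedisAdvancedClusterCommands<String, String> sync = connStringCodec.sync();

// Each cache interaction goes through the synchronous API of the shared stateful connection.
sync.set("example:key", "example-value");
String cached = sync.get("example:key");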

Expected behavior/code

I would expect closeStaleConnections() (or some other health check) to stop writing to these channels and instead create new channels.

Environment

  • Lettuce version(s): 5.3.x
  • Redis version: 7.0.x

Possible Solution

Mitigations are restarting the host, adding shards/nodes, or failing over to a node that hasn't been primary since boot.

Additional context

One quick way I've reproduced this is by using sudo ss -K dport = 6379 to terminate connections on the socket. The client will not recover until I fail over the primary.

mp911de (Collaborator) commented Oct 26, 2023

I'm not exactly sure what you're asking for. Let me explain how things work, then we might get to a common understanding.

Commands in Cluster operations are sent to a particular node. If the node goes down (network partition, process dies), then commands targeting that node remain queued and are sent to the node once the node comes back online.

Once commands are queued for a particular node, they reside in the queue until the connection is closed (i.e. because the node is no longer part of the cluster).

If you want a different behavior, then rejecting commands on disconnect is the way to go.
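
For illustration, a hedged sketch of what that could look like with the cluster options from this issue (variable names reuse the reporter's configuration; the concrete option values are assumptions, not a verified fix):

clusterClient.setOptions(ClusterClientOptions.builder()
    .topologyRefreshOptions(clusterTopologyRefreshOptions)
    .socketOptions(socketOptions)
    .autoReconnect(true)
    // Reject commands immediately while the connection is disconnected,
    // instead of buffering them for a node that may never return at that address.
    .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
    .build());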

I'd also be interested to hear what other behavior you would expect.

mp911de added the status: waiting-for-feedback label Oct 26, 2023
greeaida (Author) commented

Using the debug logs and your verbiage to convey how I understand this error:

  1. Commands sent over channel 0x0b2a62a5 (which is tied to a particular node, as is apparent from the Node CNAME) execute successfully.
[DEBUG]  (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] write(ctx, ClusterCommand [command=AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5], promise)
[DEBUG]  (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandEncoder: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379] writing command ClusterCommand [command=AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5]
[DEBUG]  (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] Received: 1341 bytes, 1 commands in the stack
[DEBUG]  (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] Stack contains: 1 commands
  2. The connection to this node is interrupted. I simulated this by closing the configured Redis socket with sudo ss -K dport = 6379. Channel 0x0b2a62a5 is deactivated and deregistered.
[DEBUG]  (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] channelInactive()
[DEBUG]  (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.DefaultEndpoint: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, epid=0x9] deactivating endpoint handler
[DEBUG]  (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] channelInactive() done
[DEBUG]  (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] channelUnregistered()
  3. With auto-reconnect disabled and the implicit reject-commands queue behavior, commands begin failing. Lettuce continues to attempt to send commands to channel 0x0b2a62a5, despite it being closed.
[DEBUG] (<Server Thread>-144) io.lettuce.core.cluster.PooledClusterConnectionProvider: getConnection(READ, 16140)
[DEBUG] (<Server Thread>-144) io.lettuce.core.protocol.DefaultEndpoint: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, epid=0xc] write() done
[DEBUG] (<Server Thread>-141) io.lettuce.core.RedisChannelHandler: dispatching command AsyncCommand [type=EVALSHA, output=IntegerOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
[INFO] (<Application>) io.lettuce.core.RedisException: Currently not connected. Commands are rejected
  4. The node comes back online. Periodic refresh observes the recovered Node CNAME and establishes a connection over a new channel to exchange cluster information.
  5. After node recovery, commands continue to be sent to the closed channel 0x0b2a62a5, where they are rejected.

The behavior I would expect would be for Lettuce to automatically establish a connection with the recovered node and send commands over the new healthy connection rather than the closed connection.

mp911de (Collaborator) commented Oct 26, 2023

> The behavior I would expect would be for Lettuce to automatically establish a connection with the recovered node and send commands over the new healthy connection rather than the closed connection.

Please pay attention to the endpoint id epid=0xc, which remains stable as the identifier of the logical connection. The channel is the physical connection that gets recreated upon reconnect.

Each endpoint is tied to an address defined by hostname and port.

> After node recovery,

This translates to: When the failed node is back again at the previous hostname and port.

Please also note that when a node is promoted from replica to master for the previously assigned slots, the commands still remain with the connection that is disconnected.

Conceptually, after sending a command to a particular connection (node), we do not know why the command was routed there (by intent, by ASK/MOVED redirection, by routing rules).

greeaida (Author) commented

> Please pay attention to the endpoint id epid=0xc, which remains stable as the identifier of the logical connection. The channel is the physical connection that gets recreated upon reconnect.

We cache DNS forever (JVM misconfiguration, not intentional). Does this mean an unhealthy endpoint / logical connection could be persisted indefinitely and prevent the channel from being recreated?
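
For context, a hedged sketch of the JVM knob behind "cache DNS forever": the networkaddress.cache.ttl security property (whether our hosts set it explicitly to -1 or inherit the forever-caching behavior from a security manager is an assumption):

// -1 caches successful DNS lookups forever; a positive value caches for that many seconds.
// This must take effect before the first lookup is cached (e.g. early in application startup).
java.security.Security.setProperty("networkaddress.cache.ttl", "60");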

> Please also note that when a node is promoted from replica to master for the previously assigned slots, the commands still remain with the connection that is disconnected.

An interesting symptom of our failure scenario is that even when nodes recover, commands continue to be rejected until a failover to a replica occurs. Healthy activity then resumes, but if the former master is later selected for promotion, commands once again become rejected and the failure recurs. I don't fully understand what 'remain with the connection that is disconnected' means; does this only apply to queued commands, or does it suggest that application logic is required to correct the connection after failover?

mp911de (Collaborator) commented Oct 26, 2023

Redis Cluster mostly works with IP addresses and you can configure DNS resolvers for Lettuce.
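
For illustration, a hedged sketch of plugging a DNS resolver into the ClientResources shown in the issue (DirContextDnsResolver is one resolver shipped with Lettuce; whether it fits the ElastiCache setup here is an assumption):

ClientResources clientResources = DefaultClientResources.builder()
    // DirContextDnsResolver performs DNS lookups itself (via JNDI) instead of going
    // through the JVM's InetAddress cache.
    .dnsResolver(new DirContextDnsResolver())
    .reconnectDelay(Delay.fullJitter(
        MIN_RECONNECT_DELAY, MAX_RECONNECT_DELAY, FULL_JITTER_BASE, TimeUnit.MILLISECONDS))
    .build();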

> I don't fully understand what 'remain with the connection that is disconnected' means; does this only apply to queued commands,

It applies to queued commands, and to commands routed to that connection if Partitions (the topology view) wasn't refreshed yet or no failover has happened.
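
As an aside, a hedged sketch of forcing the topology view to refresh from application code, in case waiting for the periodic or adaptive refresh is not acceptable (calling this from error handling is an assumption about the application, not a recommendation made in this thread):

// Re-reads the cluster topology from the known nodes and updates the client's Partitions view.
clusterClient.refreshPartitions();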

github-actions bot commented Oct 26, 2024

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 30 days this issue will be closed.

github-actions bot added the status: feedback-reminder label Oct 26, 2024