Lettuce writing to closed channels, results in persistent failure #2530
Comments
I'm not exactly sure what you're asking for. Let me explain how things work, then we might get to a common understanding. Commands in Cluster operations are sent to a particular node. If the node goes down (network partition, process dies), then commands targeting that node remain queued and are sent to the node once it comes back online. Once commands are queued for a particular node, they reside in the queue until the connection is closed (i.e. because the node is no longer part of the cluster). If you want a different behavior, then rejecting commands on disconnect is the way to go. I'd also be interested to hear what other behavior you would expect.
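For reference, rejecting commands on disconnect is driven by ClientOptions.DisconnectedBehavior. A minimal sketch of that configuration (the cluster endpoint hostname below is a placeholder, not taken from this report):

```java
import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.RedisClusterClient;

public class RejectOnDisconnect {

    public static void main(String[] args) {
        // Placeholder endpoint; substitute your cluster configuration endpoint.
        RedisClusterClient clusterClient =
                RedisClusterClient.create(RedisURI.create("redis://cluster.example.com:6379"));

        // Reject commands immediately while a node's connection is down,
        // instead of buffering them until the connection is re-established.
        clusterClient.setOptions(ClusterClientOptions.builder()
                .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
                .build());
    }
}
```

With REJECT_COMMANDS, callers get an immediate exception while a node is unreachable; the default behavior buffers commands as long as auto-reconnect is enabled.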
Using the debug logs and your verbiage to convey how I understand this error:
[DEBUG] (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] write(ctx, ClusterCommand [command=AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5], promise)
[DEBUG] (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandEncoder: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379] writing command ClusterCommand [command=AsyncCommand [type=GET, output=ValueOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command], redirections=0, maxRedirections=5]
[DEBUG] (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] Received: 1341 bytes, 1 commands in the stack
[DEBUG] (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] Stack contains: 1 commands
[DEBUG] (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] channelInactive()
[DEBUG] (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.DefaultEndpoint: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, epid=0x9] deactivating endpoint handler
[DEBUG] (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] channelInactive() done
[DEBUG] (lettuce-epollEventLoop-4-9) io.lettuce.core.protocol.CommandHandler: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, chid=0x9] channelUnregistered()
[DEBUG] (<Server Thread>-144) io.lettuce.core.cluster.PooledClusterConnectionProvider: getConnection(READ, 16140)
[DEBUG] (<Server Thread>-144) io.lettuce.core.protocol.DefaultEndpoint: [channel=0x0b2a62a5, /<Server IP>:<Server Port> -> <Node CNAME>/<Node IP>:6379, epid=0xc] write() done
[DEBUG] (<Server Thread>-141) io.lettuce.core.RedisChannelHandler: dispatching command AsyncCommand [type=EVALSHA, output=IntegerOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
[INFO] (<Application>) io.lettuce.core.RedisException: Currently not connected. Commands are rejected
The behavior I would expect would be for Lettuce to automatically establish a connection with the recovered node and send commands over the new healthy connection rather than the closed connection.
Please pay attention to the endpoint id. Each endpoint is tied to a node address defined by hostname and port.
This translates to: queued commands are only replayed when the failed node comes back at the previous hostname and port. Please also note that when a node is promoted from replica to master for the previously assigned slots, the commands still remain with the connection that is disconnected. Conceptually, after sending a command to a particular connection (node), we do not know why the command was routed there (by intent, by ASK/MOVED redirection, by routing rules).
We cache DNS forever (JVM misconfiguration, not intentional). Does this mean an unhealthy endpoint / logical connection could be persisted indefinitely and prevent the channel from being recreated?
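For context on the JVM side, the forever-caching behavior is controlled by the networkaddress.cache.ttl security property; a sketch of capping it, assuming it is set before the first lookup happens:

```java
import java.security.Security;

public class DnsCacheTtl {

    public static void main(String[] args) {
        // A TTL of -1 means "cache successful lookups forever"; a small positive
        // value lets the JVM pick up new IPs after a node is replaced.
        // This must run before the first DNS lookup to take effect reliably.
        Security.setProperty("networkaddress.cache.ttl", "60");
    }
}
```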
An interesting symptom of our failure scenario is that even when nodes recover, commands continue to be rejected until a failover to a replica occurs. Healthy activity then resumes, but if the former master is later selected for promotion, commands once again become rejected and the failure recurs. I don't fully understand what 'commands still remain with the connection that is disconnected' means; does this only apply to queued commands, or does it suggest that application logic is required to correct the connection after failover?
Redis Cluster mostly works with IP addresses and you can configure DNS resolvers for Lettuce.
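A sketch of wiring a DNS resolver through ClientResources (this assumes a Lettuce version where DnsResolvers is available; dnsResolver is deprecated in favor of Netty's AddressResolverGroup in newer releases, and the endpoint is a placeholder):

```java
import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DnsResolvers;

public class DnsResolverSetup {

    public static void main(String[] args) {
        // UNRESOLVED defers name resolution to connect time instead of
        // handing Lettuce a pre-resolved (and possibly stale) InetAddress.
        ClientResources resources = ClientResources.builder()
                .dnsResolver(DnsResolvers.UNRESOLVED)
                .build();

        RedisClusterClient clusterClient = RedisClusterClient.create(
                resources, RedisURI.create("redis://cluster.example.com:6379"));
    }
}
```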
To queued commands and commands routed to that connection if
If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 30 days this issue will be closed.
Bug Report
Preface: I expect that this is a client configuration issue, but I have been unable to find similar issues, similar documented scenarios, or relevant networking details.
This is using cluster mode with ElastiCache.
Current Behavior
When a connection becomes unhealthy (e.g. after a failover or network disruption), there is no cache activity until a new node becomes primary. Once every node in the shard has already served as primary, cache activity within the shard halts indefinitely.
Stack trace
Note that this will continue indefinitely. Healthy periodic refresh, cluster topology refresh, and close-stale-connections activity are still evident on the scheduled cadence.
Read/write activity never reaches the socket.
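For reference, the scheduled refresh activity described above corresponds to a cluster topology refresh configuration along these lines (the interval and options here are illustrative, not our actual settings):

```java
import java.time.Duration;

import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;

public class TopologyRefreshSetup {

    public static void main(String[] args) {
        // Periodic refresh re-reads the cluster topology on a fixed cadence;
        // adaptive triggers also refresh on MOVED/ASK redirects and disconnects.
        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(60))
                .enableAllAdaptiveRefreshTriggers()
                .closeStaleConnections(true)
                .build();

        RedisClusterClient clusterClient =
                RedisClusterClient.create(RedisURI.create("redis://cluster.example.com:6379"));
        clusterClient.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build());
    }
}
```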
Input Code
It then appears .sync() is used on these connections on each access as the primitive for our cache interaction; roughly the pattern sketched below.
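A hypothetical sketch of that access pattern (the endpoint and key are placeholders, not from this report):

```java
import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;
import io.lettuce.core.cluster.api.sync.RedisAdvancedClusterCommands;

public class SyncAccess {

    public static void main(String[] args) {
        RedisClusterClient clusterClient =
                RedisClusterClient.create(RedisURI.create("redis://cluster.example.com:6379"));
        StatefulRedisClusterConnection<String, String> connection = clusterClient.connect();

        // Each cache access goes through the synchronous API of the shared connection.
        RedisAdvancedClusterCommands<String, String> commands = connection.sync();
        String value = commands.get("some-cache-key");
    }
}
```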
Expected behavior/code
I would expect closeStaleConnections() (or some other health check) to stop writing to these channels and instead create new channels.
Environment
Possible Solution
Mitigations are restarting the host, adding shards/nodes, or failing over to a node that hasn't been primary since boot.
Additional context
One quick way I've reproduced this is by using
sudo ss -K dport = 6379
to terminate connections on the socket. It will not recover until I fail over the primary.