-
Notifications
You must be signed in to change notification settings - Fork 992
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce nodeFilter
Predicate
to filter Partitions
#1942
Comments
You can override I'm not sure whether TCP keep-alive is helping in this case. |
@mp911de Yeah I don't think TCP keep-alive helps either, but I wanted to call out that I had looked at it as a possibility. |
I tested reworking https://github.com/lettuce-io/lettuce-core/blob/4110f2820766c4967951639aa2b6bdd9d50466be/src/main/java/io/lettuce/core/cluster/RedisClusterClient.java#L1025-L1032 to filter out FAIL and EVENTUAL_FAIL nodes and achieved the same recovery after the periodic-topology-refresh interval. |
I wonder whether it makes generally sense to introduce a node filter ( |
I like the sound of that over a subclass; the behavior will look more baked-in. I did try calling removeIf against the Partitions collection with a predicate and it threw a unimplemented-error, so I guess that'd have to be added as well. |
nodeFilter
Predicate
to filter Partitions
That's in place now. |
Adding |
Is this the default behavior in latest lettuce without explicitly specifying the node filters? If the answer is no, what is the reason for not making this the default behavior? |
You can specify a |
Thanks for the reason! Can this be made the default in the next major version? This seems like a good default for most use cases. Folks run into this problem frequently that there is even a guide to set the predicates in AWS's documentation https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/BestPractices.Clients-lettuce.html#:~:text=Set%20nodeFilter,retrying%20is%20exhausted. |
Thanks @srgsanky for the background. Can you file a new ticket to flip the default? |
Bug Report
Current Behavior
When Lettuce is connected to a 2 node Redis Cluster (1 shard, 1 replica), and is configured for REPLICA_PREFERRED, and the replica ceases responding to TCP (such as a ungraceful hardware failure), Lettuce does not recover until the TCP retry counter expires for that connection (~926 seconds).
Input Code
Input Code
(Assumes the hostname `redis` resolves to all nodes in the group)Expected behavior/code
When the timeout is reached and a dynamic topology refresh is triggered, connections to the node in "fail?" state should be considered stale and closed / abandoned.
Environment
Possible Solution
A workaround, undesirable as it is global-to-the-node in nature, is to shorten the TCP retry counter on the client:
At first glance it looks like adding a filter for failed/eventual_fail nodes at https://github.com/lettuce-io/lettuce-core/blob/cda3be6b9477da790365ad098c6e39c8687f5002/src/main/java/io/lettuce/core/cluster/topology/DefaultClusterTopologyRefresh.java#L292-L296 would cap the duration of the failure scenario at the periodic-topology-refresh interval.
Additional context
TCPdump clearly shows the client getting a topology refresh occurs, but the client does not recover until the existing TCP connection exits.
Failure of the replica node was simulated by dropping all Redis packets on the replica:
Redis.conf contains:
TCP KeepAlive's do not help here as the connection is not idle.
The text was updated successfully, but these errors were encountered: