Can I reconnect on RedisCommandTimeoutException? #2870

e-ts · 2024-06-03T12:42:20Z

Can I reconnect to the node used when I catch a RedisCommandTimeoutException for a command to a Redis Cluster?

We are having a problem where the old master does not respond for 10 seconds after a FAILOVER is issued to its replica. TCP packets with new requests still get acked during these 10 seconds. As the connection is clearly not dead, Lettuce keeps sending new commands to the old master. Eventually, it receives all the MOVED responses at once, but this is too late for us.

For our specific problem, it would be better if Lettuce reconnected to the node on command timeout as the bug only seems to affect a single TCP socket. A command on a new socket will get an immediate MOVED response, allowing Lettuce to continue on the master.

I guess it could be tricky to get this right as all the requests in flight will time out at different times and we probably do not want to reconnect for each timeout.
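
One way to avoid reconnecting once per timed-out command is to debounce the reaction: the first timeout in a window triggers the reconnect and the rest are ignored. The sketch below is a minimal illustration of that idea in plain Java. `ReconnectDebouncer`, `tryAcquire` and the cool-down value are hypothetical names chosen for this example and are not part of Lettuce.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not part of Lettuce): lets at most one caller per
 * cool-down window trigger a reconnect, however many commands time out.
 */
final class ReconnectDebouncer {

    private final long cooldownNanos;
    // Timestamp of the last granted reconnect; Long.MIN_VALUE means "never".
    private final AtomicLong lastGranted = new AtomicLong(Long.MIN_VALUE);

    ReconnectDebouncer(Duration cooldown) {
        this.cooldownNanos = cooldown.toNanos();
    }

    /** Returns true for the first caller in each cool-down window, false otherwise. */
    boolean tryAcquire(long nowNanos) {
        while (true) {
            long last = lastGranted.get();
            if (last != Long.MIN_VALUE && nowNanos - last < cooldownNanos) {
                return false; // a reconnect was already granted recently
            }
            if (lastGranted.compareAndSet(last, nowNanos)) {
                return true; // this caller won the race and should reconnect
            }
            // Another thread updated the timestamp concurrently; re-check.
        }
    }

    /** Convenience overload using the monotonic system clock. */
    boolean tryAcquire() {
        return tryAcquire(System.nanoTime());
    }
}
```

A caller would check `tryAcquire()` in the handler that catches RedisCommandTimeoutException and only act when it returns true. How the socket actually gets replaced is left open here.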

Of course, we are trying to get the underlying problem with Redis resolved too (see #2572), but a workaround like this would still be useful until that gets fixed.

I have checked the wiki, GitHub issues and GitHub Discussions and found #2082 which is similar but in that case, the TCP packets do not get acked, leading to another solution.

I tried setting an absurdly low periodic refresh of a few hundred milliseconds but that does not seem to help, which might be a bug but I have not looked into it yet.

tishun · 2024-06-03T13:21:45Z

Maintainer

Hey @e-ts ,

this is a tricky question. In your scenario you know that a command would time out because the Redis instance delays its responses during failover, but this is a very specific scenario. In practice a command could time out for many different reasons (network delay, server load, etc.), and in many of those cases the correct approach would be to resend the command to the same instance without reconnecting.

Reconnect is a slow process and if we reconnect on each timeout we might drastically decrease the performance of the driver in many of those cases.
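
To illustrate the concern, a client-side policy could treat a single timeout as noise and only consider a reconnect after several timeouts in a row. The following is a minimal sketch under that assumption. `TimeoutStreakDetector` and its threshold are hypothetical and not part of Lettuce, and a suitable threshold depends entirely on the application.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper (not part of Lettuce): flags a node as suspect only
 * after a run of consecutive command timeouts with no success in between.
 */
final class TimeoutStreakDetector {

    private final int threshold;
    private final AtomicInteger consecutiveTimeouts = new AtomicInteger();

    TimeoutStreakDetector(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    /** Call when a command completes normally; any success resets the streak. */
    void onSuccess() {
        consecutiveTimeouts.set(0);
    }

    /**
     * Call when a command times out. Returns true only for the timeout that
     * brings the streak up to the threshold, so later timeouts in the same
     * streak do not trigger again.
     */
    boolean onTimeout() {
        return consecutiveTimeouts.incrementAndGet() == threshold;
    }
}
```

With a threshold of 3, two timeouts followed by a successful command do not trigger anything; three timeouts in a row do, exactly once.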

You mentioned #2082, did you manage to check out the option to set a custom TCP_USER_TIMEOUT with #2499?
For example, to configure the client to detect such a scenario on a 3 second timeout:

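        import java.time.Duration;
        import io.lettuce.core.ClientOptions;
        import io.lettuce.core.RedisClient;
        import io.lettuce.core.RedisURI;
        import io.lettuce.core.SocketOptions;
        import io.lettuce.core.api.sync.RedisCommands;
        import io.lettuce.core.codec.StringCodec;

        RedisClient redisClient = RedisClient.create(RedisURI.Builder
                .redis("redis.io", 12000)
                .build());

        SocketOptions socketOptions = SocketOptions.builder()
                // Drop the connection if sent data stays unacknowledged for 3 seconds.
                .tcpUserTimeout(SocketOptions.TcpUserTimeoutOptions.builder()
                        .enable(true)
                        .tcpUserTimeout(Duration.ofSeconds(3))
                        .build())
                // Keepalive: first probe after 5s idle, then every 5s, give up after 3 failed probes.
                .keepAlive(SocketOptions.KeepAliveOptions.builder()
                        .interval(Duration.ofSeconds(5))
                        .idle(Duration.ofSeconds(5))
                        .count(3)
                        .enable()
                        .build())
                .build();

        redisClient.setOptions(ClientOptions.builder().socketOptions(socketOptions).build());

        RedisCommands<String, String> redis = redisClient.connect(new StringCodec()).sync();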

If you do go that way, keep in mind that, as with my previous note, this timeout might be triggered by different (valid) scenarios that do not require a reconnect, so be mindful of the value you set. But I think, from your description, it should resolve your issue if your deployment is such that the server always responds within a couple of seconds. Also keep in mind that it is generally a good idea to configure KEEPALIVE too. All values there are highly dependent on your application and deployment.

8 replies

e-ts Jun 4, 2024
Author

TCP_USER_TIMEOUT specifies the maximum amount of time that transmitted data may remain unacknowledged, or buffered data may remain untransmitted. In our case, data is acknowledged at the TCP layer, so TCP_USER_TIMEOUT will not help. We're using it anyway, and keepAlive, but for another reason. They do not help in this specific scenario.

I can close the entire StatefulGenericConnection on RedisCommandTimeoutException and make some logic to set up a new one but it seems way too blunt for this scenario. Closing just that TCP socket would suffice and it would be nice to make use of Lettuce's excellent built-in functionality to reconnect. I don't see how I can do that with Lettuce's API but I might be missing something, and that's why I opened this discussion.
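
For reference, the blunt approach described above (close the whole connection on timeout and set up a new one) could be sketched roughly as follows. This is a plain-Java illustration, not Lettuce API: `ReplaceableConnection` is a hypothetical holder, `C` stands in for whatever connection type is used, and the factory is assumed to be something like a call to `RedisClusterClient#connect()`.

```java
import java.util.function.Supplier;

/**
 * Hypothetical holder (not part of Lettuce): owns one connection and swaps
 * it for a fresh one on request. C is any AutoCloseable connection type.
 */
final class ReplaceableConnection<C extends AutoCloseable> {

    private final Supplier<C> factory;
    private C current;

    ReplaceableConnection(Supplier<C> factory) {
        this.factory = factory;
        this.current = factory.get();
    }

    /** Returns the connection that callers should currently use. */
    synchronized C get() {
        return current;
    }

    /**
     * Replaces the connection, but only if {@code failed} is still the
     * current one. Callers whose command timed out on an already replaced
     * connection get false and nothing happens, so a burst of timeouts
     * leads to a single replacement.
     */
    synchronized boolean replace(C failed) {
        if (current != failed) {
            return false; // someone else already replaced it
        }
        current = factory.get();
        try {
            failed.close();
        } catch (Exception e) {
            // The old connection is being discarded anyway; ignore close failures.
        }
        return true;
    }
}
```

Passing the failed connection into `replace` is what keeps concurrent timeouts from each creating a new connection. It still tears down every node connection behind a cluster connection, which is why it feels too blunt for a problem limited to one socket.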

Can I invalidate the TCP connection used when I catch a RedisCommandTimeoutException for a command to a Redis Cluster?

Maybe something clever can be done with NettyCustomizer#afterChannelInitialized but I'm not sure where to begin without digging deep into Lettuce internals.

tishun Jul 2, 2024
Maintainer

> TCP_USER_TIMEOUT specifies the maximum amount of time that transmitted data may remain unacknowledged, or buffered data may remain untransmitted. In our case, data is acknowledged at the TCP layer, so TCP_USER_TIMEOUT will not help. We're using it anyway, and keepAlive, but for another reason. They do not help in this specific scenario.

My bad, I missed the fact that the packets are being acknowledged. You are right, TCP_USER_TIMEOUT is useless in this case.

> I can close the entire StatefulGenericConnection on RedisCommandTimeoutException and make some logic to set up a new one but it seems way too blunt for this scenario. Closing just that TCP socket would suffice and it would be nice to make use of Lettuce's excellent built-in functionality to reconnect. I don't see how I can do that with Lettuce's API but I might be missing something, and that's why I opened this discussion.

The closest I was able to find was what we do in ClientOptionsIntegrationTests.java, but this is not something I can propose as a solution with a clean conscience.

TBH it looks to me like a good thing to have: a forceReconnect() method on the StatefulConnection interface.
If you think that would solve the problem we can create a ticket out of it.

> Can I invalidate the TCP connection used when I catch a RedisCommandTimeoutException for a command to a Redis Cluster?
>
> Maybe something clever can be done with NettyCustomizer#afterChannelInitialized but I'm not sure where to begin without digging deep into Lettuce internals.

Perhaps this is possible too, but I can't think of a way right now. If you do end up going that way, please share your solution so that others might benefit from it.

Answer selected by e-ts

e-ts Jul 8, 2024
Author

Yes, I think a method on StatefulRedisConnection to force a reconnect would help.

I tried getting the failing StatefulRedisConnection from a StatefulRedisClusterConnection with SlotHash#getSlot, Partitions#getMasterBySlot and StatefulRedisClusterConnection#getConnection, but closing it with close didn't seem to actually close the underlying connection.

I can see io.lettuce.core.RedisChannelHandler#closeAsync being called for the expected node, but tcpdump shows that Lettuce continues to use the same TCP socket for new commands indefinitely.

tishun Jul 17, 2024
Maintainer

Hey @e-ts ,

I was just looking at this method which already exists:

    /**
     * Instructs Redis to disconnect the connection. Note that if auto-reconnect is enabled then Lettuce will auto-reconnect if
     * the connection was disconnected. Use {@link io.lettuce.core.api.StatefulConnection#close} to close connections and
     * release resources.
     *
     * @return String simple-string-reply always OK.
     */
    RedisFuture<String> quit();

Would that do the job?

e-ts Jul 18, 2024
Author

That looks great but I don't think I got it to work.

I used StatefulRedisClusterConnection#getConnection(String) to get a StatefulRedisConnection for the node, StatefulRedisConnection#sync to get RedisCommands, and finally BaseRedisCommands#quit to make it disconnect.

With a log message, I can see that we must have reached this quit in our reproducer, but studying tcpdump, I see that the same TCP socket is used throughout the whole problem period and beyond. I see no other sockets being used to send commands to either of the two nodes, except for what I think are periodic cluster refreshes.

Update: removing the topology refresh configuration in my reproducer, I see we do create new sockets, but they are only used to send QUIT and we do not send QUIT on the sockets that are used to send the main commands. =)

Tested on Lettuce 6.3.1 and 6.3.2

tishun Jul 19, 2024
Maintainer

Could you share some steps so I can reproduce this?

I verified locally that we disconnect when we call .quit(), so I assume there is something specific to your use case.
