
Non-recovering Redis command timed out #3010

Open
mshahmoradi87 opened this issue Oct 11, 2024 · 7 comments
Labels
status: waiting-for-feedback (We need additional information before we can continue)

Comments

mshahmoradi87 commented Oct 11, 2024

Bug Report

One container starts to throw "Redis command timed out" errors and never recovers.

Current Behavior

We have started to use AWS Redis Serverless. Since the switch from cluster-mode-enabled Redis to Redis Serverless, out of the blue one client among the many running tasks starts to time out commands continuously. Depending on the service, the size of the objects, and whether put or get commands are timing out, we see different side effects: sometimes it results in #705, sometimes only increased latency.

Note that in the services where this happens there are usually more than 12 tasks running, and only one container has this issue.

I could not correlate the timeouts with any CPU spike around the time of the event.

org.springframework.dao.QueryTimeoutException: Redis command timed out; nested exception is io.lettuce.core.RedisCommandTimeoutException: Command timed out after 5 second(s)
	at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:70)
	at org.springframework.data.redis.connection.lettuce.LettuceExceptionConverter.convert(LettuceExceptionConverter.java:41)
	at org.springframework.data.redis.PassThroughExceptionTranslationStrategy.translate(PassThroughExceptionTranslationStrategy.java:44)
	at org.springframework.data.redis.FallbackExceptionTranslationStrategy.translate(FallbackExceptionTranslationStrategy.java:42)
	at org.springframework.data.redis.connection.lettuce.LettuceConnection.convertLettuceAccessException(LettuceConnection.java:277)
	at org.springframework.data.redis.connection.lettuce.LettuceConnection.await(LettuceConnection.java:1085)
	at org.springframework.data.redis.connection.lettuce.LettuceConnection.lambda$doInvoke$4(LettuceConnection.java:938)
	at org.springframework.data.redis.connection.lettuce.LettuceInvoker$Synchronizer.invoke(LettuceInvoker.java:665)
	at org.springframework.data.redis.connection.lettuce.LettuceInvoker.just(LettuceInvoker.java:125)
	at org.springframework.data.redis.connection.lettuce.LettuceHashCommands.hSet(LettuceHashCommands.java:61)
	at org.springframework.data.redis.connection.DefaultedRedisConnection.hSet(DefaultedRedisConnection.java:1332)
	at org.springframework.data.redis.connection.DefaultStringRedisConnection.hSet(DefaultStringRedisConnection.java:631)
	at org.springframework.data.redis.core.DefaultHashOperations.lambda$put$14(DefaultHashOperations.java:254)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:224)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:191)
	at org.springframework.data.redis.core.AbstractOperations.execute(AbstractOperations.java:97)
	at org.springframework.data.redis.core.DefaultHashOperations.put(DefaultHashOperations.java:253)



	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.lettuce.core.RedisCommandTimeoutException: Command timed out after 5 second(s)
	at io.lettuce.core.internal.ExceptionFactory.createTimeoutException(ExceptionFactory.java:59)
	at io.lettuce.core.internal.Futures.awaitOrCancel(Futures.java:246)
	at io.lettuce.core.LettuceFutures.awaitOrCancel(LettuceFutures.java:74)
	at org.springframework.data.redis.connection.lettuce.LettuceConnection.await(LettuceConnection.java:1083)
	... 22 common frames omitted

Expected behavior/code

The expected behaviour is that, with periodic refresh enabled, the issue does not persist.

Environment

AWS Redis Serverless 7.1
<lettuce.version>6.3.2.RELEASE</lettuce.version>
<netty.version>4.1.109.Final</netty.version>

related configurations (a rough Lettuce mapping sketch follows the list):
ssl: true
timeout: 5s
connect-timeout: 500ms
additional:
dns-ttl: 5s
periodic-refresh: 10s
reconnect-delay-min: 100ms
reconnect-delay-max: 5s
read-from: replicaPreferred
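
For reference, here is a rough sketch of how the properties above might be expressed directly against the Lettuce cluster client API. The class name and endpoint are placeholders, the DNS TTL and reconnect delays are omitted, and the mapping itself is an assumption based on the listed values rather than the actual application code:

    import java.time.Duration;

    import io.lettuce.core.ReadFrom;
    import io.lettuce.core.RedisURI;
    import io.lettuce.core.SocketOptions;
    import io.lettuce.core.TimeoutOptions;
    import io.lettuce.core.cluster.ClusterClientOptions;
    import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
    import io.lettuce.core.cluster.RedisClusterClient;
    import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

    public class ClusterConfigSketch {

        public static void main(String[] args) {
            // "clustercfg.example.amazonaws.com" is a placeholder endpoint
            RedisURI uri = RedisURI.Builder.redis("clustercfg.example.amazonaws.com")
                    .withSsl(true)                          // ssl: true
                    .withTimeout(Duration.ofSeconds(5))     // timeout: 5s (command timeout)
                    .build();

            ClusterTopologyRefreshOptions refresh = ClusterTopologyRefreshOptions.builder()
                    .enablePeriodicRefresh(Duration.ofSeconds(10))  // periodic-refresh: 10s
                    .enableAllAdaptiveRefreshTriggers()             // also refresh on MOVED/ASK and reconnects
                    .build();

            ClusterClientOptions options = ClusterClientOptions.builder()
                    .topologyRefreshOptions(refresh)
                    .socketOptions(SocketOptions.builder()
                            .connectTimeout(Duration.ofMillis(500)) // connect-timeout: 500ms
                            .build())
                    .timeoutOptions(TimeoutOptions.enabled(Duration.ofSeconds(5)))
                    .build();

            RedisClusterClient client = RedisClusterClient.create(uri);
            client.setOptions(options);

            try (StatefulRedisClusterConnection<String, String> connection = client.connect()) {
                connection.setReadFrom(ReadFrom.REPLICA_PREFERRED); // read-from: replicaPreferred
                // ... application code
            } finally {
                client.shutdown();
            }
        }
    }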

Possible Solution

RedisCommandTimeoutException intermittently:
In some cases this could be related, as we sometimes see timeouts for a minute which then recover; in this case, however, the timeouts persist.

Lettuce cannot recover from connection problems:
Maybe setting TCP_USER_TIMEOUT would help. The default is 1 minute, so it might help for the recovering case, but probably not for the cases where the timeouts continue and commands get stuck.

@tishun added the "for: team-attention" and "status: waiting-for-triage" labels Oct 11, 2024
@tishun removed the "for: team-attention" and "status: waiting-for-triage" labels Nov 1, 2024
tishun (Collaborator) commented Nov 1, 2024

This would be a hard problem to help with.

I can't comment on why you are getting timeouts, as this seems to be related to the infrastructure (especially since it started happening after the infrastructure change). In this case increased latency is unavoidable, as the application has no way to handle it differently. Periodic refresh would help for new requests, but the ones that have already been sent all need to time out first. A TCP_USER_TIMEOUT would probably help, especially if you are not getting an ACK from the server (you could verify that by sniffing your traffic).

Now, depending on the load, your command queue might fill up quite rapidly and you might reach OOM issues. Again, this is avoidable with configuration tweaking (lower timeout, more memory).

All of this is generic advice, though. To help more we would need a detailed picture of what happens and ideally a way to reproduce it.
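
As a side note on TCP_USER_TIMEOUT: recent Lettuce versions expose socket-level keep-alive (6.1+) and TCP_USER_TIMEOUT (6.2+) settings via SocketOptions. A minimal sketch, assuming those versions, the Linux epoll native transport for TCP_USER_TIMEOUT, and arbitrarily chosen durations (the class name is made up):

    import java.time.Duration;

    import io.lettuce.core.SocketOptions;
    import io.lettuce.core.cluster.ClusterClientOptions;

    public class SocketOptionsSketch {

        public static void main(String[] args) {
            SocketOptions socketOptions = SocketOptions.builder()
                    // TCP keep-alive: detect silently dropped connections earlier than the OS defaults
                    .keepAlive(SocketOptions.KeepAliveOptions.builder()
                            .enable()
                            .idle(Duration.ofSeconds(30))
                            .interval(Duration.ofSeconds(10))
                            .count(3)
                            .build())
                    // TCP_USER_TIMEOUT: fail writes whose data is not ACKed within the given time
                    .tcpUserTimeout(SocketOptions.TcpUserTimeoutOptions.builder()
                            .enable()
                            .tcpUserTimeout(Duration.ofSeconds(30))
                            .build())
                    .build();

            ClusterClientOptions options = ClusterClientOptions.builder()
                    .socketOptions(socketOptions)
                    .build();
            // apply via RedisClusterClient#setOptions(options)
        }
    }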

@tishun added the "status: waiting-for-feedback" label Nov 1, 2024
mshahmoradi87 (Author) commented Nov 11, 2024

Thanks for the hints.

This is hard to reproduce; we tried multiple times in our dev environment with no luck. We only see it happening in the prod environment. Also, AWS has made improvements, so the issue went from happening 2–3 times a week to about once a month, and it has not happened for about 6 weeks now.

One thing to clarify: in this special case, when the timeouts start they never end or recover. There are multiple instances of a task running, e.g. 12 tasks across 3 Availability Zones, and in some cases other services with multiple tasks connect to the same Redis cluster as well, yet only one task faces this issue, and it happens after we see some reconnect logs.

My expectation is that with periodic refresh the new requests should start to work, but they don't, so somehow the periodic refresh does not take effect and things get stuck.

We have enabled compression and tweaked some memory-related configs, and we no longer see OOM.

tishun (Collaborator) commented Nov 12, 2024

The expected behaviour is that, with periodic refresh enabled, the issue does not persist.

My expectation is that with periodic refresh the new requests should start to work, but they don't, so somehow the periodic refresh does not take effect and things get stuck.

Periodic refresh does not affect already queued commands. So if the service experienced a slowdown just before stopping, the queue might be full with a lot of commands (and this is likely, given that you experienced OOM). In this case, even if the periodic refresh detects that the node is down, the already queued commands are still executed against it.

Periodic refresh only affects newly issued commands.

This behavior is by design: we are not sure why a command ended up being sent to a specific node (whether it was a MOVED response, a design choice by the app developer, or random), so we can't be sure where we would need to resend it.
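
On the queued-commands point, the queue growth itself can be bounded so that a stalled connection fails fast instead of buffering commands until memory runs out. A minimal sketch using standard ClientOptions settings, with an arbitrarily chosen queue size (the class name is made up):

    import io.lettuce.core.ClientOptions;
    import io.lettuce.core.cluster.ClusterClientOptions;

    public class QueueLimitSketch {

        public static void main(String[] args) {
            ClusterClientOptions options = ClusterClientOptions.builder()
                    // cap the number of commands that may be buffered per connection
                    // (the default is effectively unbounded at Integer.MAX_VALUE)
                    .requestQueueSize(10_000)
                    // fail fast instead of buffering commands while the connection is down
                    .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
                    .build();
            // apply via RedisClusterClient#setOptions(options)
        }
    }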

mshahmoradi87 (Author) commented Nov 12, 2024

Periodic refresh only affects newly issued commands.

In this case, when the timeouts start, they do not stop. All new requests also time out, and there is no reconnect. There were cases where this went on for 3–4 hours.

tishun (Collaborator) commented Nov 12, 2024

All new requests also time out, and there is no reconnect.

How do you verify that? Did you use the event bus to monitor for reconnection events?
Does the periodic refresh result in new topology being retrieved?

What is your strategy to trace the problem? Did you start recording more verbose (debug) log messages?

mshahmoradi87 (Author) commented Nov 13, 2024

All new requests also time out, and there is no reconnect.

How do you verify that? Did you use the event bus to monitor for reconnection events? Does the periodic refresh result in new topology being retrieved?

What is your strategy to trace the problem? Did you start recording more verbose (debug) log messages?

We log reconnect events. When the timeouts start in a specific container, I don't see any reconnect after that in that container. All put/get requests are timing out, and I know they are new requests, because I see other application logs of a new event (with its initiation timestamp) being processed, then the Redis timeout, and then an S3 call, which happens on a timeout or a cache miss.

Furthermore, old events should expire after a few minutes; as I said, when this happens it does not stop, and I see timeouts continuously.

For tracing, I have enabled profiling, something similar to getting a heap dump when we had the memory issue. As this happens only in the production environment, I can't run in debug mode, both from a cost and a PCI perspective.

tishun (Collaborator) commented Nov 13, 2024

We log reconnect events. When the timeouts start in a specific container, I don't see any reconnect after that in that container. All put/get requests are timing out, and I know they are new requests, because I see other application logs of a new event (with its initiation timestamp) being processed, then the Redis timeout, and then an S3 call, which happens on a timeout or a cache miss.

A reconnect would only happen if the remote service recovers. I assume the correct thing to expect would instead be for the driver to route newly issued commands to other nodes after it refreshes the topology.

Furthermore, old events should expire after a few minutes; as I said, when this happens it does not stop, and I see timeouts continuously.

If newly issued commands (sent after the topology change) are still timing out, then it is reasonable to assume that the driver never picks up the new topology and thus continues to send commands to the old (failed) node. Have you verified whether the remote address of the commands changes or stays the same? Have you enabled debug logging to check whether the topology is actually refreshed?

You can do that by subscribing to the event bus and listening for events such as AdaptiveRefreshTriggeredEvent (if adaptive refresh is enabled) or ClusterTopologyChangedEvent (or just listen for all events and log them, just in case). Sample code:

    import io.lettuce.core.RedisClient;
    import io.lettuce.core.RedisURI;
    import io.lettuce.core.event.EventBus;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class EventBusLogging {
        private static final Logger logger = LoggerFactory.getLogger(EventBusLogging.class);

        public static void main(String[] args) {
            RedisURI redisURI = RedisURI.Builder
                    .redis("acme.com")
                    .withPort(1234)
                    .build();

            try (RedisClient client = RedisClient.create(redisURI)) {

                // log every event published on the client's event bus
                EventBus eventBus = client.getResources().eventBus();
                eventBus.get().subscribe(e -> logger.info(e.toString()));

                // ... rest of the application
            }
        }
    }

At this point this is the most reasonable assumption (for some reason the topology is not being refreshed).

Does the driver report topology refresh events? Enabling debug messages should output "Using a new cluster topology" after a new topology is retrieved. You can enable debug mode only for the io.lettuce.core.cluster package, or even for io.lettuce.core.cluster.RedisClusterClient specifically, to reduce the noise.
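
As an illustration of that last point: with Logback (the Spring Boot default), the level can be raised for just that package, either through the usual logging.level.io.lettuce.core.cluster=DEBUG property or programmatically, as sketched below (the class and method names are made up):

    import ch.qos.logback.classic.Level;
    import ch.qos.logback.classic.Logger;
    import org.slf4j.LoggerFactory;

    public class LettuceClusterDebugLogging {

        public static void enableClusterDebugLogging() {
            // raise only the Lettuce cluster topology code to DEBUG, leaving the rest of the app untouched
            Logger clusterLogger = (Logger) LoggerFactory.getLogger("io.lettuce.core.cluster");
            clusterLogger.setLevel(Level.DEBUG);
        }
    }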
