Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only check latencies once every 10 seconds with routeByLatency #2795

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

justinmir
Copy link

@justinmir justinmir commented Nov 10, 2023

routeByLatency currently checks latencies any time a server returns a MOVED or READONLY reply. When a shard is down, the ClusterClient chooses to issue the request to a random server, which returns a MOVED reply. This causes a state refresh and a latency update on all servers. This can lead to significant ping load to clusters with a large number of clients.

This introduces logic to ping only once every 10 seconds, only performing a latency update on a node during the GC function if the latency was set later than 10 seconds ago.

Fixes #2782

image
Figure: Ping behavior of the client running 21bd40a and a client running this PR. When shards are failed the current cluster client will spam pings while the fixed cluster client will only ping each server once every 10 seconds.

This shows the impact in a running large production cluster. The cluster is handling ~4M pings per second due to this behavior.
image

`routeByLatency` currently checks latencies any time a server returns
a MOVED or READONLY reply. When a shard is down, the ClusterClient
chooses to issue the request to a random server, which returns a MOVED
reply. This causes a state refresh and a latency update on all servers.
This can lead to significant ping load to clusters with a large number
of clients.

This introduces logic to ping only once every 10 seconds, only
performing a latency update on a node during the `GC` function if the
latency was set later than 10 seconds ago.

Fixes redis#2782
@justinmir justinmir marked this pull request as ready for review November 10, 2023 19:47
@ofekshenawa
Copy link
Collaborator

LGTM!
WDYT about changing Unix() to NanoUnix? To be more precise and to avoid unnecessary loops

@justinmir
Copy link
Author

Sure I'll push that change shortly.

@justinmir
Copy link
Author

@ofekshenawa PTAL when you get a chance!

@justinmir
Copy link
Author

@vladvildanov hoping to get some eyes here, this will help us no longer have to maintain our own fork

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RouteByLatency leads to significant ping load when all servers of a shard are failed
2 participants