[BUG] Elevated Latency in pipeline SET and MGET calls with version 9.5.2 or higher #3282
Comments
@tim-zhu-reddit thank you for reporting this. Did any changes happen to the Redis server itself while this latency increase was observed? Was the change in latency based solely on the change of go-redis version?
Also, if you are correct that the bug is present in
@ndyakov thank you for taking a look!
@ndyakov: we started to notice the issue when we upgraded the client lib from v9.2.0 to v9.7.0. After noticing the issue, we did a "bisect" on the releases between them and found that v9.5.1 is the last good one and v9.5.2 is the first bad one. If you think #2961 is the only change that could be relevant in the 9.5.2 release, we can try testing 9.7.0 with a revert of that one applied (assuming it's still a clean revert) to see if we can reproduce the bug.
Update: We tested 9.7.0 with #2961 reverted (see https://github.com/fishy/go-redis/commits/38b8f52b4d563cb0c07557c254e4a9b16d25674e/) and verified that that version does not have the same issue. So the bug should be somewhere inside #2961.
@fishy: that is interesting. I will look into it, see if I can prepare a benchmark test that catches the performance degradation, and decide how to proceed. I expect this to take some time; in the meantime, I hope the approach with the reverted PR works for you. By the way, do you have timeouts on the context that you are passing, or are you canceling the context by any chance?
Understood. @fishy, are you observing anything additional in the node logs or go-redis logs? I assume we should add more logging around this piece of code with the
@ndyakov I looked into the server-side metrics for two tests we ran with the new version yesterday.
Below is a screenshot of the read and write request rate and latency by Redis node. In https://pkg.go.dev/github.com/go-redis/redis#ClusterOptions, we set ReadOnly = true and RouteRandomly = true. Could #2961 cause a behaviour change in routing read requests?
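For reference, a minimal sketch of those two options on the v9 cluster client; the import path and addresses are assumptions for illustration, not the reporter's actual setup:

```go
package main

import "github.com/redis/go-redis/v9"

func newClusterClient() *redis.ClusterClient {
	return redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"redis-node-1:6379", "redis-node-2:6379"}, // placeholder seed nodes

		// Allow read-only commands (GET, MGET, ...) to be served by replicas.
		ReadOnly: true,
		// Spread those reads randomly across eligible nodes.
		RouteRandomly: true,
	})
}
```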
Hm, what you are observing is quite strange. It is interesting to see a more equal distribution but higher latency. I am still focused on another feature, but I would like to dig deeper into this issue since we have pinpointed the exact patch that triggers it. It looks like the package is marking the nodes as unhealthy and rerouting requests to other nodes. What is quite strange to me is that during normal periods you don't seem to have an equal distribution of read requests. Maybe there is something in the application code that is reusing a sticky connection? (cc @tim-zhu-reddit, @fishy)
@ndyakov Server-side metrics look similar to PR #3190. It may be the case that an error (maybe a context error) happens in the pipelineReadCmds method and isBadConn() marks the node as failed. Maybe @fishy or @tim-zhu-reddit can share more information about the client configuration options and timeouts set.
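To make the question concrete, here is a hedged sketch of where such timeouts are typically configured in go-redis v9; every value below is illustrative, not the reporter's actual configuration:

```go
package main

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

func newClientWithTimeouts() *redis.ClusterClient {
	return redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"redis-node-1:6379"}, // placeholder

		// Client-level socket timeouts (illustrative values).
		DialTimeout:  500 * time.Millisecond,
		ReadTimeout:  200 * time.Millisecond,
		WriteTimeout: 200 * time.Millisecond,

		// Also apply per-call context deadlines to socket reads/writes.
		ContextTimeoutEnabled: true,
	})
}

// A caller-supplied context deadline like this is one possible source of the
// "context error" mentioned above.
func withDeadline(parent context.Context) (context.Context, context.CancelFunc) {
	return context.WithTimeout(parent, 200*time.Millisecond)
}
```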
@EXPEbdodla thank you for helping, here is our config:
@ndyakov, good question. I double-checked our code; we don't have any logic around sticky connections. Since we set
Expected Behavior
We use redis pipelining (https://pkg.go.dev/github.com/go-redis/redis#Pipeline.Exec) to execute a series of SET commands.
We expect the Pipeline.Exec() call to take less than 200ms at p99 and 2ms at p50.
We use MGET (https://pkg.go.dev/github.com/go-redis/redis#Ring.MGet) to get keys from Redis, and expect latency of 1ms at p50 and 100ms at p99 at peak times.
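For clarity, a minimal sketch of the two call patterns above, assuming the go-redis v9 Ring client (addresses, key names, and TTL are illustrative):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	rdb := redis.NewRing(&redis.RingOptions{
		Addrs: map[string]string{"shard1": "localhost:6379"}, // placeholder shard
	})
	ctx := context.Background()

	names := []string{"name:a", "name:b", "name:c"}

	// Pipelined SETs: the whole batch is flushed in one round trip per shard.
	if _, err := rdb.Pipelined(ctx, func(pipe redis.Pipeliner) error {
		for _, n := range names {
			pipe.Set(ctx, n, "1", time.Hour) // illustrative value and TTL
		}
		return nil
	}); err != nil {
		panic(err)
	}

	// MGET: fetch the same keys back in a single command.
	vals, err := rdb.MGet(ctx, names...).Result()
	if err != nil {
		panic(err)
	}
	fmt.Println(vals)
}
```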
Current Behavior
With version 9.5.2 (https://github.com/redis/go-redis/releases/tag/v9.5.2) and higher, p99 latency spiked above 4s for pipeline SET (causing timeouts) and above 2s for MGET, while p50 latency remains steady under 2ms and 1ms respectively.
We first noticed the issue after upgrading from 9.2.0 to 9.7.0. We bisect-tested the releases in between and determined that 9.5.2 was the first bad version with increased latency.
Possible Solution
Steps to Reproduce
Context (Environment)
We use Redis as a simple dedupe layer, acting like an LRU cache. When we see a new name, we write it to Redis, then check Redis to see if we have seen this name.
Peak-hour QPS is 350k for MGET and 25k for pipeline requests (max 25 SET commands in each pipeline request).
This latency issue seems to impact high-QPS services like this one; we haven't heard of other internal teams using the same go-redis version reporting a similar issue.
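As an illustration of that workload shape, a hedged sketch of batching at most 25 SET commands per pipeline call; the batch size comes from the description above, while key format, value, and TTL are assumptions:

```go
package dedupe

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// batchSet writes names in pipeline batches of at most 25 SET commands each,
// mirroring the workload described above. Values and TTL are illustrative.
func batchSet(ctx context.Context, rdb *redis.Ring, names []string) error {
	const maxPerPipeline = 25
	for start := 0; start < len(names); start += maxPerPipeline {
		end := start + maxPerPipeline
		if end > len(names) {
			end = len(names)
		}
		batch := names[start:end]
		if _, err := rdb.Pipelined(ctx, func(pipe redis.Pipeliner) error {
			for _, n := range batch {
				pipe.Set(ctx, n, "1", time.Hour)
			}
			return nil
		}); err != nil {
			return err
		}
	}
	return nil
}
```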
Detailed Description
Possible Implementation