[BUG] Elevated Latency in pipeline SET and MGET calls with version 9.5.2 or higher #3282

Open
tim-zhu-reddit opened this issue Feb 19, 2025 · 14 comments

Expected Behavior

We use Redis pipelining (https://pkg.go.dev/github.com/go-redis/redis#Pipeline.Exec) to execute a series of SET commands.
We expect the Pipeline.Exec() call to take less than 2ms at p50 and less than 200ms at p99.
We use MGET (https://pkg.go.dev/github.com/go-redis/redis#Ring.MGet) to fetch keys from Redis, and expect latency of about 1ms at p50 and 100ms at p99 at peak times.
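
A minimal sketch of the two call patterns being measured (not our production code; the client setup, key names, and TTL are placeholders, and the v9 module path is assumed):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"redis-node-1:6379"}, // placeholder address
	})

	// Pipelined SET: expected ~2ms p50 / <200ms p99 per Exec call.
	pipe := rdb.Pipeline()
	for _, name := range []string{"name:a", "name:b", "name:c"} {
		pipe.Set(ctx, name, "1", time.Hour)
	}
	start := time.Now()
	if _, err := pipe.Exec(ctx); err != nil {
		log.Printf("pipeline exec: %v", err)
	}
	log.Printf("pipeline SET took %s", time.Since(start))

	// MGET: expected ~1ms p50 / <100ms p99.
	start = time.Now()
	vals, err := rdb.MGet(ctx, "name:a", "name:b", "name:c").Result()
	if err != nil {
		log.Printf("mget: %v", err)
	}
	log.Printf("MGET took %s, got %d values", time.Since(start), len(vals))
}
```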

Current Behavior

With version 9.5.2 (https://github.com/redis/go-redis/releases/tag/v9.5.2) and higher, p99 latency spikes above 4s for pipeline SET (causing timeouts) and above 2s for MGET, while p50 latency remains steady under 2ms and 1ms respectively.
We first noticed the issue after upgrading from 9.2.0 to 9.7.0. We bisect-tested the releases in between and determined that 9.5.2 was the first version with the increased latency.


Steps to Reproduce

  1. Bump go-redis to version 9.5.2 or higher (see the go.mod sketch after this list)
  2. Deploy the service as usual
  3. Observe that p99 latency for pipeline SET and MGET requests jumps immediately
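
For reference, the version bump in step 1 amounts to pinning the dependency in go.mod, for example:

```
// go.mod: pin go-redis to the first affected release
// (module path assumes the v9 major version)
require github.com/redis/go-redis/v9 v9.5.2
```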

Context (Environment)

We use Redis as a simple dedupe layer, acting like an LRU cache: when we see a new name, we write it to Redis, and we check Redis to see whether we have already seen a name.
Peak-hour QPS is about 350k for MGET and 25k for pipeline requests (at most 25 SET commands in each pipeline request), as sketched below.
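
A rough sketch of that dedupe flow (not our actual code; the key prefix, TTL, and batching helper are assumptions for illustration):

```go
package cache

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// maxPipelineSize mirrors the batching described above (at most 25 SETs per pipeline).
const maxPipelineSize = 25

// MarkSeen writes the names in pipelined batches; key prefix and TTL are illustrative.
func MarkSeen(ctx context.Context, rdb *redis.ClusterClient, names []string) error {
	for start := 0; start < len(names); start += maxPipelineSize {
		end := start + maxPipelineSize
		if end > len(names) {
			end = len(names)
		}
		pipe := rdb.Pipeline()
		for _, name := range names[start:end] {
			pipe.Set(ctx, "seen:"+name, "1", 24*time.Hour)
		}
		if _, err := pipe.Exec(ctx); err != nil {
			return err
		}
	}
	return nil
}

// AlreadySeen checks membership with a single MGET; a nil value means the key is missing.
func AlreadySeen(ctx context.Context, rdb *redis.ClusterClient, names []string) ([]bool, error) {
	keys := make([]string, len(names))
	for i, name := range names {
		keys[i] = "seen:" + name
	}
	vals, err := rdb.MGet(ctx, keys...).Result()
	if err != nil {
		return nil, err
	}
	seen := make([]bool, len(vals))
	for i, v := range vals {
		seen[i] = v != nil
	}
	return seen, nil
}
```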

This latency issue seems to affect high-QPS services like this one; we have not heard of other internal teams using the same go-redis version reporting a similar issue.


@tim-zhu-reddit tim-zhu-reddit changed the title Elevated Latency in pipeline SET and MGET calls with version 9.5.2 or higher [BUG] Elevated Latency in pipeline SET and MGET calls with version 9.5.2 or higher Feb 19, 2025
@ndyakov (Collaborator) commented Feb 19, 2025

@tim-zhu-reddit thank you for reporting this. Have any changes to the redis server itself happened while this latency increase was observed? Was the change in latency solely based on the change of go-redis version?

@ndyakov (Collaborator) commented Feb 19, 2025

Also, if you are correct that the bug is present in v9.5.2 but not in v9.5.1, the only change around the pipeline code that I have tracked down and can start investigating is this one: #2961

@tim-zhu-reddit (Author) commented Feb 19, 2025

> Have any changes to the redis server itself happened while this latency increase was observed? Was the change in latency solely based on the change of go-redis version?

@ndyakov thank you for taking a look!
We didn't have any server changes, and since we can reproduce the issue consistently with 9.5.2 or higher while earlier versions consistently work, we think it is due to a change in v9.5.2.

@fishy (Contributor) commented Feb 19, 2025

@ndyakov: we started to notice the issue when we upgraded the client lib from v9.2.0 to v9.7.0.

After noticing the issue, we did a "bisect" on the releases between them and found that v9.5.1 is the last good one and v9.5.2 is the first bad one.

If you think #2961 is the only change that could be relevant in the 9.5.2 release, then we can try testing 9.7.0 with a revert of that change applied (assuming it's still a clean revert) to see if we can reproduce the bug.

@fishy (Contributor) commented Feb 19, 2025

Update: We tested 9.7.0 with #2961 reverted (see https://github.com/fishy/go-redis/commits/38b8f52b4d563cb0c07557c254e4a9b16d25674e/), and verified that version does not have the same issue.

So the bug should be somewhere inside #2961.
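
(For reference, one way to run a service against that revert is to clone the fork at the reverted commit next to the service and point go.mod at it with a replace directive; the path below is illustrative.)

```
// go.mod of the service under test
require github.com/redis/go-redis/v9 v9.7.0

// local clone of the fork, checked out at the reverted commit
replace github.com/redis/go-redis/v9 => ../go-redis
```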

@ndyakov (Collaborator) commented Feb 20, 2025

@fishy: that is interesting. I will look into it, see if I can prepare a benchmark test that catches the performance degradation, and figure out how to proceed. I expect this to take some time; in the meantime I hope the approach with the reverted PR works for you.

By the way, do you have timeouts on the context that you are passing, or are you canceling the context by any chance?
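
For example (illustrative only, not from your code), a per-call deadline on the context would look like this, as opposed to a context that might be canceled from elsewhere:

```go
package cache

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// Illustrative sketch: a per-call deadline on the context passed to Exec.
// A context canceled early from another goroutine would surface differently.
func execWithTimeout(rdb *redis.ClusterClient, keys []string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()

	pipe := rdb.Pipeline()
	for _, k := range keys {
		pipe.Set(ctx, k, "1", time.Hour)
	}
	_, err := pipe.Exec(ctx)
	return err
}
```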

@ndyakov ndyakov added the bug label Feb 20, 2025
@ndyakov ndyakov self-assigned this Feb 20, 2025
@ndyakov (Collaborator) commented Feb 20, 2025

@fishy if you are using cancellation of the context, would you please test against #3190 to see if the issue persists?

@fishy (Contributor) commented Feb 20, 2025

@ndyakov we tested, and #3190 does not fix the issue we are seeing; the latency is still high with that version.

@ndyakov (Collaborator) commented Feb 21, 2025

Understood. @fishy, are you observing anything additional in the node logs or go-redis logs? I assume we should add more logging around the markNodeAsFailing code path in the said PR to observe which flow is introducing the increased latency. Do you think we can work together on a benchmark test that reflects your usage, so we can catch the difference in performance?
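
Something along these lines, perhaps (a rough sketch; the address, key names, and TTL are placeholders and would need to match your setup):

```go
package redisbench

import (
	"context"
	"strconv"
	"testing"
	"time"

	"github.com/redis/go-redis/v9"
)

// BenchmarkPipelineSetAndMGet mirrors the reported workload: pipelines of up
// to 25 SETs followed by an MGET over the same keys, run in parallel.
func BenchmarkPipelineSetAndMGet(b *testing.B) {
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:         []string{"127.0.0.1:7000"}, // placeholder test cluster
		ReadOnly:      true,
		RouteRandomly: true,
	})
	defer rdb.Close()

	ctx := context.Background()
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			keys := make([]string, 25)
			pipe := rdb.Pipeline()
			for j := range keys {
				keys[j] = "bench:" + strconv.Itoa(i) + ":" + strconv.Itoa(j)
				pipe.Set(ctx, keys[j], "1", time.Minute)
			}
			if _, err := pipe.Exec(ctx); err != nil {
				b.Errorf("pipeline exec: %v", err)
				return
			}
			if _, err := rdb.MGet(ctx, keys...).Result(); err != nil {
				b.Errorf("mget: %v", err)
				return
			}
			i++
		}
	})
}
```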

@tim-zhu-reddit (Author) commented Feb 21, 2025

@ndyakov I looked into the server-side metrics for the two tests we ran with the new version yesterday.
It seems the elevated latency may have come from the read requests:

  • There is a change in read traffic and latency, while SET latency remains consistent.
  • Some Redis nodes that used to get 0 requests suddenly had read traffic, while the previously higher-traffic nodes had their request rate drop a bit. It almost looks like the request rate became more evenly distributed across nodes during the high-latency period.

Below is a screenshot of the read and write request rate and latency by Redis node.

In https://pkg.go.dev/github.com/go-redis/redis#ClusterOptions, we set ReadOnly = true and RouteRandomly = true. Could #2961 cause a behaviour change in how read requests are routed?

@ndyakov (Collaborator) commented Mar 5, 2025

Hm, what you are observing is quite strange. It is interesting to see a more equal distribution but higher latency. I am still focused on another feature, but I would like to dig deeper into this issue since we have pinpointed the exact patch that triggers it.

It looks like the package is marking nodes as unhealthy and rerouting requests to other nodes. What is quite strange to me is that during normal periods you don't seem to have an equal distribution of read requests. Maybe there is something in the application code that is reusing a sticky connection? (cc @tim-zhu-reddit, @fishy)

@EXPEbdodla (Contributor) commented

@ndyakov The server-side metrics look similar to the #3190 PR. It may be the case that when an error (maybe a context error) happens in the pipelineReadCmds method, isBadConnection() marks the node as failed. Maybe @fishy or @tim-zhu-reddit can share more information about the client configuration options and the timeouts they set.

@tim-zhu-reddit (Author) commented

@EXPEbdodla thank you for helping, here is our config:

ReadOnly: true
RouteRandomly: true
ReadTimeout: 3s
DialTimeout: 3s
WriteTimeout: 3s

The other https://pkg.go.dev/github.com/go-redis/redis#ClusterOptions fields are unset.
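
In Go terms, the client is constructed roughly like this (the address list is a placeholder; the options are the ones listed above):

```go
package cache

import (
	"time"

	"github.com/redis/go-redis/v9"
)

// NewClient builds the cluster client with the options listed above;
// everything not shown is left at its default.
func NewClient(addrs []string) *redis.ClusterClient {
	return redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:         addrs,
		ReadOnly:      true,
		RouteRandomly: true,
		ReadTimeout:   3 * time.Second,
		DialTimeout:   3 * time.Second,
		WriteTimeout:  3 * time.Second,
	})
}
```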

@tim-zhu-reddit (Author) commented

> Maybe there is something in the application code that is reusing a sticky connection?

@ndyakov, good question. I double-checked our code; we don't have any logic around sticky connections, and we set RouteRandomly = true in the ClusterOptions.
