Our journey switching from twemproxy to Envoy Redis Proxy (and some perf issues) #19436
cc @wbpcode
🤔 It would help us understand this better if you could provide some runtime Envoy metrics (connection changes, some metrics provided by the Redis proxy itself, etc.).
It is recommended to explicitly set the `--concurrency` value.
How do you count the commands?
Envoy will create only one persistent connection per host per worker.
The most detailed documentation I've found on this is from @mattklein123's old article at https://blog.envoyproxy.io/envoy-threading-model-a8d44b922310#b30c.
Tuning
Thanks, I'll tweak this value.
The number of commands shown in the graph is what Redis is reporting itself via Datadog's Redis agent. The docs claim "The number of commands processed by the server", but I'll see if I can get some more granular information on this.
It's possible, but Envoy's Redis Proxy should split the commands in the same manner as twemproxy -- based on the key and using hash tagging. In testing this out, both twemproxy and Envoy's Redis Proxy treated keys similarly; for example, using either proxy, keys sharing the same hash tag were routed to the same shard.
With an explicitly set concurrency value, performance is now substantially faster than it was yesterday and roughly equivalent to what we had with twemproxy. Plus, we are using fewer (and now smaller) Envoy instances, which is a net win. Regarding the number of commands, it looks like the Datadog agent is simply reporting on the output of `INFO commandstats`. Running this directly on one of the Redis instances, I see that Redis reports the number of invocations for each command.
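For reference, an `INFO commandstats` excerpt takes this shape (the numbers below are illustrative, not from this deployment):

```
$ redis-cli INFO commandstats
# Commandstats
cmdstat_get:calls=1843221,usec=903178,usec_per_call=0.49
cmdstat_set:calls=421553,usec=387829,usec_per_call=0.92
cmdstat_mget:calls=98412,usec=154772,usec_per_call=1.57
```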
I'll try swapping back to twemproxy and then again to Envoy and see where the difference lies, as this should shed some light on the difference in reported command invocations and the instantaneous ops/second metric.
Okay, so I collected some stats after switching back to twemproxy and then again running via Envoy. After normalizing for the time difference to produce a constant rate, the main difference is in how each system handles MGET.
twemproxy:
Envoy:
The similarity of twemproxy's GET number to Envoy's MGET number is interesting, and I double-checked that I didn't transpose the stats, so this is correct. One theory is that Redis Proxy in Envoy is passing through single-key MGET commands as MGET, but multi-key MGET commands are converted into a series of single-key GET commands. Maybe someone knows offhand about the differences in how each handles processing of multiple keys? If not, I'll investigate further.
@bkadatz I did a quick check of the Envoy code. Envoy splits an MGET command into multiple GETs, so there should be no MGET commands coming from Envoy. 🤔
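To illustrate what that splitting implies (the keys and shard assignments below are hypothetical):

```
# A single client request such as:
MGET {user1}:name {user2}:name {user3}:name

# is fanned out by the proxy as one GET per key, each routed to the shard
# that owns that (hash-tagged) key, with the replies reassembled into a
# single MGET response for the client:
GET {user1}:name   -> redis-shard-3
GET {user2}:name   -> redis-shard-17
GET {user3}:name   -> redis-shard-0
```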
I double-checked the stats, and since switching back exclusively to Envoy about an hour ago and shutting down twemproxy, there haven't been any new MGET commands reported on the instance I've been checking. I probably messed something up by leaving twemproxy running unnecessarily when retrieving these stats. But as this now explains the command count difference and there aren't any remaining performance issues, things look to be in a very good state. Thanks @wbpcode and @moderation for the helpful information!
Glad to help. I suspect there are a ton of people running Envoy in Kubernetes inheriting the underlying node CPU counts and therefore having way more threads than they require.
Also, you may consider pinning Envoy to a certain CPU core(s) with
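For anyone in a similar Kubernetes setup, a minimal sketch of setting an explicit CPU allocation and worker count for the Envoy container (the image tag, resource values, and concurrency shown are illustrative assumptions, not values from this thread):

```yaml
# Fragment of a Deployment pod spec (illustrative values)
containers:
- name: envoy
  image: envoyproxy/envoy:v1.21.0
  # Set worker threads explicitly instead of inheriting the node's CPU count
  args: ["-c", "/etc/envoy/envoy.yaml", "--concurrency", "2"]
  resources:
    requests:
      cpu: "2"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "2Gi"
```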
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
@bkadatz I've stumbled upon this issue. I don't know if it's still relevant and whether you still have problems running your Redis cluster, but in case it still has hiccups, take a look at DragonflyDB. Given the size of your workload, you could replace your whole cluster with a single Dragonfly instance, and you wouldn't need an Envoy proxy at all since it can replace your whole cluster together with the proxy.
@bkadatz - what concurrency did you set? Let's say we have a VM with 8 cores; what is the recommended value?
Some background: we have 20 Redis instances serving our API plus various other backend jobs (collectively, our "applications"). In front of this, we had 10 twemproxy instances proxying traffic based on a consistent hash of the key. All of this runs within Kubernetes. Our applications access the proxy via a Kubernetes internal service based on a label selector. Our Redis instances are set up as a StatefulSet, with each pod in the set having a custom instance name, from `redis-shard-0` through `redis-shard-19`.

twemproxy configuration
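A minimal sketch of a twemproxy (nutcracker) pool matching this description, assuming the `redis-shard-N` StatefulSet hostnames above (the service suffix, port, and timeout values are illustrative assumptions):

```yaml
redis_pool:
  listen: 0.0.0.0:6380
  redis: true
  hash: fnv1a_64
  distribution: ketama        # consistent hashing across the shards
  timeout: 400                # ms; illustrative
  auto_eject_hosts: false
  servers:
    - redis-shard-0.redis:6379:1
    - redis-shard-1.redis:6379:1
    # ... one entry per shard, through redis-shard-19
```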
Envoy Redis Proxy
My first attempt was to set up a basic Envoy configuration with the Redis Proxy filter and use 10 instances as well, figuring that someone had already worked out an appropriate scale and I might as well use the same. The setup looked like this (using latest, v1.21):
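In outline, a minimal sketch of a Redis Proxy listener in front of a `RING_HASH` cluster (the listener port, timeouts, and endpoint addresses here are illustrative assumptions, not the exact production values):

```yaml
static_resources:
  listeners:
  - name: redis_listener
    address:
      socket_address: { address: 0.0.0.0, port_value: 6380 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.redis_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProxy
          stat_prefix: redis
          settings:
            op_timeout: 1s          # illustrative value
            enable_hashtagging: true
          prefix_routes:
            catch_all_route:
              cluster: redis_cluster
  clusters:
  - name: redis_cluster
    connect_timeout: 1s
    type: STRICT_DNS
    lb_policy: RING_HASH            # maps to twemproxy's ketama distribution
    load_assignment:
      cluster_name: redis_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: redis-shard-0.redis, port_value: 6379 }
        - endpoint:
            address:
              socket_address: { address: redis-shard-1.redis, port_value: 6379 }
        # ... one endpoint per shard, through redis-shard-19
```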
The `ketama` key distribution in twemproxy maps to the `RING_HASH` lb_policy in Envoy, and the other settings are quite similar, though I did increase the timeout somewhat in an attempt to be conservative. I switched this over (changing the internal service's selector from our twemproxy label to the redis-envoy label) during a low-traffic period between Christmas and New Year's, and everything seemed stable.

Problem
When our traffic ramped back up in early January, we started experiencing latency issues. Redis queries that had been taking single-digit milliseconds were now timing out after 1 second. I switched back to twemproxy and researched our configuration a little more.
After a few days, I noticed that the buffering setting isn't enabled by default, so I added the following to our redis proxy settings:
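The buffering knobs live in the Redis Proxy's `settings` block; a sketch of what enabling them might look like (the specific sizes and timeouts shown are illustrative assumptions, not the exact values used):

```yaml
settings:
  op_timeout: 1s
  enable_hashtagging: true
  # Buffering is off by default (size 0); a non-zero size enables batching
  # of requests on upstream connections, flushed at the timeout below.
  max_buffer_size_before_flush: 1024   # bytes; illustrative
  buffer_flush_timeout: 0.003s         # illustrative
```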
This helped a little, but we still saw a spike in latency and I switched back to twemproxy again.
Our p95 latency for a PUT operation on Redis:
On the left you can see it exceeded 1 second, and then at the right you can see there was still a spike even after adding the buffering settings.
What I suspect was the issue here is that the average number of connected clients per Redis instance jumped from 11 (10 twemproxy instances + 1 monitoring) to 161 (10 instances × 16 Envoy worker threads + 1 monitoring), and Envoy experienced additional contention across those 160 outbound threads.
Rather than running 10 low-spec instances that matched twemproxy's settings (1 CPU / 2 GB RAM), I switched to running 3 higher-spec instances with 4 CPUs, but still 2 GB RAM. After this, things stabilized, though there are still a few areas of concern.
Some graphs, questions
twemproxy seems to maintain only a single connection to each Redis instance and proxies all commands through that connection. Envoy maintains 16 connections to each Redis instance.
Even after achieving stability with Envoy by increasing its resources, the number of commands executed on Redis has increased substantially despite the same workload. The left of the graph is on twemproxy, the right is after switching over to Envoy. Why would the same workload result in a multiple of the number of executed commands? Does Redis Proxy open a new connection per command and then issue a QUIT afterwards, whereas twemproxy maintains a persistent connection?
The graph of Redis operations/second largely mirrors the number of commands. Here, too, we see a multiple of what it was under twemproxy.
Our p95 latency is now roughly similar to what it was a week ago when we ran twemproxy, though Envoy is still a few milliseconds slower. I suspect that if I increase the CPU allocation from 4 to 8, this will get us much closer to twemproxy's performance.
Summary and questions
Overall, switching was relatively painless despite a couple of challenges. The differences in running twemproxy vs Envoy added a few surprises, which could be addressed via a migration guide. I'm happy to draft something up -- let me know if that'd be useful.
So, some outstanding questions based on the above: we don't currently pass an explicit `--concurrency` setting. Should we?