
Our journey switching from twemproxy to Envoy Redis Proxy (and some perf issues) #19436

Closed
bkadatz opened this issue Jan 6, 2022 · 16 comments
Labels
area/perf, area/redis, stale

Comments


bkadatz commented Jan 6, 2022

Some background: we have 20 Redis instances serving our API plus various other backend jobs (collectively, our "applications"). In front of this, we had 10 twemproxy instances proxying traffic based on consistent hash of the key. All of this is within Kubernetes. Our applications access the proxy via Kubernetes internal service based on a label selector. Our Redis instances are setup as a statefulset with each pod in the set having a custom instance name, from redis-shard-0 through redis-shard-19.

twemproxy configuration

default:
  listen: 0.0.0.0:6379
  hash: fnv1a_64
  hash_tag: "{}"
  distribution: ketama
  auto_eject_hosts: false
  timeout: 400
  redis: true
  preconnect: true
  servers:
    - redis-shard-0:6379:1 redis-shard-0 
    - redis-shard-1:6379:1 redis-shard-1 
    - redis-shard-2:6379:1 redis-shard-2 
    - redis-shard-3:6379:1 redis-shard-3 
    - redis-shard-4:6379:1 redis-shard-4 
    - redis-shard-5:6379:1 redis-shard-5 
    - redis-shard-6:6379:1 redis-shard-6 
    - redis-shard-7:6379:1 redis-shard-7 
    - redis-shard-8:6379:1 redis-shard-8 
    - redis-shard-9:6379:1 redis-shard-9 
    - redis-shard-10:6379:1 redis-shard-10 
    - redis-shard-11:6379:1 redis-shard-11 
    - redis-shard-12:6379:1 redis-shard-12 
    - redis-shard-13:6379:1 redis-shard-13 
    - redis-shard-14:6379:1 redis-shard-14 
    - redis-shard-15:6379:1 redis-shard-15 
    - redis-shard-16:6379:1 redis-shard-16 
    - redis-shard-17:6379:1 redis-shard-17 
    - redis-shard-18:6379:1 redis-shard-18 
    - redis-shard-19:6379:1 redis-shard-19 

Envoy Redis Proxy

My first attempt was to set up a basic Envoy configuration with the Redis Proxy filter, also using 10 instances, figuring that someone had already worked out an appropriate scale and I might as well use the same. The setup looked like this (using the latest release, v1.21):

admin:
  access_log:
    name: envoy.access_loggers.file
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: /tmp/admin_access.log
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901
node:
  id: redis-envoy
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 6379 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.redis_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProxy
          stat_prefix: egress_redis
          settings:
            op_timeout: 1s
            enable_hashtagging: true
            enable_redirection: false
            enable_command_stats: false
          prefix_routes:
            catch_all_route:
              cluster: redis-cluster
  clusters:
    - name: redis-cluster
      connect_timeout: 0.5s
      type: STRICT_DNS
      dns_lookup_family: V4_ONLY
      lb_policy: RING_HASH
      load_assignment:
        cluster_name: redis
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-0, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-1, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-2, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-3, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-4, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-5, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-6, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-7, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-8, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-9, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-10, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-11, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-12, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-13, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-14, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-15, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-16, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-17, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-18, port_value: 6379 }
              - endpoint:
                  address:
                    socket_address: { address: redis-shard-19, port_value: 6379 }
cluster_manager:
  outlier_detection:
    event_log_path: /dev/stdout
layered_runtime:
  layers:
    - name: admin_layer
      admin_layer: {}

The ketama key distribution in twemproxy maps to the RING_HASH lb_policy in Envoy, and the other settings are quite similar, though I did increase the timeout somewhat to be conservative. I switched this over (changing the internal service's selector from our twemproxy label to the redis-envoy label) during a low-traffic period between Christmas and New Year's, and everything seemed stable.
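
(For completeness, Envoy also exposes ring-hash tuning knobs at the cluster level that twemproxy has no direct equivalent for. I haven't tuned these myself, so the sketch below only reflects my understanding of the defaults rather than anything we actually ran:)

  clusters:
    - name: redis-cluster
      lb_policy: RING_HASH
      # optional ring tuning -- the values below are my understanding of Envoy's defaults
      ring_hash_lb_config:
        minimum_ring_size: 1024   # a larger ring smooths out the key distribution across shards
        hash_function: XX_HASH    # MURMUR_HASH_2 is the other supported hash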

Problem

When our traffic ramped back up in early January, we started experiencing latency issues. Redis queries which were taking single digit milliseconds were now timing out after 1 second. I switched back to twemproxy and researched our configuration a little more.

After a few days, I noticed that buffering isn't enabled by default, so I added the following to our Redis proxy settings:

            max_buffer_size_before_flush: 1024
            buffer_flush_timeout: 0.003s
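
(For context, these two options slot into the same settings block of the redis_proxy filter shown above, i.e.:)

          settings:
            op_timeout: 1s
            enable_hashtagging: true
            enable_redirection: false
            enable_command_stats: false
            max_buffer_size_before_flush: 1024
            buffer_flush_timeout: 0.003s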

This helped a little, but we still saw a spike in latency and I switched back to twemproxy again.

Our p95 latency for a PUT operation on Redis:
[screenshot: p95 latency for Redis PUT operations]
On the left you can see it exceeded 1 second, and then at the right you can see there was still a spike even after adding the buffering settings.

What I suspect was the issue here is that the average number of connected clients per Redis instance jumped from 11 (10 twemproxy + 1 monitoring) to 161 (10 instances x 16 Envoy worker threads + 1 monitoring), and Envoy experienced additional contention across those 160 outbound connections.

Rather than running 10 low-spec instances sized identically to the twemproxy ones (1 CPU / 2 GB RAM), I switched to 3 higher-spec instances with 4 CPUs each, still with 2 GB RAM. After this, things stabilized, though there are still a few areas of concern.

Some graphs, questions

[screenshot: connections per Redis instance]
twemproxy seems to maintain only a single connection to each Redis instance and proxies all commands through that connection. Envoy maintains 16 connections to each Redis instance.

[screenshot: commands processed per Redis instance]
Even after achieving stability with Envoy by increasing its resources, the number of commands executed on Redis has increased substantially despite the same workload. The left of the graph is on twemproxy; the right is after switching over to Envoy. Why would the same workload result in a multiple of the number of executed commands? Does Redis Proxy open a new connection per command and then issue a QUIT afterwards, whereas twemproxy maintains a persistent connection?

[screenshot: Redis operations per second]
The graph of Redis operations/second largely mirrors the number of commands. Here, too, we see a multiple of what it was before the switch from twemproxy.

[screenshot: p95 latency after increasing Envoy resources]
Our p95 latency is now roughly similar to what it was a week ago when we ran twemproxy, though Envoy is still a few milliseconds slower. I suspect that increasing the CPU allocation from 4 to 8 will get us much closer to twemproxy's performance.

Summary and questions

Overall, switching was relatively painless despite a couple of challenges. The differences between running twemproxy and running Envoy brought a few surprises, which could be addressed via a migration guide. I'm happy to draft something up -- let me know if that'd be useful.

So some outstanding questions based on the above:

  1. Is there anything in the configuration which is either incorrect or could be improved?
  2. Does Envoy have a specific recommendation in terms of CPU sizing? Note that we don't specify a --concurrency setting. Should we?
  3. Why are we seeing a tripling of the number of commands despite a constant workload?
  4. Anything else you might suggest to improve overall performance?
@bkadatz added the triage label Jan 6, 2022

@alyssawilk added the area/perf and area/redis labels and removed the triage label Jan 10, 2022
@alyssawilk (Contributor) commented:

cc @wbpcode


wbpcode commented Jan 10, 2022

🤔 It would help us understand this better if you could provide some runtime Envoy metrics (connection changes, some metrics provided by the Redis proxy itself, etc.).
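
For example, I believe the Redis proxy's own per-command stats can be switched on in the filter settings you already have (a sketch based on your config above; I haven't checked what stats overhead this adds in your setup):

          settings:
            op_timeout: 1s
            enable_hashtagging: true
            enable_redirection: false
            # flipping this on should emit per-command counters and latency
            # stats under the egress_redis stat prefix
            enable_command_stats: true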

Note that we don't specify a --concurrency setting. Should we?

It is recommended to set --concurrency based on the CPU resources allocated to Envoy.

Why would the same workload result in a multiple of the number of executed commands?

How do you count the commands?
I am not familiar with Redis Proxy, but could it be that Envoy split a set of Redis commands?

Does Redis Proxy open a new connection per command and then issue a QUIT afterwards whereas twemproxy maintains a persistent connection?

Envoy will create only one persistent connection per host per worker.

@moderation (Contributor) commented:

It is recommended to set --concurrency based on the CPU resources allocated to Envoy.

The most detailed documentation I've found on this is from @mattklein123 's old article at https://blog.envoyproxy.io/envoy-threading-model-a8d44b922310#b30c.

One major takeaway however is that from a memory and connection pool efficiency standpoint, it is actually quite important to tune the --concurrency option. Having more workers than is needed will waste memory, create more idle connections, and lead to a lower connection pool hit rate. At Lyft our sidecar Envoys run with very low concurrency so that the performance roughly matches the services they are sitting next to. We only run our edge Envoys at max concurrency.

Tuning --concurrency is particularly important in Kubernetes environments running on VMs. For example, if your Envoy container is running on a VM where the underlying node has 64 cores, the concurrency level will be detected as 64. With Redis being single-threaded, I suspect tuning this value will be important.
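
As a rough illustration only (a hypothetical Deployment fragment, not something taken from this issue), matching --concurrency to the container's CPU limit would look something like:

      containers:
        - name: envoy
          image: envoyproxy/envoy:v1.21.0   # or whichever tag you run
          args:
            - -c
            - /etc/envoy/envoy.yaml
            - --concurrency
            - "4"                # keep in line with the CPU limit below
          resources:
            requests:
              cpu: "4"
              memory: 2Gi
            limits:
              cpu: "4"
              memory: 2Gi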


bkadatz commented Jan 11, 2022

It is recommended to set --concurrency based on the CPU resources allocated to Envoy.

For example if your Envoy container is running on a VM where the underlying node has 64 cores, the concurrency level will be detected as 64.

Thanks, I'll tweak this value.

Why would the same workload result in a multiple of the number of executed commands?

How do you count the commands?

The number of commands shown in the graph is what Redis reports about itself via Datadog's Redis agent. The docs describe it as "The number of commands processed by the server", but I'll see if I can get some more granular information on this.

I am not familiar with Redis Proxy, but could it be that Envoy split a set of Redis commands?

It's possible, but Envoy's Redis Proxy should split the commands in the same manner as twemproxy -- based on the key and using hash tagging.

In testing this out, both twemproxy and Envoy's Redis Proxy treated keys similarly. For example, using either proxy, both {foo} and {foo}:bar routed to the same Redis instance because the key is reduced down to just foo for hashing. And even with a different distribution of which servers execute which commands, the overall number of commands should be relatively stable.


bkadatz commented Jan 11, 2022

With the concurrency setting set explicitly, performance is now substantially faster than it was yesterday and roughly equivalent to what we had with twemproxy:
[screenshot: p95 latency with --concurrency set explicitly]

Plus, we are using fewer (and now smaller) Envoy instances which is a net win.

Regarding the number of commands, it looks like the Datadog agent is simply reporting on the output of the INFO ALL command: https://redis.io/commands/INFO

Running this directly on one of the Redis instances, I see that Redis reports the number of invocations for each command, e.g.:

# Commandstats
cmdstat_eval:calls=1099,usec=61889,usec_per_call=56.31,rejected_calls=0,failed_calls=0
cmdstat_psetex:calls=1160167,usec=15794727,usec_per_call=13.61,rejected_calls=0,failed_calls=0
cmdstat_config:calls=767282,usec=22697494,usec_per_call=29.58,rejected_calls=0,failed_calls=0
cmdstat_flushall:calls=11,usec=2006,usec_per_call=182.36,rejected_calls=0,failed_calls=0
cmdstat_incrby:calls=1402722,usec=10928190,usec_per_call=7.79,rejected_calls=0,failed_calls=0
cmdstat_incr:calls=697707376,usec=3734666296,usec_per_call=5.35,rejected_calls=0,failed_calls=0
cmdstat_expire:calls=1395414752,usec=3971559428,usec_per_call=2.85,rejected_calls=0,failed_calls=0
cmdstat_mget:calls=6226308834,usec=144174591126,usec_per_call=23.16,rejected_calls=0,failed_calls=0
cmdstat_spop:calls=74947405,usec=767634533,usec_per_call=10.24,rejected_calls=0,failed_calls=0
cmdstat_getset:calls=29403,usec=292561,usec_per_call=9.95,rejected_calls=0,failed_calls=0
cmdstat_flushdb:calls=2,usec=85,usec_per_call=42.50,rejected_calls=0,failed_calls=0
cmdstat_get:calls=2814677984,usec=20185931257,usec_per_call=7.17,rejected_calls=0,failed_calls=0
cmdstat_evalsha:calls=717108285,usec=58895691062,usec_per_call=82.13,rejected_calls=0,failed_calls=4
cmdstat_del:calls=12336429,usec=50549258,usec_per_call=4.10,rejected_calls=0,failed_calls=0
cmdstat_exists:calls=10149881,usec=35472128,usec_per_call=3.49,rejected_calls=0,failed_calls=0
cmdstat_info:calls=383643,usec=58784693,usec_per_call=153.23,rejected_calls=0,failed_calls=0
cmdstat_slowlog:calls=383641,usec=511825221,usec_per_call=1334.13,rejected_calls=0,failed_calls=0
cmdstat_set:calls=732974848,usec=6283674386,usec_per_call=8.57,rejected_calls=0,failed_calls=0
cmdstat_sadd:calls=37489181,usec=470433777,usec_per_call=12.55,rejected_calls=0,failed_calls=0

I'll try swapping back to twemproxy and then again to Envoy and see where the difference lies, as this should shed some light on the difference in reported command invocations and the instantaneous ops/second metric.


bkadatz commented Jan 11, 2022

Okay, so I collected some stats after switching back to twemproxy and then again running via Envoy. After normalizing for the time difference to produce a constant rate, the main difference is in how each system handles GET vs MGET.

twemproxy:

  • MGET: 41,606 commands issued
  • GET: 13,628 commands issued.

Envoy:

  • MGET: 13,618 commands issued
  • GET: 74,711 commands issued.

The similarity of twemproxy's GET number to Envoy's MGET number is interesting, and I double-checked that I didn't transpose the stats, so this is correct. One theory is that Envoy's Redis Proxy passes single-key MGET commands through as MGET but converts multi-key MGET commands into a series of single-key GET commands.

Does anyone know offhand how each proxy handles processing of multiple keys? If not, I'll investigate further.


wbpcode commented Jan 11, 2022

@bkadatz I did a quick check of the Envoy code. Envoy splits an MGET command into multiple GETs, so there should be no MGET commands coming from Envoy. 🤔

/**
 * MGETRequest takes each key from the command and sends a GET for each to the appropriate Redis
 * server. The response contains the result from each command.
 */


bkadatz commented Jan 11, 2022

I double-checked the stats, and since switching back exclusively to Envoy about an hour ago and shutting down twemproxy, there haven't been any new MGET commands reported on the instance I've been checking. I probably skewed the earlier numbers by leaving twemproxy running unnecessarily when I retrieved those stats.

But as this now explains the command count difference and there aren't any remaining performance issues, it looks like things are in a very good state.

Thanks @wbpcode and @moderation for the helpful information!

@moderation (Contributor) commented:

Glad to help. I suspect there are a ton of people running Envoy in Kubernetes who inherit the underlying node's CPU count and therefore run way more threads than they need.


rojkov commented Jan 11, 2022

Also, you may consider pinning Envoy to specific CPU core(s) with taskset to reduce cache thrashing a bit. In Kubernetes, with kubelets configured to use the static CPU management policy, this should happen automatically.
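
Roughly speaking, for the static policy to hand the Envoy container exclusive cores, the pod has to be in the Guaranteed QoS class with whole-number CPU requests; a hypothetical fragment (not specific to this deployment):

      containers:
        - name: envoy
          resources:
            requests:
              cpu: "4"          # integer CPU, equal to the limit, so the pod is Guaranteed
              memory: 2Gi       # and the static CPU manager can assign exclusive cores
            limits:
              cpu: "4"
              memory: 2Gi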

@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label Feb 10, 2022
@github-actions

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.


romange commented Jan 1, 2023

@bkadatz I've stumbled upon this issue. I don't know whether it's still relevant or whether you still have problems running your Redis cluster, but if it still has hiccups, take a look at DragonflyDB. Given the size of your workload, a single Dragonfly instance could replace your whole cluster together with the proxy, so you wouldn't need an Envoy proxy at all.

@javedsha commented:

@bkadatz - what concurrency did you set? Let's say we have a VM with 8 cores; what is the recommended value?


bkadatz commented Jan 19, 2024

@javedsha see the advice here
