
Per listener and per cluster memory overhead is too high #4196

Closed
PiotrSikora opened this issue Aug 18, 2018 · 63 comments · Fixed by #5321, #6299 or #7882
Labels
area/perf, no stalebot (disables stalebot from closing an issue)

Comments

@PiotrSikora
Contributor

PiotrSikora commented Aug 18, 2018

As a result of istio/istio#7912, I've been running a bunch of experiments, and it looks like the per listener and per cluster memory overhead is a bit too high...

(numbers collected with --concurrency 1 --disable-hot-restart, and include the improvements from #4190)

There is ~18kB of memory allocated per TCP listener (no TLS), ~6kB (~35%) of which is allocated for 5 listener stats (assuming the same stat_prefix is used for all TCP stats). There is also ~1450 bytes of overhead per thread/worker for each TCP listener, ~900 bytes of which comes from stats.

There is ~63kB of memory allocated per HTTP listener (1 route, no TLS), ~18kB (~30%) of which is allocated for 13 listener stats (assuming the same stat_prefix is used for all HTTP stats). There is also ~1350 bytes of overhead per thread/worker for each HTTP listener, ~900 bytes of which comes from stats.

There is ~72kB of memory allocated per static cluster (no TLS, no health check), ~58kB (~80%) of which is allocated for 72 stats. There is also ~1850 bytes of overhead per thread/worker for each static cluster.

That means that for each listener and cluster pair, there is ~90kB (TCP) or ~135kB (HTTP) of memory allocated, the majority of which is allocated for stats (roughly ~800 bytes per counter/gauge and ~3000 bytes per histogram). Those numbers get even worse when we add TLS to the mix, since it adds 8 more stats to each.

I've noticed that @jmarantz and @ambuc are already working on optimizing stats memory usage in #3585, so I won't duplicate their work. That leaves a few more drastic options that could potentially yield much bigger savings:

  1. add an option to disable per listener & per cluster stats, either in config or by omitting them completely at build time with #ifdef (a minimal sketch of the build-time variant follows this list); this is pretty trivial and a huge win, assuming those stats are not consumed,
  2. delay cluster initialization until first use, though this might get tricky with EDS and HDS,
  3. request CDS from the control plane on first use (but that probably wouldn't yield any improvements over 2, would require much bigger changes, and doesn't fit well into the eventual-consistency mantra).
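
To illustrate the build-time variant of (1), here is a minimal, purely hypothetical sketch - ENVOY_DISABLE_CLUSTER_STATS is a made-up flag, and this is not Envoy's actual per-cluster stats code:

// Hypothetical sketch only; not Envoy's real stats implementation.
#include <cstdint>

struct ClusterStats {
#ifndef ENVOY_DISABLE_CLUSTER_STATS
  // With stats enabled, every cluster carries its own counters/gauges/histograms.
  uint64_t upstream_rq_total_{0};
  uint64_t upstream_cx_total_{0};
  // ... dozens more stats per cluster
#endif
};

inline void recordRequest(ClusterStats& stats) {
#ifndef ENVOY_DISABLE_CLUSTER_STATS
  ++stats.upstream_rq_total_;
#else
  (void)stats;  // compiled out: no per-cluster stat memory, no update cost
#endif
}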

Thoughts?

cc @htuch @louiscryan @costinm @lizan

@htuch
Member

htuch commented Aug 20, 2018

@PiotrSikora thanks for the writeup and data, this is a good start to understanding the problem. I'd address the solution differently for clusters/listeners.

For clusters, we already have a plan-of-record to do (3) in the long term; this has been formalized in the incremental xDS API additions (IncrementalDiscoveryRequest and IncrementalDiscoveryResponse) that @nt and I worked on. This has not been implemented yet, but there is no barrier around eventual consistency, I think.

I think (3) is what we want to shoot for as a lasting solution that will scale to very high numbers of clusters (e.g. 100k); (2) would only be a stopgap and would not support some applications, like serverless, that inherently need on-demand xDS. We have internal demand from teams that want this.

I realize that the current goal is to reduce overheads when we are talking about 1-10k clusters. In that case, I think we should just fix stats for now and no other solution is needed. We also have the problem that the LB data structures (depending on LB type) are O(num clusters * num workers * num endpoints) in memory use, which also needs some work.

For listeners, how many listeners that actually bind to ports do we require? I'd like to understand how the listener count climbs above O(10) in the application; do we really have O(1k) listeners? Again, the stats improvement should help here, but it's not a silver bullet: where is the rest of the memory coming from?

@PiotrSikora
Contributor Author

PiotrSikora commented Aug 21, 2018

I think (3) is what we want to shoot for as a lasting solution that will scale to very high numbers of clusters (e.g. 100k); (2) would only be a stopgap and would not support some applications, like serverless, that inherently need on-demand xDS. We have internal demand from teams that want this.

Agreed, except that (3) is probably months away, and (1) and possibly (2) could be done in a few days.

For listeners, how many listeners that actually bind to ports do we require? I'd like to understand how the listener count climbs above O(10) in the application; do we really have O(1k) listeners? Again, the stats improvement should help here, but it's not a silver bullet: where is the rest of the memory coming from?

Istio is using iptables interception and use_original_dst, so we have 1 virtual listener for each TCP destination, but all HTTP destinations are sent to a wildcard listener where they are routed based on the Host header, so we have only 1 of those for each destination port. Because of that, the listeners are not a huge problem for us (at least not yet), but the per cluster overhead is blocking deployment for our customers. Disabling per cluster stats (which we don't use anyway) lowers the overhead from ~72kB to ~14kB (so by ~80%), which is a huge win, so I'd like to pursue this option, since it's the quickest thing to implement.

I realize that the current goal is to reduce overheads when we are talking about 1-10k clusters. In that case, I think we should just fix stats for now and no other solution is needed.

When you say "just fix stats", do you have anything in particular in mind? Even if we optimize strings (each fully evaluated stat name currently resides 3x in memory) with something like #3927, the maximum savings for the service_google_xxxx cluster name used in my experiments are around ~12kB (~17%), compared to ~58kB (~80%) with stats disabled. The difference is quite significant, and I think that disabling stats would be the preferable solution for Istio until we have incremental xDS.

Alternatively, we could delay generateStats() until first access to stats(), which would save the memory for configured but unused clusters.

@jmarantz
Contributor

jmarantz commented Aug 21, 2018 via email

@htuch
Member

htuch commented Aug 21, 2018

@PiotrSikora (3) should not take months to implement on the Envoy side; the implementation is O(1-2 weeks) IMHO. Support on the management side is probably where the bottleneck is.

I think we're in agreement that the biggest benefit will come from reducing cluster stats. I still think that fundamentally, given what we're tracking in both key/value terms, the essential memory overhead is more like O(1kB), so we should be able to fix it.

@PiotrSikora
Contributor Author

@htuch it seems that the string optimizations didn't help as much as we hoped, and even with #4281, we're still looking at ~69kB overhead for each static cluster, which is way too high.

Could you please reconsider disabling stats (since we don't have any use for them anyway)? This could be achieved by adding either a no-op or a lazily loaded Stats::Store.

cc @ambuc

@htuch
Member

htuch commented Sep 6, 2018

@PiotrSikora lazy loading Stats::Store seems reasonable if you can do it with low complexity.

@ambuc
Contributor

ambuc commented Sep 6, 2018

I think disabling stats via a flag / build parameter / etc. is the most expedient solution to Istio's problem, and is much easier than incremental xDS / lazy-loaded stats / whatever. There are probably lots of users who don't care about stats, and until we have the right solution, this would work best.

@PiotrSikora
Contributor Author

@ambuc yes, disabling per cluster (or all?) stats at compile time via flags would be by far the fastest thing to implement (and is in fact what I did for the measurements). @htuch, would such a solution be acceptable to you?

@htuch
Member

htuch commented Sep 6, 2018

@PiotrSikora doesn't Istio make use of cluster stats? It's basically impossible to diagnose what goes wrong on Envoy if you don't have cluster level stats. I think almost every user cares about these stats, even if they don't know it.

@jmarantz
Contributor

jmarantz commented Sep 6, 2018

I thought a fair number of Envoy users use out-of-the-box binaries. Is it hard to do via a flag?

Also, is the goal to disable stats entirely? Or just to make them non-cluster-specific? Or both?

@mrice32
Member

mrice32 commented Sep 6, 2018

I think a flag to disable all stats (or just cluster stats) and a targeted lazy load (lazy loading as a rule for all stats would take a decent amount of work and might add some real overhead to the request path) would both be relatively easy. I think there's already a flag to disable/enable some subset of the stats today, but I can't remember which component.

@htuch I could see a use case where some users might care enough about memory consumption to only turn on stats on a small subset of Envoy instances to get just enough info to create an approximate sample of fleet-wide stats, but I'm not very aware of the typical Envoy deployment today.

@jmarantz
Contributor

jmarantz commented Sep 6, 2018

I don't fully understand the lazy-loading solution; that may be a great way to go. My first thought on how to disable stats is that I'd make a NullStatAllocator, which would provide methods to generate NullCounter, NullGauge, and NullHistogram objects that have all the methods but do none of the work and take minimal space, for some definition of 'minimal'.

Maybe truncating all stat names to zero characters would make the maps small and fold lots of stats together.

The advantage here is that you could implement this with one new class, and a few injection points during startup. There would be zero extra runtime overhead.
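
Roughly, as a sketch (the interface below is invented for illustration and is not Envoy's actual Stats::Counter API):

#include <cstdint>
#include <string>

// Simplified stand-in for a counter interface.
class CounterLike {
public:
  virtual ~CounterLike() = default;
  virtual void inc() = 0;
  virtual void add(uint64_t amount) = 0;
  virtual uint64_t value() const = 0;
  virtual std::string name() const = 0;
};

// Every method is a no-op, so one shared NullCounter instance can be handed
// out for every stat name, shrinking per-stat memory to almost nothing.
class NullCounter : public CounterLike {
public:
  void inc() override {}
  void add(uint64_t) override {}
  uint64_t value() const override { return 0; }
  std::string name() const override { return ""; }
};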

@htuch
Member

htuch commented Sep 6, 2018

After some offline chat with @PiotrSikora, it seems that Istio has in practice not had to rely heavily on /stats due to the availability of a number of other diagnosis mechanisms (e.g. Mixer, access logs). A CLI flag to disable (default enable) seems a reasonable option for now. It doesn't fix the underlying problem - users shouldn't have to choose between stats and scalability - but it unblocks Istio and other folks who need scalability right now.

@mrice32
Member

mrice32 commented Sep 6, 2018

@jmarantz That's basically my thought as well. I think we could just create a single NullCounter, NullGauge, etc. Then, the disabled store could give back that one instance to all calls to counter(name), gauge(name), and histogram(name). Assuming these stats are never read back as a part of Envoy's core functionality, we should be able to make all of the increment/set calls into no-ops (maybe a bad assumption...).

I think the lazy loading for clusters could be done as @PiotrSikora suggests above - we could just change when generateStats() gets called (and load the result into a unique_ptr). There is one issue with lazy loading, however: we'll need to put a lock around the initialization check to make it thread-safe, which would add a short-duration lock to the request path.
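
A sketch of that lazy-initialization shape (ClusterStats and LazyClusterStats are hypothetical names here, not Envoy's actual classes):

#include <memory>
#include <mutex>

struct ClusterStats {
  // per-cluster counters, gauges, and histograms would live here
};

class LazyClusterStats {
public:
  // The first caller pays the allocation cost; every later caller still
  // touches the once_flag, which is the small request-path cost noted above.
  ClusterStats& stats() {
    std::call_once(init_, [this] { stats_ = std::make_unique<ClusterStats>(); });
    return *stats_;
  }

private:
  std::once_flag init_;
  std::unique_ptr<ClusterStats> stats_;
};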

@PiotrSikora
Contributor Author

@jmarantz wrt lazy loading (i.e. allocating on first use) - Istio currently loads routes to all the services in the cluster, which means that even though a particular service might be talking to only a service or two, it has routes and accompanying configuration for all the services, whose stats are allocated but never used, so the savings would come from the fact that Envoy is configured with a bunch of clusters that it never uses.

@ramaraochavali
Contributor

FWIW, we have debated this internally as well, and we have a similar situation to Istio, where our Envoys have all the clusters irrespective of whether a service uses a particular cluster or not. One of the options we discussed was to disable the management-update-related metrics that we really do not care about, like the ones below. That might give us at least a 30% reduction while still keeping the latency and other cluster-related metrics for those who want them, instead of completely disabling all cluster metrics. Something to consider...
membership_change
update_attempt
update_success
update_empty
update_no_rebuild

@ggreenway
Contributor

Possible solution: what if, in the config, you set the alt_stat_name to the same string for every cluster? Wouldn't that result in all clusters sharing a common set of stats?

@mattklein123
Member

Catching up here. In general, I think allowing people to disable stats they don't want in order to save memory is acceptable. I would prefer a solution, though, that is maybe a bit more flexible (even if it might be slower): something along the lines of allowing people to supply a list of stats they either want or don't want, possibly via regex? People that don't want any stats can just exclude them all. This would incur a penalty at stat creation, much like tagging, but I think it could be done pretty efficiently otherwise, modulo the fact that there would still need to be maps keeping track of stat names that point to some NullStat interface that goes to /dev/null. Thoughts?
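
To make the matching concrete, here is roughly what the creation-time check could look like (made-up helper, not actual Envoy code):

#include <regex>
#include <string>
#include <vector>

// Returns true if the stat name matches any exclusion pattern; the caller
// would then hand back a shared null stat instead of allocating a real one.
bool statExcluded(const std::string& name, const std::vector<std::regex>& exclusions) {
  for (const auto& re : exclusions) {
    if (std::regex_search(name, re)) {
      return true;
    }
  }
  return false;
}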

@jmarantz
Contributor

All of that sounds fine to me. Here's a spectrum of options:

  • Aggressively re-work stat naming so they are not so expensive. E.g. there can be maps, but why do we need a new copy of the name in each map? I'm not at all convinced we've seen the limit of how compact we can get this.
  • Use a regex to specify which stats you want to retain, replacing them with Null stats.
  • Optionally share stats between clusters
  • Drop all stats

@jmarantz
Contributor

@PiotrSikora is there an easy way to repro the results you got with a static command line?

@ambuc
Contributor

ambuc commented Sep 19, 2018

@ all, there is a WIP PR at #4469 which implements a regex-based exclusion list. This feature would be opt-in, so I feel better about adding a potential performance hit. This PR is not in lieu of a disable-all-stats feature.

@mattklein123
Member

Aggressively re-work stat naming so they are not so expensive. E.g. there can be maps, but why do we need a new copy of the name in each map? I'm not at all convinced we've seen the limit of how compact we can get this.

FWIW, I think this is intellectually interesting, and if people want to work on it I'm clearly not going to object, but I think the effort is better spent on the other approaches, as well as (IMO) on better educating people on how to use Envoy properly (i.e., let's not just send 150K clusters to an Envoy...).

@jmarantz
Contributor

@mattklein123 that makes sense...especially as there are relatively straightforward approaches that will yield immediate gains.

I just did a quick survey, and I think most or all of the maps I was worried about are in admin.cc with brief lifetimes, although they might still affect things in practice. And actually, fixing that would be a very small, local PR.

The question of whether an Envoy should ever scale to 100k clusters is an interesting one, and worth getting a definitive answer on before embarking on scaling to that. But it seems that even 10k clusters presents scaling challenges.

@mattklein123
Member

But it seems that even 10k clusters presents scaling challenges.

IMO even 10K clusters is on the outer edge of what we would expect, at least without moving to incremental xDS with cluster TTLs, but I would be curious what people think about this. In any case, I think we are all in agreement that starting with stat pruning, to get rid of stats people don't care about, is probably the way to go? I'm guessing this will save a huge amount of memory; I do worry about the perf impact, but that is easy to measure.

@PiotrSikora
Contributor Author

PiotrSikora commented Sep 21, 2018

@jmarantz

Generate test files:

$ cat /tmp/generate-config.sh
echo "admin:
  access_log_path: /dev/null
  address:
    socket_address: { address: 127.0.0.1, port_value: 9901 }

static_resources:
  listeners:
  - name: listener
    address:
      socket_address: { address: 0.0.0.0, port_value: 40000 }
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: [\"*\"]
              routes:
              - match: { prefix: / }
                route: { cluster: service_0000 }
          http_filters:
          - name: envoy.router
  clusters:"

for i in `seq -f "%04g" 0 $(($1-1))`; do
echo "
  - name: service_$i
    connect_timeout: 0.25s
    type: STATIC
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    hosts: [{ socket_address: { address: 127.0.0.1, port_value: 60000 }}]"
done
$ sh /tmp/generate-config.sh 1 > /tmp/1_cluster.yaml
$ sh /tmp/generate-config.sh 1001 > /tmp/1001_clusters.yaml

Start Envoy and collect memory stats:

$ cat /tmp/test-overhead.sh
bazel-bin/source/exe/envoy-static --concurrency 1 --disable-hot-restart -c /tmp/1_cluster.yaml >/dev/null 2>&1 &
sleep 30
M1=`curl -s 127.0.0.1:9901/memory | grep allocated | cut -d\" -f4`
curl -so /dev/null -XPOST 127.0.0.1:9901/quitquitquit
sleep 3

bazel-bin/source/exe/envoy-static --concurrency 1 --disable-hot-restart -c /tmp/1001_clusters.yaml >/dev/null 2>&1 &
sleep 30
M1001=`curl -s 127.0.0.1:9901/memory | grep allocated | cut -d\" -f4`
curl -so /dev/null -XPOST 127.0.0.1:9901/quitquitquit
sleep 3

bazel-bin/source/exe/envoy-static --concurrency 11 --disable-hot-restart -c /tmp/1001_clusters.yaml >/dev/null 2>&1 &
sleep 30
T1001=`curl -s 127.0.0.1:9901/memory | grep allocated | cut -d\" -f4`
curl -so /dev/null -XPOST 127.0.0.1:9901/quitquitquit
sleep 3
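
# Per-cluster overhead: (M1001-M1)/1000, averaged over the 1000 extra clusters.
# Per-cluster, per-extra-worker overhead: (T1001-M1001)/10000 (10 extra workers x ~1000 clusters).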

echo "Overhead per cluster: $(((M1001-M1)/1000)) + $(((T1001-M1001)/10000)) for each extra worker thread"
$ sh /tmp/test-overhead.sh
Overhead per cluster: 67439 + 1833 for each extra worker thread

@stale stale bot removed the "stale" label on May 24, 2019
@stale

stale bot commented Jun 23, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the "stale" label on Jun 23, 2019
@jmarantz
Contributor

/wait

@stale

stale bot commented Jul 23, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the "stale" label on Jul 23, 2019
@jmarantz
Contributor

/wait

hopefully not too much longer :)

@stale

stale bot commented Sep 20, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the "stale" label on Sep 20, 2019
@stale

stale bot commented Sep 28, 2019

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

@stale stale bot closed this as completed on Sep 28, 2019
@jmarantz jmarantz reopened this on Sep 28, 2019
@stale stale bot removed the "stale" label on Sep 28, 2019
@jmarantz jmarantz added the "no stalebot" label on Sep 28, 2019
@mattklein123
Member

I'm going to close this, as we have made a lot of progress here thanks to @jmarantz. Let's open new issues for specific enhancements.
