Add load limit presets for larger container sizes #1857

Closed
joshdover opened this issue Sep 13, 2022 · 28 comments

@joshdover

We need to be able to support larger container sizes in Cloud for Fleet Server and have it scale appropriately to maximize its resource utilization. This allows more agents to connect to each instance, which will help improve the effectiveness of caching in Fleet Server.

We currently store configuration presets in https://github.com/elastic/fleet-server/tree/main/internal/pkg/config/defaults which max out at 30k agents on a 32gb container. I think we need to make a few adjustments:

  • The limits for the 32gb container are likely too low as we're seeing in some early test results (see private issue linked below) that Fleet Server utilization is not being maxed out on 16gb or 32gb containers with the max agents configured.
  • We need to be able to support up to 64gb container sizes and have appropriate presets for these.

One thing that is blocking this is #1841, which prevents us from overriding the presets, so we can't experiment with the parameters to determine what the new presets should be.
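For reference, each preset file in that directory is a small YAML document keyed to an agent-count range. A rough sketch of the shape of such a file is below; the key names follow the settings quoted later in this thread, but the values are illustrative placeholders rather than the shipped defaults:

num_agents: 30000                # upper bound of the agent range this preset covers (illustrative)
recommended_min_ram: 32768       # MB, matching the sizing table quoted further down
cache_limits:
  num_counters: 384000           # number of cache entries tracked
  max_cost: 251658240            # cache budget in bytes (~240 MB)
server_limits:
  policy_throttle: 5ms
  max_connections: 32000
  checkin_limit:
    interval: 500us
    burst: 4000
    max: 25001
  ack_limit:
    interval: 500us
    burst: 4000
    max: 8000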

@scunningham

> The limits for the 32gb container are likely too low as we're seeing in some early test results (see private issue linked below) that Fleet Server utilization is not being maxed out on 16gb or 32gb containers with the max agents configured.

Because the Fleet Server is currently sharing the container with APM and other processes (filebeat, etc.), particular attention must be paid to the peak running capacity of all processes. Fleet Server cannot be allowed to starve out APM, for example, particularly with regard to RAM.

@pierrehilbert pierrehilbert added the Team:Elastic-Agent label Oct 25, 2022
@jlind23 jlind23 added the Team:Fleet label and removed the Team:Elastic-Agent label Oct 27, 2022
@jlind23

jlind23 commented Oct 27, 2022

After discussing with @joshdover: we should take a look at the scale testing we did, and we will probably end up with the following additional preset limits:

  • New file: gt25000_limits
  • New file: gt50000_limits
  • New file: gt75000_limits
  • Updated file: max_limits needs to be scaled up in order to support 100k Elastic Agents.

@juliaElastic juliaElastic self-assigned this Oct 28, 2022
@juliaElastic

juliaElastic commented Oct 28, 2022

As discussed in a debugging session, there should be a file for 100k as well.
After changing the limits, make generate should be run.

@joshdover It is not clear from the description what we want to set as the new limits.
Currently we have:

| recommended_min_ram (MB) | num_agents |
| --- | --- |
| 0 | 0-50 |
| 1024 | 50-5000 |
| 2048 | 5000-7500 |
| 4096 | 7500-10000 |
| 8192 | 10000-12500 |
| 16384 | 12500-30000 |
| 32768 | 30000+ |

Is there any other setting we want to change other than the RAM and number of agents?

@joshdover

> As discussed in a debugging session, there should be a file for 100k as well.

This should be covered by the max_limits file.

> @joshdover It is not clear from the description what we want to set as the new limits.

Yes, we need to define this clearly. @ablnk has been using some of the settings recommended over in https://github.com/elastic/observability-perf/issues/208#issuecomment-1263675432. Originally these were intended for larger instance sizes, but I think we've been using them on smaller sizes too. I think we still need a product decision about how many resources we need to make available for APM Server vs. Fleet Server and how to make this trade-off clear to the user in the documentation.

@ablnk can you share exactly which parameters you are using, your Integrations Server sizing, and what scale you're testing with?

@ablnk

ablnk commented Oct 31, 2022

@joshdover this can be checked on the configs tab in the spreadsheet.
Though I used the settings for 50k from that comment, I applied them for 100k as well, since a 30gb Integrations Server is enough to have everything working as expected. For 50k & 75k I used a 24gb Integrations Server.

@juliaElastic

Blocked while waiting for testing: #2043 (comment)

@jlind23

jlind23 commented Nov 23, 2022

@pjbertels @joshdover what is left here that prevents us from moving this forward?

@joshdover

joshdover commented Dec 6, 2022

Several weeks ago we discussed running a systematic series of tests to determine what these limits should be at each container size. This is an attempt to codify that for discussion so we can start executing these tests.

Goal

Fleet Server's "max agents" setting and the corresponding sizing documentation have a hard requirement of satisfying the following:

  1. Fleet Server never crashes due to an OOM for the given container size
  2. Some minimal amount of APM traces can also be accepted by this deployment without crashing
    • For a minimum, I suggest enabling tracing of Fleet Server itself with 5% sample rate

These settings should also aim to optimize the following parameters:

  1. Minimal time required to roll out a policy change and have all agents ack it
  2. Maximal number of agents served by the given container size
  3. Agents do not go offline due to rate limits during a policy change
    • This is likely the most controversial requirement
    • This would require that agents don't get rate limited for more than 5 mins
    • I would consider this the lowest priority of the requirements, as this could also be mitigated in the UI by reporting the status of a rollout

Lastly, we need to have some amount of breathing room / buffer so that future changes to Fleet Server do not force us to reduce the number of agents that a given container size can serve or force us to increase the rate limits by a significant amount, which would result in slower rollouts.

Process

This is essentially an optimization problem and we will need to be diligent about how we go about testing configuration changes to ensure that we meet the hard requirements, produce the most optimal parameters, and ensure the chosen parameters are consistently reliable for the given target.

We need to run a separate series of tests at each container size that follows this sort of process:

  1. Choose a container size, create an ES cluster large enough not to be a bottleneck (recommend starting with 64gb of hot capacity)
  2. Start with the presets and the number of agents in the documentation for this container size. Do not set "max agents".
  3. Enable APM tracing on Fleet Server by adding this to the Fleet Server custom yaml config in Fleet UI:
    server.instrumentation:
      enabled: true
      hosts:
      - <APM server host>
      secret_token: <APM secret token>
      transaction_sample_rate: 0.05
    
  4. Run the perf suite for the number of agents in the docs
  5. When complete, take note of
    • Did any of the ACs fail?
    • Did the Integrations Server container restart or hit max memory?
    • How long did each AC take to complete?
    • Did APM Server reject any traces?
  6. Prepare the parameters for the next test and re-run
    • Right now we're trying to improve rollout times, not agent counts. So we'll modify the rate limits accordingly.
    • If any of the hard requirements were not met, you are done and the previous test should be the new limits
    • If the hard requirements were met, we can run a new test with modified rate limits for checkin and ack. We are not optimizing the artifact and enroll limits at this time.
      • server.limits.checkin_limit.interval - reduce by 10%, never go below 500us
      • server.limits.checkin_limit.burst - increase by 20%
      • server.limits.checkin_limit.max - increase by 20%
      • server.limits.ack_limit.interval - reduce by 10%, never go below 500us
      • server.limits.ack_limit.burst - increase by 20%
      • server.limits.ack_limit.max - increase by 20%
    • Run Steps 1-6 again

When adjusting the parameters in step (6), use your common sense. If you see that checkins are rate limited the most, you may want to adjust those first without adjusting acks, and vice versa. An example of one round of adjustments is sketched at the end of this comment.

We'll then need to repeat this process for each container size. It's very important that our results are recorded diligently so that we can easily determine what other things we could try to tune performance.
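For illustration, here is a hedged sketch of what one round of the step 6 adjustments could look like when applied as an override in the Fleet Server policy config. The starting values (interval 1ms, burst 4000, max 8000) are hypothetical and chosen only to show the 10%/20% arithmetic and the 500us floor; artifact and enroll limits are left untouched, as noted above:

server.limits:
  checkin_limit:
    interval: 900us   # previous 1ms reduced by 10%; stop reducing once 500us is reached
    burst: 4800       # previous 4000 increased by 20%
    max: 9600         # previous 8000 increased by 20%
  ack_limit:
    interval: 900us   # previous 1ms reduced by 10%; same 500us floor applies
    burst: 4800       # previous 4000 increased by 20%
    max: 9600         # previous 8000 increased by 20%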

@jlind23

jlind23 commented Dec 7, 2022

@joshdover @kpollich pushing this out to one of our next sprints because, as stated recently, we first need to fix the long polling problem.

@jlind23

jlind23 commented Jan 4, 2023

@joshdover @ablnk the long polling problem is now fixed on staging, could we perform those tests then?

@ablnk

ablnk commented Jan 4, 2023

@jlind23 I'm ready to perform that, but I'm waiting for confirmation that we are good to go with all of the parameters used: https://github.com/elastic/ingest-dev/issues/1360#issuecomment-1362998446

@jlind23

jlind23 commented Jan 4, 2023

@ablnk are you talking about those below?

server.timeouts.checkin_long_poll: 28m
server.timeouts.write: 29m

Then yes. cc @joshdover to confirm.

@ablnk

ablnk commented Jan 10, 2023

@joshdover I'm wondering if there are any technical reasons why we should never go below 500us for server.limits.ack_limit.interval? We're currently using 250us as a precondition for automated test runs with 50k+ agents, using the following params: https://github.com/elastic/observability-perf/issues/208#issuecomment-1263675432

Starting from gt10000_limits.yml it is already 500us, so there is not much room for maneuver. I conducted a few tests with 15k agents and I always hit rate limit errors on the acks route. I tried increasing server.limits.ack_limit.max and server.limits.ack_limit.burst by 20%, but didn't notice any significant difference.

Btw, you can track testing progress here: https://docs.google.com/spreadsheets/d/1hPeMVOS9YPxdo1SOGEqxVEnxtgOCiHapRNZiwRQWCZk

Please take a look, perhaps we need to tune performance in a different way.

@joshdover

We discussed this today and came to some conclusions (though I've filled in a bit more detail based on my own thoughts):

  • We do not need to retain nearly as much memory space for APM as previously thought. This is based on discussions I had with @simitt a few weeks ago.
  • We can and should decrease the interval if the rate limits are still too aggressive and memory is not being used well. Note that this may increase CPU consumption as well, so we'll want to see if we're getting bottlenecked there.
  • When testing larger numbers of agents on smaller amounts of RAM, we should keep an eye on the "API key authz hit rate" graph in the Fleet Scaling dashboard. If this is lower than 80% when using a single container, it's likely that the cache_limits need to be increased. We should aim to hit 80%+ (see the sketch after this list).
  • To scope down this issue, we're going to focus on what we can achieve with a single 8gb Fleet instance and a 128gb ES cluster. We should try to get 50k agents working on a single instance reliably. This will certainly require increasing the cache limits.
  • If we are not able to get 50k agents working on a single container, we will determine the maximum this size can support and then focus on the larger containers being made available in https://github.com/elastic/cloud/issues/86611. Note: these are currently only available in QA - we should wait until they're in at least staging.
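As a concrete illustration of the cache point above (a sketch only, not a validated recommendation): taking the cache values quoted later in this thread (384000 counters, ~240 MB max cost) and applying the ~20% increase that was used in a later test run gives roughly:

cache_limits:
  num_counters: 460800      # 384000 increased by 20%
  max_cost: 301989888       # 251658240 bytes (~240 MB) increased by 20% (~288 MB)

Note that cache_limits is the preset-file spelling; as discussed further down in this thread, the name accepted by the policy config override turned out to differ.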

@jlind23

jlind23 commented Jan 13, 2023

Assigning this to @ablnk as the remaining work is on his end.

@jlind23 jlind23 assigned ablnk and unassigned juliaElastic Jan 13, 2023
@ablnk

ablnk commented Jan 17, 2023

Conducted a series of tests with 50k agents on a single 8 GB Integrations Server instance. Results are promising: https://docs.google.com/spreadsheets/d/1hPeMVOS9YPxdo1SOGEqxVEnxtgOCiHapRNZiwRQWCZk - it looks like it can handle that load without being restarted or hitting max memory. Testing was conducted with data streaming enabled.

The initial test run was performed with the following Fleet Server parameters: https://github.com/elastic/fleet-server/blob/main/internal/pkg/config/defaults/max_limits.yml. For the second test run I decreased the ack limit interval to 250us and increased the cache limits by 20%, which eventually resulted in better performance - all of the ACs completed faster.

@jlind23

jlind23 commented Jan 17, 2023

@ablnk does it mean that, as a next step, Julia can take your Google doc, update the config file according to the parameters you used, and then merge it?

@ablnk

ablnk commented Jan 17, 2023

As max_limits.yml is the default for the 30k-100k range, let me do a few more tests with 75k and 100k to ensure it works at all scales.

@ablnk

ablnk commented Jan 20, 2023

Through experimentation, I found out that two 8 GB Integrations Server instances can handle a 100k agent load (including data streaming & APM instrumentation) without hitting max memory or restarting.

The most resource-demanding operation is enrollment, but its performance is highly dependent on the resources available to Elasticsearch. The more vCPUs and RAM Elasticsearch has, the faster enrollment is performed (note that I always use the "CPU optimized" template when I create deployments), whereas all other operations take about the same time to complete regardless of the increased resources for Elasticsearch. E.g. if you compare two deployments, one with 384 GB RAM | 192 vCPU for Elasticsearch and 16 GB RAM | 8 vCPU for Integrations Server, and another with 512 GB RAM | 256 vCPU for Elasticsearch and the same resources for Integrations Server, you'll find that enrollment of 100k agents performed ~30% faster on the second one; however, all other bulk actions took the same time. By contrast, adding more resources to the Integrations Server alone does not improve the speed of bulk action execution. I tried to adjust the enroll limit setting by decreasing the interval and increasing burst/max, but it didn't bring any performance gains.

You can look at the test results in more detail here: https://docs.google.com/spreadsheets/d/1hPeMVOS9YPxdo1SOGEqxVEnxtgOCiHapRNZiwRQWCZk/edit#gid=0 Note that agent metrics were not available for some reason at the time of testing (they used to be) in the us-central1 region, so I was only able to monitor Fleet Server memory usage in APM.

Regarding Fleet Server settings, I did a comparison of the settings I use:

server.runtime:
  gc_percent: 20
server.timeouts.checkin_long_poll: 28m
server.timeouts.write: 29m
cache_limits:
  num_counters: 384000
  max_cost: 251658240
server_limits:
  policy_throttle: 5ms
  max_connections: 32000
  checkin_limit:
    interval: 500us
    burst: 4000
    max: 25001
  artifact_limit: 
    interval: 500us
    burst: 4000
    max: 8000
  enroll_limit: 
    interval: 10ms
    burst: 100
    max: 200
  ack_limit:
    interval: 250us
    burst: 4000
    max: 8000

versus the settings from that PR (on the same deployment). Enrollment went way faster with these (there were fewer enroll rate-limited requests), though all other operations took about the same time to complete.

@joshdover

> The most resource-demanding operation is enrollment, but its performance is highly dependent on the resources available to Elasticsearch.

This makes 100% sense, and I'm glad to see it confirmed in the data. ES is the bottleneck because enrollment requires that Elasticsearch generate 2 new API keys per agent, which requires running CPU-bound crypto computations. Increasing the number of ES nodes will improve the overall throughput of generating keys, while increasing FS resources will not have any impact.

I would be curious to see if we could minimize the ES resources required for normal day-to-day operations, which is unlikely to ever involve needing to quickly enroll 100k agents at once. I'm also curious if instead of increasing the hot tier capacity, we could get the same improvements in enrollment by adding more coordinating-only nodes, which is likely to be much cheaper than adding hot tier capacity. That said, I think this is out of scope for this issue.

All of this said -- I think we have not used the right setting names for the Fleet Server configurations 😢 I just noticed that the naming in the limit preset files in Fleet Server's codebase doesn't match the names we support in the actual configuration. Notably, we use server_limits in the limit preset files, but the raw config is named server.limits. I think the presets still work when configuring "max agents", but this means you can't copy/paste settings from the presets into the Fleet Server policy config as overrides without changing the naming. I think this invalidates some of the testing done here, as I suspect the server_limits block was completely ignored.
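To make the mismatch concrete, a minimal illustration (the nested ack_limit setting and its value are just an example; a similar mismatch applies to the cache settings, as noted in later comments):

# Spelling used inside the limit preset files (internal/pkg/config/defaults):
server_limits:
  ack_limit:
    interval: 500us

# Spelling expected when overriding via the Fleet Server policy config:
server.limits:
  ack_limit:
    interval: 500us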

I also don't think we should compare the APM memory measurements to the metrics we were using previously as they are likely measured slightly differently. We need to have metrics restored on Staging to be able to make an apples-to-apples comparison here.

Next steps & learnings:

  • We should validate Fleet Server's current configuration before executing a test. This should already be logged after the changes in Log redacted config on config changes #1671
  • We need metrics to be re-enabled on 8.6; I believe this may be working again on the 8.6.1 BC already, so we could do our testing there.
  • After validating what settings the 8gb container needs for 50k, we need to update https://github.com/elastic/fleet-server/pull/2043/files correctly (there are several things there that aren't quite right).
  • We also need to run tests on the 16gb instance sizes once available (not yet available)

@ablnk

ablnk commented Jan 23, 2023

> I suspect the server_limits block was completely ignored

That explains why I didn't notice any difference when I adjusted some settings.

@joshdover

We met about this earlier today and were able to verify that, with the correct config names, Fleet Server logs out the updated configuration.

Next we will retest with 50k agents on a single 8gb instance, and if that is successful we will test 100k on 2 x 8gb instances.

@joshdover

We are blocked on being able to view metrics by https://github.com/elastic/cloud/issues/111572

@joshdover

@ablnk the issue has been closed and I've confirmed metrics are working again

@ablnk

ablnk commented Jan 25, 2023

@joshdover @jlind23 OK, I conducted testing of 50k agents on a single 8gb instance with the correct setting names server.limits and cache.limits. Details here (see the last three 50k runs): https://docs.google.com/spreadsheets/d/1hPeMVOS9YPxdo1SOGEqxVEnxtgOCiHapRNZiwRQWCZk
Summary: I wasn't able to enroll the agents; I made several attempts under identical conditions to be sure. On each attempt, I found that after 30 minutes enrollment got stuck at 32-35k agents without progressing, and on the contrary, some of the agents went offline. For the next test run I used the settings with which I had gotten good results before (the set of settings that used server_limits and cache_limits). I got good results again; enrollment completed in 24 mins. However, it turned out that the server_limits and cache_limits blocks are ignored in this case (as Josh mentioned) and the server configuration falls back to default values (I verified that via the "server configuration has changed" log, where I found settings such as new.Inputs.Server.Limits.AckLimit.Interval set to a different value than was actually set in the Fleet Server settings in the UI). I attached that log to this comment so you can verify the settings with which we can get good results.
server configuration has changed.txt

@joshdover

It seems the longer long polling is having a very large positive impact on the scalability of Fleet Server, which explains why even the default settings are performing well with 50k agents per 8gb container.

I lean towards not making any changes, closing this issue for now, and revisiting this when we want to reach larger scale targets. The existing presets will work well for on-prem deployments that don't have their proxy configured correctly for longer polling intervals, and we can optimize this in our managed offering without making changes to the on-prem presets.

@jlind23 WDYT?

@jlind23

jlind23 commented Jan 25, 2023

@joshdover Agreed, we are spending way too much time without being able to obtain drastically different results.
Thus I believe we should leave it as is and resurrect it if it becomes a priority again.

@jlind23 jlind23 closed this as not planned Jan 25, 2023
@ablnk

ablnk commented Jan 27, 2023

@joshdover @jlind23 here are the results of testing 100k agents on 2 x 8gb instances. In this case, using the settings had a positive effect (in contrast to the 50k test).

Test run 1
ES 512 GB | Integrations Server 16 GB 16 vCPU
Fleet Server settings:

server.runtime:
  gc_percent: 20          
server.timeouts.checkin_long_poll: 28m
server.timeouts.write: 29m
server.instrumentation:
  enabled: true
  hosts:
  - <>
  secret_token: <>
  transaction_sample_rate: 0.05

Enrollment took 78 mins

Test run 2
ES 512 GB | Integrations Server 16 GB 16 vCPU
Fleet Server settings:

server.runtime:
  gc_percent: 20          
server.timeouts.checkin_long_poll: 28m
server.timeouts.write: 29m
cache:
  num_counters: 640000
  max_cost: 419430400
server.limits:
  checkin_limit:
    interval: 250us
    burst: 8000
    max: 55000
  artifact_limit:
    interval: 250us
    burst: 8000
    max: 16000
  ack_limit:
    interval: 250us
    burst: 8000
    max: 16000
  enroll_limit:
    interval: 10ms
    burst: 200
    max: 400
server.instrumentation:
  enabled: true
  hosts:
  - <>
  secret_token: <>
  transaction_sample_rate: 0.05

Enrollment took 37 mins
