Add load limit presets for larger container sizes #1857
Because the Fleet Server is currently sharing the container with APM and other processes (filebeat, etc.), particular attention must be paid to the peak running capacity of all processes. Fleet Server cannot be allowed to starve out APM, for example, particularly in regards to RAM.
After discussing with @joshdover, we should take a look at the scale testing we did and probably come out with the following additional preset limits:
As discussed in a debugging session, there should be a file for 100k as well. @joshdover it is not clear from the description what we want to set as the new limits. Is there any other setting we want to change other than the RAM and the number of agents?
This should be covered by the max_limits file.
Yes, we need to define this clearly. @ablnk has been using some of the settings recommended over in https://github.com/elastic/observability-perf/issues/208#issuecomment-1263675432. Originally these were intended for larger instance sizes, but I think we've been using them on smaller sizes too. I think we still need a product decision about how much resource we need to make available for APM Server vs. Fleet Server and how to make this trade-off clear to the user in the documentation. @ablnk can you share exactly which parameters you are using, your Integrations Server sizing, and what scale you're testing with?
@joshdover this can be checked on the configs tab in the spreadsheet.
Blocked while waiting for testing: #2043 (comment)
@pjbertels @joshdover what is left here that prevents us from moving this forward?
Several weeks ago we discussed running a systematic series of tests to determine what these limits should be at each container size. This is an attempt to codify that for discussion so we can start executing these tests.

### Goal

Fleet Server's "max agents" setting and the corresponding sizing documentation have a hard requirement of satisfying the following:
These settings should also aim to optimize the following parameters:
Lastly, we need to have some amount of breathing room / buffer so that future changes to Fleet Server do not force us to reduce the number of agents that a given container size can serve or force us to increase the rate limits by a significant amount, which would result in slower rollouts.

### Process

This is essentially an optimization problem and we will need to be diligent about how we go about testing configuration changes to ensure that we meet the hard requirements, produce the most optimal parameters, and ensure the chosen parameters are consistently reliable for the given target. We need to run a separate series of tests at each container size that follows this sort of process:
When adjusting the parameters in step (6), use your common sense. If you see that checkins are rate limited the most, you may want to adjust that first without adjusting acks, and vice versa. We'll then need to repeat this process for each container size. It's very important that our results are recorded diligently so that we can easily determine what else we could try to improve performance.
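To make step (6) concrete, here is a minimal sketch of the kind of checkin vs. ack rate-limit knobs being discussed, written in the nested form of the dotted `server.limits.*` names that appear later in this thread. The nesting and the numbers are illustrative assumptions, not a recommended preset.

```yaml
# Illustrative only: the two limiters step (6) suggests tuning independently.
# If checkins are the most rate-limited route, loosen checkin_limit first;
# if acks are, loosen ack_limit instead.
server:
  limits:
    checkin_limit:
      interval: 500us   # assumed meaning: minimum spacing between accepted checkin requests
      burst: 1000       # assumed meaning: short-term burst allowance
      max: 10001        # assumed meaning: cap on concurrent checkin requests
    ack_limit:
      interval: 500us
      burst: 1000
      max: 2000
```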
@joshdover @kpollich pushing this out to one of our next sprints because, as stated recently, we first need to fix the long polling problem.
@joshdover @ablnk the long polling problem is now fixed on staging, could we perform those tests now?
@jlind23 I'm ready to perform that, but I'm waiting for confirmation that we are good to go with all of the parameters used: https://github.com/elastic/ingest-dev/issues/1360#issuecomment-1362998446
@ablnk are you talking about those below?
Then yes. cc @joshdover to double-check.
@joshdover I'm wondering if there are any technical reasons why we should. Starting from gt10000_limits.yml, it is already 500us, so there is not much room for maneuver. I conducted a few tests with 15k agents and I always hit rate limit errors on the acks route. I tried increasing server.limits.ack_limit.max and server.limits.ack_limit.burst by 20%, but didn't notice any significant difference. By the way, you can track testing progress here: https://docs.google.com/spreadsheets/d/1hPeMVOS9YPxdo1SOGEqxVEnxtgOCiHapRNZiwRQWCZk Please take a look, perhaps we need to tune performance in a different way.
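For reference, a hedged sketch of the ~20% adjustment described above, expressed as the nested form of the `server.limits.ack_limit.*` settings quoted in this comment; the baseline numbers are placeholders rather than the actual gt10000_limits.yml values.

```yaml
# Hypothetical override: keep the 500us ack interval and raise burst/max by ~20%.
server:
  limits:
    ack_limit:
      interval: 500us   # unchanged, as in gt10000_limits.yml
      burst: 1200       # placeholder baseline 1000, increased by 20%
      max: 2400         # placeholder baseline 2000, increased by 20%
```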
We discussed this today and came to some conclusions (though I've filled in a bit more detail based on my own thoughts):
Assigning this to @ablnk as the remaining work is on his end.
Conducted a series of tests with 50k agents on a single 8 GB Integrations Server instance. Results are promising: https://docs.google.com/spreadsheets/d/1hPeMVOS9YPxdo1SOGEqxVEnxtgOCiHapRNZiwRQWCZk; it looks like it can handle that load without being restarted or hitting max memory. Testing was conducted with data streaming enabled. The initial test run was performed with the following Fleet Server parameters: https://github.com/elastic/fleet-server/blob/main/internal/pkg/config/defaults/max_limits.yml For the second test run I decreased the acks limit interval to 250us and increased the cache limits by 20%, which eventually resulted in better performance: all of the ACs completed faster.
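A sketch of what the second test run's adjustments could look like as a config override, assuming a `cache` section with `num_counters`/`max_cost` keys; the key names and all baseline values are assumptions for illustration, not the contents of max_limits.yml.

```yaml
# Illustrative second-run adjustments: faster ack interval, ~20% larger cache.
server:
  limits:
    ack_limit:
      interval: 250us     # decreased from the 500us used in the first run
cache:
  num_counters: 192000    # assumed key; placeholder baseline 160000, +20%
  max_cost: 251658240     # assumed key; placeholder baseline ~200 MiB in bytes, +20%
```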
@ablnk does this mean that, as a next step, Julia can take your Google doc, update the config file according to the parameters you used, and then merge it?
As max_limits.yml is the default for the 30k-100k range, let me do a few more tests with 75k and 100k to ensure it works at all scales.
Through experimentation, I found that two 8 GB Integrations Server instances can handle a 100k-agent load (including data streaming & APM instrumentation) without hitting max memory or restarting. The most resource-demanding operation is enrollment, but its performance is highly dependent on the resources available to Elasticsearch. The more vCPUs and RAM Elasticsearch has, the faster enrollment is performed (note that I always use the "CPU optimized" template when I create deployments), whilst all other operations take about the same time to complete regardless of increased resources for Elasticsearch. For example, if you compare two deployments, one with 384 GB RAM | 192 vCPU for Elasticsearch and 16 GB RAM | 8 vCPU for Integrations Server, and another with 512 GB RAM | 256 vCPU for Elasticsearch and the same resources for Integrations Server, you'll find that enrollment of 100k agents performed ~30% faster on the second one, while all other bulk actions took the same time. Meanwhile, adding more resources to the Integrations Server alone does not improve the speed of bulk action execution. I tried to adjust the enroll limit setting by decreasing the interval and increasing burst/max, but it didn't bring any performance gains.

You can get acquainted with the test results in more detail here: https://docs.google.com/spreadsheets/d/1hPeMVOS9YPxdo1SOGEqxVEnxtgOCiHapRNZiwRQWCZk/edit#gid=0 Note that agent metrics were for some reason not available at the time of testing (they used to be) in the us-central1 region, so I was only able to monitor Fleet Server memory usage in APM.

Regarding Fleet Server settings, I did a comparison of the settings I use:
versus the settings from that PR (on the same deployment). Enrollment went much faster (there were fewer enroll rate-limited requests) with these, though all other operations took about the same time to complete.
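For illustration, a hedged sketch of the kind of enroll-limit adjustment mentioned earlier in this comment (decreasing the interval, increasing burst/max). The actual values tested were not included above, so everything here is a placeholder.

```yaml
# Hypothetical enroll_limit tweak. As observed in these tests, loosening this
# limiter alone does not speed up enrollment much, because throughput is bound
# by Elasticsearch generating API keys for each agent.
server:
  limits:
    enroll_limit:
      interval: 5ms   # placeholder: decreased from an assumed 10ms baseline
      burst: 150      # placeholder: increased from an assumed 100
      max: 300        # placeholder: increased from an assumed 200
```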
This makes 100% sense, and I'm glad to see it confirmed in the data. ES is the bottleneck because enrollment requires that Elasticsearch generate 2 new API keys per Agent, which requires running CPU-bound crypto computations. Increasing the number of ES nodes will improve the overall throughput of generating keys, while increasing FS resources will not have any impact. I would be curious to see if we could minimize the ES resources required for normal day-to-day operations, which are unlikely to ever involve needing to quickly enroll 100k agents at once. I'm also curious whether, instead of increasing the hot tier capacity, we could get the same improvements in enrollment by adding more coordinating-only nodes, which is likely to be much cheaper than adding hot tier capacity. That said, I think this is out of scope for this issue.

All of this said -- I think we have not used the right setting names for the Fleet Server configurations 😢 I just noticed that the naming in the limit preset files in Fleet Server's codebase doesn't match the names we support in the actual configuration. Notably, we use

I also don't think we should compare the APM memory measurements to the metrics we were using previously, as they are likely measured slightly differently. We need to have metrics restored on Staging to be able to make an apples-to-apples comparison here.

Next steps & learnings:
That explains why I didn't notice any difference when adjusting some settings.
We met about this earlier today and were able to verify that, with the correct config names, Fleet Server logs the updated configuration. Next we will retest with 50k agents on a single 8 GB instance, and if successful we will test 100k on 2 x 8 GB instances.
We are blocked on being able to view metrics by https://github.com/elastic/cloud/issues/111572
@ablnk the issue has been closed and I've confirmed metrics are working again.
@joshdover @jlind23 OK, so I conducted testing with the correct setting names.
It seems the longer long polling is having a very large positive impact on the scalability of Fleet Server, which explains why even the default settings perform well with 50k agents per 8 GB container. I lean towards not making any changes, closing this issue for now, and revisiting this when we want to reach larger scale targets. The existing presets will work well for on-prem deployments that don't have their proxy configured correctly for longer polling intervals, and we can optimize this in our managed offering without making changes to the on-prem presets. @jlind23 WDYT?
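For context, a minimal sketch of the long-poll knob this refers to, assuming the setting is exposed as `server.timeouts.checkin_long_poll` (treat the name and value as an assumption; check the Fleet Server docs for the exact key). The practical point is that any proxy in front of Fleet Server must allow read/idle timeouts longer than this value, otherwise connections are cut before the long poll completes.

```yaml
# Assumed setting name and value; illustrative only.
# A longer checkin long poll means fewer checkin round-trips per agent,
# which lowers per-container load and lets a given preset serve more agents.
server:
  timeouts:
    checkin_long_poll: 5m
```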
@joshdover Agreed, we are spending way too much time without being able to obtain drastically different results.
@joshdover @jlind23 here are the results of testing 100k on 2 x 8 GB instances. In this case, using the settings had a positive effect (in contrast to the 50k test).

Test run 1
Enrollment took 78 mins.

Test run 2
Enrollment took 37 mins.
We need to be able to support larger container sizes in Cloud for Fleet Server and have it scale appropriately to maximize its resource utilization. This allows more agents to connect to each instance, which will help improve the effectiveness of caching in Fleet Server.
We currently store configuration presets in https://github.com/elastic/fleet-server/tree/main/internal/pkg/config/defaults, which max out at 30k agents on a 32 GB container. I think we need to make a few adjustments:
One thing that is blocking this is #1841, which prevents us from overriding the presets, so we can't experiment with the parameters to determine what the new presets should be.
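To make the ask concrete, here is a rough sketch of what an additional preset file for a larger size could look like. The file name, key layout, and every value are assumptions that only approximate the existing files under internal/pkg/config/defaults; the real numbers are exactly what the testing discussed in this issue is meant to determine.

```yaml
# Hypothetical shape of a new preset, e.g. a gt50000_limits.yml for bigger containers.
num_agents: 50000          # upper bound of the agent range this preset would target
recommended_ram: 8192      # container RAM (MB) this preset would assume
cache_limits:
  num_counters: 320000
  max_cost: 209715200      # bytes
server_limits:
  checkin_limit:
    interval: 250us
    burst: 4000
    max: 50001
  ack_limit:
    interval: 250us
    burst: 4000
    max: 8000
  enroll_limit:
    interval: 10ms
    burst: 200
    max: 400
```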