Measure APM agent impact on the platform performance #78792
Comments
Pinging @elastic/kibana-platform (Team:Platform) |
Setup
API performance testing is based on the setup in https://github.com/dmlemeshko/kibana-load-testing. I adjusted the number of requests so as not to overwhelm the APM server.
setUp(
  scn.inject(
    constantConcurrentUsers(15) during (2 minute),
    rampConcurrentUsers(15) to (20) during (2 minute)
  ).protocols(httpProtocol)
).maxDuration(15 minutes)
Tests were run against 7.10.0-SNAPSHOT.
Results
The APM agent seems to add a significant overhead (see the 95th percentile). Without APM agent:
With APM agent:
|
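For readers less familiar with Gatling, the injection profile in the setup above can be read as a target number of concurrent users over time: 15 users held for 2 minutes, then a ramp from 15 to 20 over the next 2 minutes. The sketch below only illustrates that shape. The function name is hypothetical, and it assumes the ramp is linear; it is not part of kibana-load-testing.

```python
def concurrent_users(t_seconds: float) -> float:
    """Target concurrent users at time t for the two-stage profile above.

    Assumption: the ramp stage increases linearly from 15 to 20 users.
    """
    constant_stage = 120.0  # constantConcurrentUsers(15) during (2 minute)
    ramp_stage = 120.0      # rampConcurrentUsers(15) to (20) during (2 minute)
    total = constant_stage + ramp_stage

    if t_seconds < 0 or t_seconds > total:
        # Outside the 4-minute profile no target is defined in this sketch.
        raise ValueError("t_seconds must be within the injection profile")
    if t_seconds <= constant_stage:
        return 15.0
    # Linear interpolation between 15 and 20 users during the ramp stage.
    progress = (t_seconds - constant_stage) / ramp_stage
    return 15.0 + progress * (20.0 - 15.0)


if __name__ == "__main__":
    for t in (0, 60, 120, 180, 240):
        print(t, concurrent_users(t))
```

The profile itself lasts 4 minutes, so the 15-minute maxDuration acts only as an upper bound on the whole run.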
Tested Kibana image doesn't contain changes introduced in #78697 |
https://www.elastic.co/guide/en/apm/agent/nodejs/master/performance-tuning.html provides some details on how to squeeze out a bit more performance. Other config values don't seem to affect CPU as much as the sample rate (transactionSampleRate) does, so I decided not to use them. @vigneshshanmugam do you have anything to add? |
As you have already figured
Perf tuning for the RUM agent: https://www.elastic.co/guide/en/apm/agent/rum-js/current/performance-tuning.html
I can't seem to find any other config that would help. |
So in summary, even with 'best' compromise configuration, 95th percentile is doubled, and 50th percentile tripled, right? This is... significant. |
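To make "doubled" and "tripled" comparable across runs, the overhead can be expressed as a relative increase over the run without the agent. This is a minimal sketch: the function name is hypothetical and the numbers in the usage lines are illustrative only, not measurements from this issue.

```python
def relative_overhead(baseline_ms: float, with_apm_ms: float) -> float:
    """Relative increase of a latency percentile when the APM agent is enabled.

    Returns 1.0 for a doubled value (+100%) and 2.0 for a tripled value (+200%).
    """
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return (with_apm_ms - baseline_ms) / baseline_ms


if __name__ == "__main__":
    # Illustrative values only, not taken from the test results.
    print(f"{relative_overhead(100.0, 200.0):.0%}")  # doubled
    print(f"{relative_overhead(100.0, 300.0):.0%}")  # tripled
```

Comparing the same percentile (50th with 50th, 95th with 95th) from both runs keeps the figure meaningful.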
@TinaHeiligers you asked how to perform testing: how to run Kibana with APM agent locally:
elastic.apm.active: true
elastic.apm.serverUrl: 'http://127.0.0.1:8200'
# elastic.apm.secretToken: ... <-- might be required in prod/cloud
# optional metrics to adjust performance
# see https://www.elastic.co/guide/en/apm/agent/nodejs/master/configuration.html
elastic.apm.centralConfig: false
elastic.apm.breakdownMetrics: false
elastic.apm.transactionSampleRate: 0.1
elastic.apm.metricsInterval: '120s'
how to run load testing against Kibana:
how to test Kibana on Cloud
|
@restrry I've followed your instructions above and with a little tweaking, was able to run the load tests against a local Kibana instance with and without APM running (through Docker). My setup thus far:
I left the DemoJourney simulation as is regarding requests:
In the screenshots below, I've highlighted the same queries in both cases, for ease of comparison.
Full Results:
With APM, using the Kibana APM settings suggested in the instructions:
Full Results:
Summary: |
Looks good overall. The only outlier is the
🚀 |
Progress was slow today, I really struggled to get Kibana 7.10 running and resorted to running Kibana off the distributable. Load tests without APM:
Full results: Load tests with APM: Full results: Summary: |
@dmlemeshko I experienced a similar problem when only the
@TinaHeiligers What Cloud settings did you use? There are recommended ones in https://github.com/elastic/kibana-load-testing
elasticsearch {
  deployment_template = "gcp-io-optimized"
  memory = 8192
}
kibana {
  memory = 1024
} |
I fixed a login issue for 7.10 that occurred when running load testing with a new deployment. The canvas endpoints also needed to be updated.
demojourney-20201112111618663.zip
The 7.10.0.conf deploy config has the same memory values @restrry posted above. |
@restrry
I haven't tested on Cloud yet, I'll do that today with the recommended settings. |
@dmlemeshko Thanks for fixing that issue! I reran the load test on a local Kibana 7.10 distributable and am not getting the errors seen previously. Test setup for both runs:
Load tests without APM: Full result Load tests with APM: Full result Summary: |
On Cloud staging, using an existing deployment without APM: Full Result
On Cloud staging, using an existing deployment with APM: Test run:
Full results On Cloud staging, creating a deployment as part of the test run:
Script-created deployment
Full Result On Cloud staging, creating a deployment as part of the test run: Not done |
@restrry I've added the results from the Kibana load testing on the cloud (staging) test run where APM is enabled in Kibana. Please let me know if I should repeat the tests with fewer/more concurrent users and/or change any of the APM settings. cc @joshdover |
@TinaHeiligers @restrry I can help with it, but if you are familiar with how to add a VM, the follow-up steps are:
|
@dmlemeshko I'm not familiar with adding a VM and would greatly appreciate your help! I'm happy to watch you go through the process on Zoom. In the meantime, I'll work through the guide. |
Why do we have such a significant difference between |
Here are the steps to spin up a Google Cloud VM and run tests on it:
Log in to https://console.cloud.google.com/ with your corp account.
Connect to the VM and create a test folder
In another terminal, upload the archive to the VM
In the first terminal (VM), unzip the project and start a Docker container, mapping a local path to a container path so that you can later exit the container and keep the results on the VM
Now you are in the container and should be able to see the test folder, which contains the unzipped project. Run the tests as you would locally
When the tests are done, type
From your local machine, run
Results should be available in the current path |
I think it'd also be worth understanding the difference between 7.11 w/ APM vs 7.10 and 7.9 w/o APM. Due to the many performance tweaks that were made to support Fleet, there may not be a large regression in 7.11 w/ APM enabled. If the difference is smaller, enabling this in 7.11 clusters may be an easier pill to swallow. Next, I'd also like to experiment with tweaking some other settings to see if we get any performance improvements:
If none of these result in improved performance, we may need to work directly with the APM team to look at some flamegraphs / profiles and see where most of the time is being spent in the APM agent code. |
That sounds right. Additionally we display stack traces for errors. Not sure if they are disabled too or if that's a different setting 🤔 |
That is a separate feature. The Node.js APM agent always captures a stack trace for a captured |
So we can identify slow operations, but we can't tell why they are so slow? It might be acceptable as long as we can re-configure APM settings and run an instance with |
I'd be interested in A) a higher sample rate and B) enabling breakdown metrics (now that #90955 has been merged). I don't expect breakdown metrics to be useful for Kibana as ~all async operations are Elasticsearch, but just curious about the difference now. |
@restrry might be useful for you folks as well: #90403 |
Sorry for the delay here folks, I had written a lengthy reply but it appears it didn't get posted. So let's try this again 😄
So I'm not the most statistically knowledgeable, but the way I was able to get reproducible results was by running the DemoJourney scenario 30 times and then taking the percentiles of the entire distribution of all request timings from all tests. This resulted in about 42k total requests (across about 20 different endpoints) for each configuration. I came up with the number 30 pretty arbitrarily, and I suspect we could lower it to speed up the time it takes to get answers back.
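The pooling approach described above can be sketched roughly as follows: collect the request timings from every run into one list, then take percentiles of that combined distribution instead of averaging per-run percentiles. The function names are hypothetical, and the nearest-rank percentile definition is an assumption; Gatling's own reports may compute percentiles differently.

```python
import math
from typing import Dict, Iterable, Sequence


def nearest_rank_percentile(sorted_values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    if not sorted_values:
        raise ValueError("no timings to summarize")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    rank = math.ceil(p / 100 * len(sorted_values))  # 1-based rank
    return sorted_values[rank - 1]


def pooled_percentiles(
    runs: Iterable[Iterable[float]],
    percentiles: Sequence[float] = (50, 95),
) -> Dict[float, float]:
    """Pool request timings (ms) from all runs, then compute percentiles once."""
    pooled = sorted(t for run in runs for t in run)
    return {p: nearest_rank_percentile(pooled, p) for p in percentiles}


if __name__ == "__main__":
    # Illustrative timings only; a real comparison would pool all runs
    # of one configuration (for example, 30 DemoJourney runs).
    runs = [[10, 20, 30], [40, 50], [60, 70, 80, 90, 100]]
    print(pooled_percentiles(runs))
```

Pooling weights every request equally, so a run with more requests contributes more to the result than a short run does.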
Taking another look at the tests I ran before, here is the same table, including the 50p numbers:
We do see an increase in 50p of 26% even with
Yep, we'll need to coordinate with the Cloud folks on how much access we can get in order to flip that switch on to grab some samples when needed. Ideally this would be self-service for Kibana developers (or at least for a handful of teams). If we are able to find a way to optimize this in the agent, then maybe we'd be able to do away with this overhead. @trentm is it possible to offload any of the CPU cycles here to another Node worker thread?
Yep, I don't think that PR was included in the snapshot I ran these tests under. I'll run some more this week to see if it makes an impact. Breakdown metrics would be helpful in some endpoints, but I'm not sure how a higher sample rate would be helpful to us for our use case? |
Mostly just interested in the performance impact of increasing or decreasing the sample rate, compared to the baseline. |
No, the APM agent doesn't currently support using worker threads for any of its work. Worth considering, but not something that would be available anytime soon. |
@joshdover Can I get a quick sanity check, please? When I was running DemoJourney against a local Kibana on my laptop I am getting values in the rough range of min=10ms to max=1300ms for the "Global Information" values in the gatling summary, e.g.: Doing a DemoJourney run against a newly deployed 7.12.0-SNAPSHOT I see values in the rough range of min=100ms, 50p=2500ms, max=24000ms, e.g.: Your values are quite a bit higher. I want to make sure we are quoting the same thing.
|
FYI: I'm in the process of removing bluebird #118097 (not sure if that's what's causing these issues, but just in case, I thought I'd let you know). |
I see that the last benchmarking results on this issue are from Nov 2020. |
@dmlemeshko Does something from https://github.com/elastic/kibana-load-testing provide any automatic data here? For example, with the recent #112973 merge, Kibana master has the Node.js APM agent on by default (in its reduced-functionality |
It might be useful to conduct testing against nodejs v16. AFAIK it contains changes to
Yes, you can find it here |
Should we maybe track the automation on an issue (or update this one)? I think this would be helpful to reduce performance implication concerns anyone may have when enabling APM on their cluster. |
I don't know the APM agent testing infrastructure well enough, but I'd be surprised if there is no such performance testing sandbox. cc @trentm and @vigneshshanmugam, who know better. Let's just keep this effort out of the scope of the current task. Kibana is not the best place for this kind of testing due to its high level of internal complexity. |
The Node.js APM agent does have regular benchmark runs with the data shown here: https://observability-benchmarks.elastic.dev/goto/ec051bde1fc50f0239710a3b5c08867a There was some (timeboxed) work done on closer-to-real-world performance analysis of the Node.js APM agent earlier in elastic/apm-agent-nodejs#2028 That work did not include a regular testing framework. I was somewhat hoping that Kibana usage of the APM agents and https://github.com/elastic/kibana-load-testing might provide a path to getting a feel for APM impact on a large real-world app. However, I might be misunderstanding the goals of kibana-load-testing.git, so my hope is unfair. |
RUM agent benchmarks are also on the same cluster; you can check the RUM dashboard - https://observability-benchmarks.elastic.dev/goto/27dac144459a24fc7e49a461cd81fca9 We have both micro and macro benchmarks for the hot paths of the code. However, the macro benchmark does not cover a general application; instead, it simulates a blank and a heavy page. You can read more details about the RUM benchmarking in this document |
@trentm @vigneshshanmugam @mshustov |
I can see the benefits of using Kibana as a real-world scenario for performance testing. But I can see a few problems:
Maybe APM performance testing should belong to https://github.com/elastic/apm-integration-testing?
IMO it's the quickest solution for now. |
Hi everyone. Since we started to use a bare metal machine for scalability testing, I decided to double-check the impact of APM on the Kibana server. Results are available in elastic/kibana-load-testing/issues/221 |
Addressed with #129585 |
We are working on enabling APM agent on the prod build #70497. Before making this happen we want to understand what performance overhead it adds to the Kibana server. We might be able to re-use the setup introduced in #73189 to measure the average response time & number of requests Kibana can handle with and without APM agent enabled.