
[Performance] Kibana RSS / Memory usage seems higher in 8.2 vs previous releases (by 50% sometimes) #128061

Closed
EricDavisX opened this issue Mar 18, 2022 · 18 comments
Labels
blocker, bug (Fixes for quality problems that affect the customer experience), impact:critical (This issue should be addressed immediately due to a critical level of impact on the product), loe:medium (Medium Level of Effort), performance, Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc), v8.2.0

Comments

@EricDavisX
Contributor

Kibana version: 8.2.0-SNAPSHOT (details below show the problem seems to have started around March 1 2022)

  • latest commit tested was: ad5c67b

Elasticsearch version:
same as Kibana

Original install method (e.g. download page, yum, from source, etc.):

  • Deployed via cloud, on the cloud-prod CFT region, but the problem was initially triaged and seen apart from cloud as well.
    The issue was first found by the kbn-alert-load Rules Performance tester, run by the Response Ops team. I can forward the related Slack thread, though I will try to capture the most relevant data here. I can also forward more details, such as the kbn-alert-load Jenkins job output, which can easily be re-run against a later snapshot if we want to repeat the test.

Describe the bug:
The test spins up several cloud clusters and individually assesses their memory usage and other metrics after creating a variable number of rules, a variable percentage of which create alerts when they execute.

Screenshots (if relevant):
[Screenshot: 8.2 Kibana memory raised]

Errors in browser console (if relevant):
n/a

Provide logs and/or server output (if relevant):
Here is what we've already gathered from Slack:

Questions asked (and answered, mostly):
Is this repeatable, or was this a single run that could be off (I guess not)?
Yes, repeatable with the kbn-alert-load tool and locally

Is that also reproducible locally?
Not applicable to the tool's use, but yes, others see memory concerns locally.
From Brandon K: Anecdotally, I'm seeing the memory usage increase from 8.1 to 8.2 even when there are no alerting rules running. The following are heap usages when starting up a new Kibana instance and logging in:
8.1: 403 MB
8.2: 574 MB

For an RSS increase, the first thing that would be interesting to understand is whether it’s an increase in native memory or heap memory.
Heap dumps from Brandon K are here:
https://drive.google.com/drive/folders/1n0jjJ_H3oEbViMYjd8eKKqqcwM2mbcOX?usp=sharing
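
To tell the two apart, here is a minimal sketch (not Kibana code; purely illustrative) that dumps the relevant process.memoryUsage() fields and writes a heap snapshot that can be opened in Chrome DevTools:

```ts
// Minimal sketch (not Kibana code): compare JS heap vs. total RSS, then write
// a heap snapshot for offline analysis in Chrome DevTools.
import { writeHeapSnapshot } from 'v8';

const { rss, heapTotal, heapUsed } = process.memoryUsage();
console.log({
  rssMB: Math.round(rss / 1024 / 1024),
  heapUsedMB: Math.round(heapUsed / 1024 / 1024),
  // If rss grows while heapUsed stays flat, the growth is likely native memory
  // (buffers, native addons) rather than JS heap.
  nonHeapMB: Math.round((rss - heapTotal) / 1024 / 1024),
});

// Writes a .heapsnapshot file to the working directory and returns the filename.
console.log(writeHeapSnapshot());
```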

NOTE: we suspected a Node version change, but it does not seem to be the Node version: 8.0 uses Node 16.13.2 (see https://github.com/elastic/kibana/blob/8.0/.node-version), as does main (https://github.com/elastic/kibana/blob/main/.node-version).

Patrick M offered up a script he had used to create many running rules locally, if helpful:
https://gist.github.com/pmuellr/f30331660ae032a0a7ccf2767aea3900
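
For reference, a rough sketch along the same lines (this is not Patrick's gist; the endpoint is Kibana's alerting HTTP API, but the credentials and the .index-threshold params below are assumptions to adjust for your setup and rule type):

```ts
// Rough sketch, not Patrick's gist: bulk-create index-threshold rules via
// Kibana's alerting API. KIBANA_URL, AUTH, and the rule params are assumptions.
// Requires Node 18+ for global fetch (or substitute node-fetch).
const KIBANA_URL = 'http://localhost:5601';
const AUTH = 'Basic ' + Buffer.from('elastic:changeme').toString('base64');

async function createRule(i: number): Promise<void> {
  const res = await fetch(`${KIBANA_URL}/api/alerting/rule`, {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'kbn-xsrf': 'true',
      authorization: AUTH,
    },
    body: JSON.stringify({
      name: `load-test-rule-${i}`,
      rule_type_id: '.index-threshold',
      consumer: 'alerts',
      schedule: { interval: '1m' },
      notify_when: 'onActiveAlert',
      actions: [],
      // Assumed params for an index-threshold rule; adjust for your data.
      params: {
        index: ['load-test-data'],
        timeField: '@timestamp',
        aggType: 'count',
        groupBy: 'all',
        thresholdComparator: '>',
        threshold: [0],
        timeWindowSize: 5,
        timeWindowUnit: 'm',
      },
    }),
  });
  if (!res.ok) throw new Error(`rule ${i} failed: ${res.status} ${await res.text()}`);
}

async function main() {
  for (let i = 0; i < 100; i++) await createRule(i);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```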

EricDavisX added the bug and performance labels on Mar 18, 2022
botelastic bot added the needs-team label on Mar 18, 2022
@EricDavisX
Contributor Author

I can copy in some additional research as posted in Slack, if desired, but I could also let Tyler, Mikhail, and the Response Ops team members cite their own latest findings. @kobelb @mshustov @danielmitterdorfer FYI

@EricDavisX
Contributor Author

From Jonathan B:
I can't speak for all of the CI metrics, but for package testing we first noticed this on March 3rd. It doesn't collect metrics, but there's some useful info in the build history.

Jonathan Budzenski 3 days ago
Mar 03 19:57:57 packer-virtualbox-iso-1561908076 kibana[2313]: [2022-03-03T19:57:57.109+00:00][INFO ][plugins.ruleRegistry] Installed resources for index .alerts-observability.uptime.alerts
Mar 03 19:57:57 packer-virtualbox-iso-1561908076 kibana[2313]: [2022-03-03T19:57:57.351+00:00][INFO ][plugins.ruleRegistry] Installed resources for index .preview.alerts-security.alerts
Mar 03 19:58:03 packer-virtualbox-iso-1561908076 kibana[2313]: <--- Last few GCs --->
Mar 03 19:58:03 packer-virtualbox-iso-1561908076 kibana[2313]: [2313:0x5170bb0] 28312 ms: Mark-sweep 477.4 (509.6) -> 475.3 (508.1) MB, 673.1 / 0.1 ms (average mu = 0.121, current mu = 0.008) allocation failure GC in old space requested
Mar 03 19:58:03 packer-virtualbox-iso-1561908076 kibana[2313]: [2313:0x5170bb0] 29291 ms: Mark-sweep 481.1 (509.8) -> 475.9 (508.6) MB, 968.9 / 0.1 ms (average mu = 0.062, current mu = 0.011) allocation failure scavenge might not succeed
Mar 03 19:58:03 packer-virtualbox-iso-1561908076 kibana[2313]: <--- JS stacktrace --

Jonathan Budzenski 3 days ago
https://buildkite.com/elastic/kibana-package-testing/builds/180#292b4f9b-a267-4d30-a768-211992b9407c

Jonathan Budzenski 3 days ago
coincides with #124816
#124816 [Security Solution] Initial version of the Cloud Security Posture app…
Summary: The initial version of the Cloud Security Posture application, hidden behind an advanced setting.
Issue: #124806
Labels: release_note:skip, backport:skip, v8.2.0, Team:Cloud Security Posture

Jonathan Budzenski 3 days ago
i'll see what happens if i revert
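
For anyone reproducing this locally, a small sketch (not part of the package tests; purely illustrative) that logs GC pauses and heap usage via Node's perf_hooks can make the pressure visible before the process actually hits the OOM shown in the trace above:

```ts
// Small sketch (not from the package tests): log every GC pause and the heap
// size afterwards, using Node's built-in 'gc' performance entries.
import { PerformanceObserver } from 'perf_hooks';

const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    const { heapUsed, heapTotal } = process.memoryUsage();
    console.log(
      `gc pause ${entry.duration.toFixed(1)} ms, ` +
        `heap ${Math.round(heapUsed / 1e6)}/${Math.round(heapTotal / 1e6)} MB`
    );
  }
});

// Start receiving a performance entry for every garbage collection.
obs.observe({ entryTypes: ['gc'] });
```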

@EricDavisX
Contributor Author

From Mikhail Shustov, Tyler and Patrick:
@tylersmalley Do you know why APM metrics collected on CI on the 8.1.0 branch have a gap starting from 08 March?
I wonder why the 8.1 snapshot (latest) in our test didn’t seem to follow that pattern
Maybe the workload during your testing is different from the workload performed at CI? Can you see the difference in the memory size before running your test suite?
👀

Patrick Mueller 1 day ago
Regarding the Rx/status stuff I noticed in the heap snapshots, I also noticed this PR got merged two weeks ago: #126320, and poking around a bit, also this PR: #126334. And I’ve recently noticed the status messages getting a little twitchy; I got this when using curl to create a new connector and then a new rule using that connector:
[... 14:22:20.123][INFO ][status] 1 plugins changed status from 'available' to 'unavailable': alerting
[... 14:22:20.374][INFO ][status] 1 plugins changed status from 'unavailable' to 'available': alerting
[... 14:22:21.965][INFO ][status] 1 plugins changed status from 'available' to 'unavailable': licensing
[... 14:22:22.395][INFO ][status] 1 plugins changed status from 'unavailable' to 'available': licensing
[... 14:22:35.487][INFO ][status] 1 plugins changed status from 'available' to 'unavailable': fleet
[... 14:22:35.737][INFO ][status] 1 plugins changed status from 'unavailable' to 'available': fleet
#126320 [Status service] log plugin status changes
#126334 PluginsStatusService: reuse plugin status observables

Patrick Mueller 1 day ago
Maybe the workload during your testing is different from the workload performed at CI?
Yeah, it’s certainly different. So one possibility is that our workload is now (indirectly) pulling in more code/processing than it did previously. If the previous versions had also touched the code that it’s (presumably) now touching, we wouldn’t see the difference. But the bulk of the difference I saw in the heap snapshots was mostly Observable related.

Tyler Smalley 1 day ago
@mikhail.shustov, that's when 8.1.0 was released, so the 8.1 branch was bumped to 8.1.1. To make it easier, you could compare labels.branch.
👍

@EricDavisX
Contributor Author

@jbudz did you find anything when you did the revert you were trying?

@mshustov
Contributor

You can find a couple of spikes on the RSS chart of the APM data collected on Kibana CI (around the end of January and on March 14).
[Screenshot: RSS chart, 2022-03-18_15-25-03]
You can find the APM data here.

@jbudz
Member

jbudz commented Mar 18, 2022

@jbudz did you find anything when you did the revert you were trying?

No, that wasn't the root cause. It still had a similar memory profile.

jbudz added the Team:Core label on Mar 18, 2022
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

botelastic bot removed the needs-team label on Mar 18, 2022
jbudz added the blocker label on Mar 18, 2022
@jbudz
Member

jbudz commented Mar 18, 2022

I added the blocker label for 8.2.0. Kibana's pretty close to OOM on a fresh cloud installation.

@mshustov
Contributor

mshustov commented Mar 21, 2022

I checked the memory snapshot with #126320 reverted. There are still a lot of Observable objects.
The status service seems to be the root cause of the problem. I made a heap snapshot on #128096 with the dev version of Kibana.

As you can see, there are quite a few RxJS objects created.
v8.2
[Screenshot: heap snapshot, 2022-03-21_18-14-44]
vs. #128096
[Screenshot: heap snapshot, 2022-03-21_18-14-56]
There are too many MapSubscriber instances to analyze their origin, but many of the SwitchMapSubscriber instances point to PluginsStatusService.

[Screenshot: 2022-03-21_18-21-44]

I noticed that this logic (which creates an observable) is called ~22,000 times on start: https://github.com/elastic/kibana/blob/main/src/core/server/status/plugins_status.ts#L132-L142
I'm not familiar enough with the service. Could someone from @elastic/kibana-core take a look at the problem, please?
If you are interested in analyzing the heap dump, you can use the built-in Node.js functionality: https://nodejs.org/api/cli.html#--heapsnapshot-signalsignal
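
To illustrate the shape of the problem (a toy sketch, not the actual PluginsStatusService code): creating a fresh derived observable per caller multiplies Map/SwitchMap subscribers, whereas building the observable once and sharing it keeps a single upstream subscription:

```ts
// Toy sketch, not Kibana's PluginsStatusService: why per-call observable
// creation piles up subscribers, and how sharing avoids it.
import { BehaviorSubject, Observable } from 'rxjs';
import { map, switchMap, shareReplay } from 'rxjs/operators';

interface PluginStatus {
  level: 'available' | 'degraded' | 'unavailable';
}

const coreStatus$ = new BehaviorSubject<PluginStatus>({ level: 'available' });
const depsStatus$ = new BehaviorSubject<PluginStatus[]>([]);

// Anti-pattern: a brand-new chain per call. Invoked ~22,000 times on start,
// each chain retains its own MapSubscriber/SwitchMapSubscriber on the heap
// for as long as anything stays subscribed.
function derivedStatusPerCall(): Observable<PluginStatus> {
  return coreStatus$.pipe(
    switchMap((core) =>
      depsStatus$.pipe(
        map((deps) =>
          deps.some((d) => d.level !== 'available') ? { level: 'degraded' as const } : core
        )
      )
    )
  );
}

// Mitigation sketch (in the spirit of reusing plugin status observables):
// build the derived observable once and share it, so every consumer reuses a
// single upstream subscription instead of creating its own chain.
const derivedStatus$: Observable<PluginStatus> = coreStatus$.pipe(
  switchMap((core) =>
    depsStatus$.pipe(
      map((deps) =>
        deps.some((d) => d.level !== 'available') ? { level: 'degraded' as const } : core
      )
    )
  ),
  shareReplay({ bufferSize: 1, refCount: true })
);
```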

@mshustov
Contributor

Reverting #126320 appears to fix this.

Maybe we can merge #128096 to check whether RSS is back to normal in the APM data gathered from CI: https://kibana-ci-apm.kb.us-central1.gcp.cloud.es.io:9243/goto/ac99b0f0-a93f-11ec-94a3-2d0a38b40710

@jbudz
Member

jbudz commented Mar 21, 2022

I pulled #128096 out of draft and added core as a reviewer - whatever's easiest to help investigate. Feel free to merge.

@mshustov
Contributor

I pulled #128096 out of draft and added core as a reviewer

The PR has been merged. I can see the drop in the APM metrics, but it doesn't seem we are back to normal.
3-month trend:
[Screenshot: 2022-03-22_11-51-52]

2-day trend:
[Screenshot: 2022-03-22_11-54-28]

@EricDavisX
Contributor Author

EricDavisX commented Mar 22, 2022

[Edited] - I misposted earlier, sorry for any confusion. I'm not sure how to test this pre-merge, but I will be watching for it afterwards.

@mshustov
Contributor

@gsoldevila is working on improving the plugin status service's memory footprint. I will test his PR and repeat the heap snapshot analysis.

exalate-issue-sync bot added the impact:critical and loe:medium labels on Mar 24, 2022
@mshustov
Contributor

mshustov commented Mar 28, 2022

@kobelb With #128324 merged, we can remove the blocker status. With the PR merged, consumed RSS memory was reduced by almost half, and event loop delay improved by nearly 2.5x.
One-week dashboard:
[Screenshot: 2022-03-28_09-20-02]

According to the APM metrics collected on CI, memory usage has rolled back to its all-time lows since January 12, when APM integration on CI was introduced.
Three-month dashboard:
[Screenshot: 2022-03-28_09-24-49]
[Screenshot: 2022-03-28_09-25-03]

@EricDavisX
Contributor Author

Among others, I'm sure, I'm waiting for a successful platform build to produce a new 8.2 snapshot; then a nightly run of the kbn-alert-load tool will pick up the change and show us the improvement (again). Very nice work!

afharo mentioned this issue on Mar 31, 2022
@lukeelmers
Member

We believe the root cause has been addressed in #128324, so we will go ahead and close this issue.

@EricDavisX Feel free to reopen or open a new issue if you find there is something we have missed!
