
[Performance] Kibana RSS / Memory usage seems higher in 8.2 vs previous releases (by 50% sometimes) #128061

Closed
EricDavisX opened this issue Mar 18, 2022 · 18 comments
Labels
blocker, bug (Fixes for quality problems that affect the customer experience), impact:critical (This issue should be addressed immediately due to a critical level of impact on the product), loe:medium (Medium Level of Effort), performance, Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc), v8.2.0

Comments

@EricDavisX
Contributor

Kibana version: 8.2.0-SNAPSHOT (details below show the problem seems to have started around March 1 2022)

  • latest commit tested was: ad5c67b

Elasticsearch version:
same as Kibana

Original install method (e.g. download page, yum, from source, etc.):

  • Deployed via cloud, on the cloud-prod CFT region, but the problem was initially triaged and seen apart from cloud as well.
    The issue was first found by the kbn-alert-load Rules Performance tester, run by the Response Ops team. I can forward the related Slack thread, though I will try to capture the most relevant data here. I can also forward more details, such as the kbn-alert-load Jenkins job output, which can easily be re-run against a later snapshot if we want to repeat the test.

Describe the bug:
The test spins up several cloud clusters and individually assesses their memory usage and other metrics after creating a variable number of rules, a variable percentage of which create alerts when they execute.

Screenshots (if relevant):
[Screenshot: 8.2 Kibana memory raised]

Errors in browser console (if relevant):
n/a

Provide logs and/or server output (if relevant):
Here is what we've already gathered from Slack:

Questions asked (and answered, mostly):
Is this repeatable, or was this a single run that could be off (I guess not)?
Yes, repeatable with the kbn-alert-load tool and locally

Is that also reproducible locally?
Not applicable to the tool's use, but yes, others see memory concerns locally.
From Brandon K: Anecdotally, I'm seeing the memory usage increase from 8.1 to 8.2 even when there are no alerting rules running. The following are heap usages when starting up a new Kibana instance and logging in:
8.1: 403 MB
8.2: 574 MB

For an RSS increase, the first thing that would be interesting to understand is whether it’s an increase in native memory or heap memory.
Heap dumps from Brandon K are here:
https://drive.google.com/drive/folders/1n0jjJ_H3oEbViMYjd8eKKqqcwM2mbcOX?usp=sharing
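
To tell the two apart, here is a minimal sketch (not Kibana code; purely illustrative) that dumps the relevant process.memoryUsage() fields and writes a heap snapshot that can be opened in Chrome DevTools:

```ts
// Minimal sketch (not Kibana code): compare JS heap vs. total RSS, then write
// a heap snapshot for offline analysis in Chrome DevTools.
import { writeHeapSnapshot } from 'v8';

const { rss, heapTotal, heapUsed } = process.memoryUsage();
console.log({
  rssMB: Math.round(rss / 1024 / 1024),
  heapUsedMB: Math.round(heapUsed / 1024 / 1024),
  // If rss grows while heapUsed stays flat, the growth is likely native memory
  // (buffers, native addons) rather than JS heap.
  nonHeapMB: Math.round((rss - heapTotal) / 1024 / 1024),
});

// Writes a .heapsnapshot file to the working directory and returns the filename.
console.log(writeHeapSnapshot());
```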

NOTE: we suspected a Node version change, but it does not seem to be the Node version: 8.0 uses Node 16.13.2 (see https://github.com/elastic/kibana/blob/8.0/.node-version), as does main (https://github.com/elastic/kibana/blob/main/.node-version).

Patrick M offered up a script he had used to create many running rules locally, if helpful:
https://gist.github.com/pmuellr/f30331660ae032a0a7ccf2767aea3900
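
For reference, a rough sketch along the same lines (this is not Patrick's gist; the endpoint is Kibana's alerting HTTP API, but the credentials and the .index-threshold params below are assumptions to adjust for your setup and rule type):

```ts
// Rough sketch, not Patrick's gist: bulk-create index-threshold rules via
// Kibana's alerting API. KIBANA_URL, AUTH, and the rule params are assumptions.
// Requires Node 18+ for global fetch (or substitute node-fetch).
const KIBANA_URL = 'http://localhost:5601';
const AUTH = 'Basic ' + Buffer.from('elastic:changeme').toString('base64');

async function createRule(i: number): Promise<void> {
  const res = await fetch(`${KIBANA_URL}/api/alerting/rule`, {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'kbn-xsrf': 'true',
      authorization: AUTH,
    },
    body: JSON.stringify({
      name: `load-test-rule-${i}`,
      rule_type_id: '.index-threshold',
      consumer: 'alerts',
      schedule: { interval: '1m' },
      notify_when: 'onActiveAlert',
      actions: [],
      // Assumed params for an index-threshold rule; adjust for your data.
      params: {
        index: ['load-test-data'],
        timeField: '@timestamp',
        aggType: 'count',
        groupBy: 'all',
        thresholdComparator: '>',
        threshold: [0],
        timeWindowSize: 5,
        timeWindowUnit: 'm',
      },
    }),
  });
  if (!res.ok) throw new Error(`rule ${i} failed: ${res.status} ${await res.text()}`);
}

async function main() {
  for (let i = 0; i < 100; i++) await createRule(i);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```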

EricDavisX added the bug and performance labels on Mar 18, 2022
botelastic bot added the needs-team label on Mar 18, 2022
@EricDavisX
Contributor Author

I can copy in some additional research as posted in Slack, if desired, but I could also let Tyler, Mikhail, and the Response Ops team members cite their own latest findings. @kobelb @mshustov @danielmitterdorfer FYI

@EricDavisX
Contributor Author

From Jonathan B:
I can't speak for all of the CI metrics, but for package testing we first noticed this on March 3rd. It doesn't collect metrics, but there's some useful info in the build history.

Jonathan Budzenski 3 days ago
Mar 03 19:57:57 packer-virtualbox-iso-1561908076 kibana[2313]: [2022-03-03T19:57:57.109+00:00][INFO ][plugins.ruleRegistry] Installed resources for index .alerts-observability.uptime.alerts
Mar 03 19:57:57 packer-virtualbox-iso-1561908076 kibana[2313]: [2022-03-03T19:57:57.351+00:00][INFO ][plugins.ruleRegistry] Installed resources for index .preview.alerts-security.alerts
Mar 03 19:58:03 packer-virtualbox-iso-1561908076 kibana[2313]: <--- Last few GCs --->
Mar 03 19:58:03 packer-virtualbox-iso-1561908076 kibana[2313]: [2313:0x5170bb0] 28312 ms: Mark-sweep 477.4 (509.6) -> 475.3 (508.1) MB, 673.1 / 0.1 ms (average mu = 0.121, current mu = 0.008) allocation failure GC in old space requested
Mar 03 19:58:03 packer-virtualbox-iso-1561908076 kibana[2313]: [2313:0x5170bb0] 29291 ms: Mark-sweep 481.1 (509.8) -> 475.9 (508.6) MB, 968.9 / 0.1 ms (average mu = 0.062, current mu = 0.011) allocation failure scavenge might not succeed
Mar 03 19:58:03 packer-virtualbox-iso-1561908076 kibana[2313]: <--- JS stacktrace --

Jonathan Budzenski 3 days ago
https://buildkite.com/elastic/kibana-package-testing/builds/180#292b4f9b-a267-4d30-a768-211992b9407c

Jonathan Budzenski 3 days ago
coincides with #124816
#124816 [Security Solution] Initial version of the Cloud Security Posture app…
Summary: The initial version of the Cloud Security Posture application, hidden behind an advanced setting.
Issue: #124806
Labels: release_note:skip, backport:skip, v8.2.0, Team:Cloud Security Posture

Jonathan Budzenski 3 days ago
i'll see what happens if i revert
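
For anyone reproducing this locally, a small sketch (not part of the package tests; purely illustrative) that logs GC pauses and heap usage via Node's perf_hooks can make the pressure visible before the process actually hits the OOM shown in the trace above:

```ts
// Small sketch (not from the package tests): log every GC pause and the heap
// size afterwards, using Node's built-in 'gc' performance entries.
import { PerformanceObserver } from 'perf_hooks';

const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    const { heapUsed, heapTotal } = process.memoryUsage();
    console.log(
      `gc pause ${entry.duration.toFixed(1)} ms, ` +
        `heap ${Math.round(heapUsed / 1e6)}/${Math.round(heapTotal / 1e6)} MB`
    );
  }
});

// Start receiving a performance entry for every garbage collection.
obs.observe({ entryTypes: ['gc'] });
```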

@EricDavisX
Contributor Author

From Mikhail Shustov, Tyler and Patrick:
@tylersmalley Do you know why APM metrics collected on CI on the 8.1.0 branch have a gap starting from 08 March?
I wonder why the 8.1 snapshot (latest) in our test didn’t seem to follow that pattern
Maybe the workload during your testing is different from the workload performed at CI? Can you see the difference in the memory size before running your test suite?
👀

Patrick Mueller 1 day ago
Regarding the Rx/status stuff I noticed in the heap snapshots, I also noticed this PR got merged two weeks ago: #126320, and poking around a bit, also this PR: #126334. And I’ve recently noticed the status messages getting a little twitchy; I got this when using curl to create a new connector and then a new rule using that connector:
[... 14:22:20.123][INFO ][status] 1 plugins changed status from 'available' to 'unavailable': alerting
[... 14:22:20.374][INFO ][status] 1 plugins changed status from 'unavailable' to 'available': alerting
[... 14:22:21.965][INFO ][status] 1 plugins changed status from 'available' to 'unavailable': licensing
[... 14:22:22.395][INFO ][status] 1 plugins changed status from 'unavailable' to 'available': licensing
[... 14:22:35.487][INFO ][status] 1 plugins changed status from 'available' to 'unavailable': fleet
[... 14:22:35.737][INFO ][status] 1 plugins changed status from 'unavailable' to 'available': fleet
#126320 [Status service] log plugin status changes
#126334 PluginsStatusService: reuse plugin status observables

Patrick Mueller 1 day ago
Maybe the workload during your testing is different from the workload performed at CI?
Yeah, it’s certainly different. So one possibility is that our workload is now (indirectly) pulling in more code/processing than it did previously. If the previous versions had also touched the code that it’s (presumably) now touching, we wouldn’t see the difference. But the bulk of the difference I saw in the heap snapshots was mostly Observable related.

Tyler Smalley 1 day ago
@mikhail.shustov, that's when 8.1.0 was released, so the 8.1 branch was bumped to 8.1.1. To make it easier, you could compare labels.branch.
👍

@EricDavisX
Contributor Author

@jbudz did you find anything when you did the revert you were trying?

@mshustov
Contributor

You can find a couple of spikes on the RSS chart of the APM data collected on Kibana CI (around the end of January and on March 14).
[Screenshot: RSS chart, 2022-03-18_15-25-03]
You can find the APM data here.

@jbudz
Member

jbudz commented Mar 18, 2022

@jbudz did you find anything when you did the revert you were trying?

No, that wasn't the root cause. It still had a similar memory profile.

jbudz added the Team:Core label on Mar 18, 2022
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

botelastic bot removed the needs-team label on Mar 18, 2022
jbudz added the blocker label on Mar 18, 2022
@jbudz
Member

jbudz commented Mar 18, 2022

I added the blocker label for 8.2.0. Kibana's pretty close to OOM on a fresh cloud installation.

@mshustov
Contributor

mshustov commented Mar 21, 2022

I checked the memory snapshot with #126320 reverted. There are still a lot of Observable objects.
The status service seems to be the root cause of the problem. I made a heap snapshot on #128096 with the dev version of Kibana.

As you can see, there are quite a few RxJS objects created.
v8.2
[Screenshot: heap snapshot, 2022-03-21_18-14-44]
vs. #128096
[Screenshot: heap snapshot, 2022-03-21_18-14-56]
There are too many MapSubscriber instances to analyze their origin, but many of the SwitchMapSubscriber instances point to PluginsStatusService.

[Screenshot: 2022-03-21_18-21-44]

I noticed that this logic (which creates an observable) is called ~22,000 times on start: https://github.com/elastic/kibana/blob/main/src/core/server/status/plugins_status.ts#L132-L142
I'm not familiar enough with the service. Could someone from @elastic/kibana-core take a look at the problem, please?
If you are interested in analyzing the heap dump, you can use the built-in Node.js functionality: https://nodejs.org/api/cli.html#--heapsnapshot-signalsignal
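
To illustrate the shape of the problem (a toy sketch, not the actual PluginsStatusService code): creating a fresh derived observable per caller multiplies Map/SwitchMap subscribers, whereas building the observable once and sharing it keeps a single upstream subscription:

```ts
// Toy sketch, not Kibana's PluginsStatusService: why per-call observable
// creation piles up subscribers, and how sharing avoids it.
import { BehaviorSubject, Observable } from 'rxjs';
import { map, switchMap, shareReplay } from 'rxjs/operators';

interface PluginStatus {
  level: 'available' | 'degraded' | 'unavailable';
}

const coreStatus$ = new BehaviorSubject<PluginStatus>({ level: 'available' });
const depsStatus$ = new BehaviorSubject<PluginStatus[]>([]);

// Anti-pattern: a brand-new chain per call. Invoked ~22,000 times on start,
// each chain retains its own MapSubscriber/SwitchMapSubscriber on the heap
// for as long as anything stays subscribed.
function derivedStatusPerCall(): Observable<PluginStatus> {
  return coreStatus$.pipe(
    switchMap((core) =>
      depsStatus$.pipe(
        map((deps) =>
          deps.some((d) => d.level !== 'available') ? { level: 'degraded' as const } : core
        )
      )
    )
  );
}

// Mitigation sketch (in the spirit of reusing plugin status observables):
// build the derived observable once and share it, so every consumer reuses a
// single upstream subscription instead of creating its own chain.
const derivedStatus$: Observable<PluginStatus> = coreStatus$.pipe(
  switchMap((core) =>
    depsStatus$.pipe(
      map((deps) =>
        deps.some((d) => d.level !== 'available') ? { level: 'degraded' as const } : core
      )
    )
  ),
  shareReplay({ bufferSize: 1, refCount: true })
);
```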

@mshustov
Contributor

Reverting #126320 appears to fix this.

Maybe we can merge #128096 to check whether RSS is back to normal in the APM data gathered from CI: https://kibana-ci-apm.kb.us-central1.gcp.cloud.es.io:9243/goto/ac99b0f0-a93f-11ec-94a3-2d0a38b40710

@jbudz
Member

jbudz commented Mar 21, 2022

I pulled #128096 out of draft and added core as a reviewer - whatever's easiest to help investigate. Feel free to merge.

@mshustov
Contributor

I pulled #128096 out of draft and added core as a reviewer

The PR has been merged. I can see the drop in the APM metrics, but it doesn't seem we are back to normal.
3-month trend:
[Screenshot: 2022-03-22_11-51-52]

2-day trend:
[Screenshot: 2022-03-22_11-54-28]

@EricDavisX
Contributor Author

EricDavisX commented Mar 22, 2022

[Edited] - I misposted earlier, sorry for any confusion. I'm not sure how to test this pre-merge, but I will be watching for it afterwards.

@mshustov
Contributor

@gsoldevila is working on improving the plugin status service's memory footprint. I will test his PR and repeat the heap snapshot analysis.

exalate-issue-sync bot added the impact:critical and loe:medium labels on Mar 24, 2022
@mshustov
Contributor

mshustov commented Mar 28, 2022

@kobelb With #128324 merged, we can remove the blocker status. With the PR merged, consumed RSS memory was reduced by almost half, and event loop delay improved by nearly 2.5x.
One-week dashboard:
[Screenshot: 2022-03-28_09-20-02]

According to the APM metrics collected on CI, memory usage has rolled back to its all-time lows since January 12, when APM integration on CI was introduced.
Three-month dashboard:
[Screenshot: 2022-03-28_09-24-49]
[Screenshot: 2022-03-28_09-25-03]

@EricDavisX
Contributor Author

Among others, I'm sure, I'm waiting for a successful platform build to produce a new 8.2 snapshot; then a nightly run of the kbn-alert-load tool will pick up the change and show us the improvement (again). Very nice work!

afharo mentioned this issue on Mar 31, 2022
@lukeelmers
Member

We believe the root cause has been addressed in #128324, so we will go ahead and close this issue.

@EricDavisX Feel free to reopen or open a new issue if you find there is something we have missed!
