
[Alerting + Task Manager] Benchmarking 7.14 #95194

Closed
gmmorris opened this issue Mar 23, 2021 · 6 comments
Assignees
Labels
  • estimate:needs-research (Estimated as too large and requires research to break down into workable issues)
  • Feature:Alerting
  • Feature:Task Manager
  • resilience (Issues related to Platform resilience in terms of scale, performance & backwards compatibility)
  • Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@gmmorris
Contributor

gmmorris commented Mar 23, 2021

Following the release of 7.14, and with it the fresh perf work we've done, we decided we need to do the following:

  1. Run fresh and comprehensive performance tests on the 7.14 release
  2. Use the results of the perf tests as a basis for a "sizing your Kibana cluster for alerting" blogpost, similar to https://www.elastic.co/blog/benchmarking-and-sizing-your-elasticsearch-cluster-for-logs-and-metrics

Note (2021-08-09): we split item 2 off into a separate issue: [alerting] public facing doc on alerting performance #107979

@gmmorris gmmorris added Feature:Alerting Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Mar 23, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@gmmorris
Contributor Author

gmmorris commented Apr 7, 2021

Following the latest RAC leads sync (held 6th April), I think we should bump this to a later point so we can include Alerts-as-Data in the benchmarks.

This does raise some questions about how we want to measure our API performance. Should we add timing measurements for the various APIs that rely on Alerts-as-Data, such as the APIs that fetch alerts for the alerts table?
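
For concreteness, here's a minimal sketch of what such a timing measurement could look like, written as a standalone Node 18+ script (for global fetch) rather than anything inside Kibana; the endpoint path, request body, and function name are hypothetical placeholders for whichever Alerts-as-Data APIs we end up measuring:

```ts
import { performance } from 'perf_hooks';

interface TimingSample {
  endpoint: string;
  status: number;
  durationMs: number;
}

// Time a single call to an alerts API. The endpoint path and request body are
// hypothetical placeholders, not a documented Kibana API.
async function timeAlertsApi(kibanaUrl: string, apiKey: string): Promise<TimingSample> {
  const endpoint = '/internal/rac/alerts/find'; // placeholder: substitute the API backing the alerts table
  const start = performance.now();
  const res = await fetch(`${kibanaUrl}${endpoint}`, {
    method: 'POST',
    headers: {
      'kbn-xsrf': 'true', // required by Kibana for non-GET requests
      'content-type': 'application/json',
      authorization: `ApiKey ${apiKey}`,
    },
    body: JSON.stringify({ query: { match_all: {} }, size: 100 }),
  });
  await res.text(); // include reading the response body in the measurement
  return { endpoint, status: res.status, durationMs: performance.now() - start };
}
```

Reading the response body is included in the measurement on purpose, so large alert result sets count toward the reported latency.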

@gmmorris gmmorris changed the title [Alerting + Task Manager] Benchmarking 7.12 and blogpost [Alerting + Task Manager] Benchmarking 7.13 and blogpost May 4, 2021
@gmmorris
Contributor Author

gmmorris commented May 4, 2021

It looks like ES have been able to address the issue we identified in UpdateByQuery: elastic/elasticsearch#63671

We expect this to address the degradation we've identified when there are more than 30 or so Kibana instances, so it's something we should validate as part of this issue.
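
For context, Task Manager's claim cycle is built on updateByQuery, which is why that ES change matters here. Below is a minimal sketch of the shape of such a call using the 7.x Elasticsearch JS client; the index, query, and painless script are simplified stand-ins, not the actual Task Manager claim implementation:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' }); // adjust for your cluster

// Simplified stand-in for a claim cycle: mark a batch of idle tasks as
// claimed by this Kibana instance. Field names and the claim condition are
// illustrative only.
async function claimIdleTasks(ownerId: string) {
  return client.updateByQuery({
    index: '.kibana_task_manager',
    conflicts: 'proceed', // multiple Kibana instances race on the same documents
    refresh: true,
    body: {
      query: { term: { 'task.status': 'idle' } },
      script: {
        lang: 'painless',
        source: "ctx._source.task.status = 'claiming'; ctx._source.task.ownerId = params.ownerId;",
        params: { ownerId },
      },
    },
  });
}
```

The `conflicts: 'proceed'` part is the relevant bit for scaling: with many Kibana instances polling the same index, version conflicts on the same task documents are expected and have to be tolerated rather than treated as errors.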

@gmmorris gmmorris changed the title [Alerting + Task Manager] Benchmarking 7.13 and blogpost [Alerting + Task Manager] Benchmarking 7.14 and blogpost Jul 1, 2021
@gmmorris
Contributor Author

gmmorris commented Jul 1, 2021

Bumped to 7.14 😬

@gmmorris gmmorris added the loe:needs-research This issue requires some research before it can be worked on or estimated label Jul 6, 2021
@gmmorris gmmorris added the resilience Issues related to Platform resilience in terms of scale, performance & backwards compatibility label Jul 15, 2021
@pmuellr pmuellr self-assigned this Aug 3, 2021
@pmuellr pmuellr changed the title [Alerting + Task Manager] Benchmarking 7.14 and blogpost [Alerting + Task Manager] Benchmarking 7.14 Aug 9, 2021
@pmuellr
Member

pmuellr commented Aug 9, 2021

The stress tester needed some updates, as ecctl has changed a bit since the last time we ran the tester, and a few enhancements have been added:

commits:

changes:

  • added 7.14 as the top-level version
  • changed alerts from running at a 1m interval to a 3s interval, with a
    similar change to decrease the number of rules we actually create; the
    thinking is that we can simulate more alerts by running them at a
    smaller interval (see the sketch after this list)
  • increased the wait for deployments to finish creation from 5 minutes to 10
  • changed how the kibana config is set (change to the ecctl data shape)
  • now finds the closest match to the platform's choice of RAM usage
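
To make the interval change above concrete, here's a back-of-the-envelope helper showing why far fewer rules at a 3s interval generate the same execution load as a large rule count at a 1m interval (the function name is illustrative, not part of the stress tester):

```ts
// Rule executions per minute = ruleCount * (60 / intervalSeconds).
// Holding that constant, compute how many rules the new interval needs.
function rulesNeededForSameLoad(
  originalRuleCount: number,
  originalIntervalSeconds: number,
  newIntervalSeconds: number
): number {
  const executionsPerMinute = originalRuleCount * (60 / originalIntervalSeconds);
  return Math.ceil(executionsPerMinute / (60 / newIntervalSeconds));
}

// e.g. 1000 rules at a 1m interval generate the same load as 50 rules at 3s
console.log(rulesNeededForSameLoad(1000, 60, 3)); // 50
```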

@pmuellr
Member

pmuellr commented Aug 9, 2021

Here are three runs I made with the tester; I didn't see any significant regressions, nor any significant increase in performance, as expected.

@pmuellr pmuellr closed this as completed Aug 9, 2021
@gmmorris gmmorris added the estimate:needs-research Estimated as too large and requires research to break down into workable issues label Aug 18, 2021
@gmmorris gmmorris removed the loe:needs-research This issue requires some research before it can be worked on or estimated label Sep 2, 2021
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022