
Regularly benchmarking and stress-testing the alerting framework and rule types #119845

Open
13 of 24 tasks
mikecote opened this issue Nov 29, 2021 · 8 comments
Labels
estimate:needs-research · Feature:Alerting/RulesFramework · Feature:Alerting/RuleTypes · Meta · Team:ResponseOps

Comments

@mikecote
Contributor

mikecote commented Nov 29, 2021

The alerting system must be regularly benchmarked and stress-tested before every production release, preferably in environments that mirror known complex customer deployments. Benchmarking and comparing key health metrics across runs ensures we do not introduce regressions.

There are various ongoing performance-testing and framework/tool creation efforts that relate to Kibana. Some research has been done to weigh the pros/cons and applicability of each so we can invest where we see the best value proposition balanced with the quickest ROI. As research continues, it seems clear we'll plan to extend one or more tools or frameworks into a given solution. So, while we may start with one tool as an incremental first step or starting point, we're developing this against a set of requirements, foremost.

Front-runner for the starting-point tool/library: the Kibana Alerting team / ResponseOps kbn-alert-load alert/rule testing tool

  • This repo is known to be forked and currently used/developed by several Security side team members; we will research and sync on its current state and capabilities.
    (See below for options that were declined for now.)

Here are some of the WIP Requirements we are evaluating and building out:

  • enables the team to catch some types of performance regressions within 24 hours of merging
  • modular with respect to the distinct execution elements:
    - cluster creation or attachment to an existing cluster
      - spin up the needed environments of a specific configuration size in a viable cloud service that facilitates a performance / stress test
      - allow connecting to a self-managed cluster, which makes it faster to assess a developer change locally
  • data load / continuing ingest at varying scales (what data do we need here, and what tool do we use to generate it?) - a synthetic-ingest sketch follows this list
  • test-setup options: execute a configurable number of set API calls, looped and parameterized (like creating rules) - a rule-creation sketch follows this list
  • option to let Kibana / the cluster run indefinitely or for a set number of minutes (the latter is currently hardcoded)
  • continuous monitoring and capture of desired metrics:
    - start and end time of the rule executions
    - metrics to evaluate their potential drift
    - overall Kibana memory usage / CPU usage stats
    - overall cluster and Kibana health stats (some will be in the event log, cluster health will not be; we need to itemize this)
    - overall health of rule execution (none fail unintentionally)
  • integrated with a CI system for nightly (if not more frequent) runs (prototype done in Jenkins, not Kibana Buildkite, FYI)
  • Slack channel output of results from the test-run assessment at the end of CI (selectable Slack channel)
  • entails an automated pass/fail assessment of performance (relative comparison or fixed data points? including a health + errors review)
    - the automated assessment must be left optional, allowing other teams to incrementally adopt usage
    - option to enable API calls during the test (and a pass/fail metric on whether they remain within a set performance threshold)
    - review of the Kibana log for unexpected errors (a grep + pass/fail mark)
  • option to perform, or not perform, environment clean-up (clean-up is the default; this requirement relates to re-using environments)
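
For the data-load / continuing-ingest item above, here is a minimal sketch of what continuous synthetic ingest could look like, driving documents into Elasticsearch's _bulk API so rule queries have fresh data to evaluate. The ES_URL, credentials, index name, and document shape are assumptions for illustration, not part of any existing tool.

```ts
// Minimal sketch (assumption-laden): continuously bulk-index synthetic documents so
// that stress-test rules have data to evaluate. Node 18+ is assumed for global fetch.
const ES_URL = process.env.ES_URL ?? 'http://localhost:9200';
const AUTH = 'Basic ' + Buffer.from('elastic:changeme').toString('base64'); // placeholder credentials

async function ingestBatch(index: string, docCount: number): Promise<void> {
  const lines: string[] = [];
  for (let i = 0; i < docCount; i++) {
    lines.push(JSON.stringify({ index: { _index: index } }));
    lines.push(
      JSON.stringify({
        '@timestamp': new Date().toISOString(),
        host: `host-${i % 50}`,     // hypothetical field
        value: Math.random() * 100, // hypothetical field
      })
    );
  }
  const res = await fetch(`${ES_URL}/_bulk`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-ndjson', Authorization: AUTH },
    body: lines.join('\n') + '\n', // _bulk requires a trailing newline
  });
  if (!res.ok) throw new Error(`bulk ingest failed: ${res.status} ${await res.text()}`);
}

// Ingest a batch every few seconds for the duration of the test run.
setInterval(() => ingestBatch('stress-test-data', 1_000).catch(console.error), 5_000);
```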
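
For the test-setup item above (looped, parameterized API calls such as creating rules), here is a minimal sketch of a rule-creation loop against Kibana's alerting HTTP API (POST /api/alerting/rule). The URL, credentials, rule parameters, and counts are assumptions for illustration; exact request fields vary by Kibana version.

```ts
// Minimal sketch (assumption-laden): create N index-threshold rules through the
// Kibana alerting HTTP API to set up a stress-test scenario. Node 18+ assumed.
const KIBANA_URL = process.env.KIBANA_URL ?? 'http://localhost:5601';
const AUTH = 'Basic ' + Buffer.from('elastic:changeme').toString('base64'); // placeholder credentials

async function createRules(count: number, intervalSec: number): Promise<void> {
  for (let i = 0; i < count; i++) {
    const res = await fetch(`${KIBANA_URL}/api/alerting/rule`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'kbn-xsrf': 'true', // required for Kibana API writes
        Authorization: AUTH,
      },
      body: JSON.stringify({
        name: `stress-rule-${i}`,
        rule_type_id: '.index-threshold',
        consumer: 'alerts',
        schedule: { interval: `${intervalSec}s` },
        notify_when: 'onActionGroupChange',
        params: {
          // hypothetical index-threshold params; point at whatever the ingest step writes
          index: ['stress-test-data'],
          timeField: '@timestamp',
          aggType: 'count',
          groupBy: 'all',
          thresholdComparator: '>',
          threshold: [1000],
          timeWindowSize: 5,
          timeWindowUnit: 'm',
        },
        actions: [],
      }),
    });
    if (!res.ok) throw new Error(`rule ${i} failed: ${res.status} ${await res.text()}`);
  }
}

createRules(Number(process.env.RULE_COUNT ?? '100'), 60).catch((err) => {
  console.error(err);
  process.exit(1);
});
```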

Stretch / next goals:

  • Confirm/enable the tool to allow testing over different rule-type needs (some WIP by the Security team)
  • Confirm/enable the tool to allow testing over Cases needs
  • Confirm/enable the tool to allow testing over one or more third-party connector needs (bulk updates, etc.)
    • focus on the email connector next?

FYI: Frameworks/Tools that have been researched and ruled out for immediate purposes:

  1. The Kibana QA team created an API load-testing tool, kibana-load-testing. Patrick M researched it in 2020 and the Alerting/Rules team did not end up collaborating on it; it drives the Kibana HTTP API and so isn't best suited to assess the (background-process) Task Manager at the moment (see the Task Manager health sketch below).

  2. The Kibana performance working group's upcoming tool (including folks like Spencer A / Tyler S / Daniel M / Liza K): they are discussing and working on a performance-testing tool and CI integration for Kibana needs.

  • Eric is bringing requirements / context and generally participating in the Kibana Perf Working Group (v2) to benefit both groups.
  • Their timeline for focusing on Kibana Task Manager-centric automation support is cited as TBD; the UI is where they are investing first (as of Feb 2022). This is partly because the kbn-alert-load tool exists and is sufficient for teams, based on its usage.
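
Because the alerting framework runs on Task Manager rather than through the HTTP API, any of these tools also needs visibility into Task Manager itself during a run. Kibana exposes a Task Manager health endpoint (GET /api/task_manager/_health); below is a minimal polling sketch, with the URL, credentials, and the particular stats pulled from the response chosen purely for illustration.

```ts
// Minimal sketch (assumption-laden): poll Kibana's Task Manager health endpoint during a
// stress-test run and log a few high-level stats so drift can be spotted over time.
const KIBANA_URL = process.env.KIBANA_URL ?? 'http://localhost:5601';
const AUTH = 'Basic ' + Buffer.from('elastic:changeme').toString('base64'); // placeholder credentials

async function pollTaskManagerHealth(): Promise<void> {
  const res = await fetch(`${KIBANA_URL}/api/task_manager/_health`, {
    headers: { Authorization: AUTH },
  });
  if (!res.ok) throw new Error(`task manager health check failed: ${res.status}`);
  const health = await res.json();
  // Overall status plus the workload section (task counts by type, overdue tasks, etc.)
  console.log(
    new Date().toISOString(),
    health.status,
    JSON.stringify(health.stats?.workload?.value ?? {})
  );
}

setInterval(() => pollTaskManagerHealth().catch(console.error), 30_000);
```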
@mikecote mikecote added the Team:ResponseOps, Feature:Alerting/RulesFramework, Feature:Alerting/RuleTypes, and estimate:needs-research labels Nov 29, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote mikecote added the Meta label Dec 1, 2021
@mikecote mikecote removed their assignment Dec 2, 2021
@alexfrancoeur

alexfrancoeur commented Dec 6, 2021

Dropping this in here, but if we aren't already talking to the rally team, we may be able to use the dataset from these upcoming tracks: elastic/rally-tracks#222, elastic/apm-server#6731

@mikecote
Contributor Author

mikecote commented Jan 17, 2022

I will remove this issue (and assignees) from our iteration plan for now, as we would like @EricDavisX to pick this up in the coming weeks, building on the research done so far.

@mikecote mikecote moved this from In Progress to Todo in AppEx: ResponseOps - Execution & Connectors Jan 17, 2022
@kobelb kobelb added the needs-team label Jan 31, 2022
@botelastic botelastic bot removed the needs-team label Jan 31, 2022
@EricDavisX EricDavisX self-assigned this Feb 3, 2022
@EricDavisX
Contributor

EricDavisX commented Feb 7, 2022

I'm researching this and hoping to finish evaluating what the ResponseOps and Security side teams have done with the tool in the next few days. With that done, I'll be able to come up with a list of requirements and a modest plan for what I'll do next here.

@EricDavisX
Contributor

Still researching the kbn-alert-load tool - thanks all for the help. I'm also finishing a first draft of a requirements document that QA will assess (with Engineering too); then we'll form a plan and adjust the bullet points above.

@EricDavisX
Contributor

The MLR-QA team is wrapping up a prototype Jenkins job to run the kbn-alert-load tool (while the Security team has a prototype done in Buildkite, FYI). I'll post details in Slack for the ResponseOps team.

@EricDavisX
Contributor

An update on where we are: we did a proof of concept in Jenkins and have decided to continue iterating on it from the machine-learning-qa-infra Jenkins server.

We've enhanced the Jenkins run to always delete the ecctl deployments. We'll continue updating this periodically with progress.

@EricDavisX EricDavisX removed their assignment May 6, 2022
@EricDavisX
Contributor

We have achieved an MVP that includes the checked metrics above. It runs nightly against several versions via cloud (CFT region) and reports pass/fail into our Slack channel; a sketch of what such a report could look like follows. I'm going to focus on other work, though I may help drive QA to implement a few small remaining low-hanging-fruit items.
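
For reference, a minimal sketch of the pass/fail + Slack report shape: summarize rule executions from the Kibana event log (.kibana-event-log-* documents with event.provider: alerting and event.action: execute) and post the verdict to a Slack incoming webhook. The thresholds, direct index access, and the SLACK_WEBHOOK_URL variable are assumptions for illustration, not a description of the actual MVP job.

```ts
// Minimal sketch (assumption-laden): derive a pass/fail verdict from rule-execution events
// in the Kibana event log and post it to a Slack incoming webhook. Node 18+ assumed.
const ES_URL = process.env.ES_URL ?? 'http://localhost:9200';
const AUTH = 'Basic ' + Buffer.from('elastic:changeme').toString('base64'); // placeholder credentials
const SLACK_WEBHOOK_URL = process.env.SLACK_WEBHOOK_URL ?? ''; // hypothetical env var

async function reportRun(sinceIso: string): Promise<void> {
  const res = await fetch(`${ES_URL}/.kibana-event-log-*/_search`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Authorization: AUTH },
    body: JSON.stringify({
      size: 0,
      query: {
        bool: {
          filter: [
            { term: { 'event.provider': 'alerting' } },
            { term: { 'event.action': 'execute' } },
            { range: { '@timestamp': { gte: sinceIso } } },
          ],
        },
      },
      aggs: {
        avg_duration_ns: { avg: { field: 'event.duration' } }, // event.duration is in nanoseconds
        outcomes: { terms: { field: 'event.outcome' } },
      },
    }),
  });
  if (!res.ok) throw new Error(`event log query failed: ${res.status}`);
  const body = await res.json();
  const avgMs = (body.aggregations.avg_duration_ns.value ?? 0) / 1e6;
  const failures =
    body.aggregations.outcomes.buckets.find((b: { key: string }) => b.key === 'failure')?.doc_count ?? 0;
  const pass = failures === 0 && avgMs < 5_000; // illustrative thresholds only

  await fetch(SLACK_WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `alerting stress test: ${pass ? 'PASS' : 'FAIL'} (avg execute ${avgMs.toFixed(0)} ms, ${failures} failed executions)`,
    }),
  });
}

// Summarize the last hour of executions.
reportRun(new Date(Date.now() - 60 * 60 * 1000).toISOString()).catch(console.error);
```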
