
[Alerting] Performance testing tool for alerting needs the ability to create/clean ecctl Kibana deployments #121457

Closed
YulNaumenko opened this issue Dec 16, 2021 · 14 comments
Labels
impact:needs-assessment (Product and/or Engineering needs to evaluate the impact of the change)
Team:Operations (Team label for Operations Team)
Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@YulNaumenko
Contributor

YulNaumenko commented Dec 16, 2021

Describe the feature:
Based on the requirements for the kbn-alert-load performance testing tool, the following configuration/deployment abilities are needed:

  1. Create a special test account at https://cloud.elastic.co
  2. Create an API key at the cloud site for use with ecctl
  3. Deploy Kibana via ecctl (https://www.elastic.co/guide/en/ecctl/current/ecctl-installing.html) when the testing job is run
  4. Configure ecctl for this installation with ecctl init, using the proper API key (see the sketch after this list)
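Taken together, the create/clean cycle for a test run could be driven from Node roughly as sketched below. This is only a sketch: it assumes ecctl is installed and authenticated (via ecctl init or, for CI, an EC_API_KEY environment variable), and flags such as --file, --output json, and --force should be verified against the ecctl docs linked above.

```ts
import { execFileSync } from 'child_process';

// Thin wrapper around the ecctl CLI. Flag names here are assumptions to
// double-check against `ecctl --help`.
function ecctl(...args: string[]): string {
  return execFileSync('ecctl', [...args, '--output', 'json'], { encoding: 'utf8' });
}

// Create a deployment for a test run from a deployment definition file.
// The create response is assumed to contain a top-level `id`.
function createTestDeployment(definitionFile: string): string {
  return JSON.parse(ecctl('deployment', 'create', '--file', definitionFile)).id;
}

// Clean the deployment up once the run is done (or has failed).
function cleanTestDeployment(id: string): void {
  // --force is assumed to skip the interactive confirmation prompt.
  ecctl('deployment', 'shutdown', id, '--force');
}
```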

Describe a specific use case for the feature:
We are planning to make this tool part of x-pack/tests/performance/alerting. The initial idea was to use Buildkite to run that performance testing job every night.
In addition, we want to set up a Slack channel to post failures of this job (ideally, once issue #1 is done, the only failures would be regression failures where the performance of a build fails to meet the specified metric). We also need to ensure that all deployments are cleaned up at the end of the job, regardless of whether it succeeds or fails.

cc: @ymao1 and @pmuellr

@YulNaumenko added the Team:Operations and Team:ResponseOps labels Dec 16, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-operations (Team:Operations)

@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@ymao1
Contributor

ymao1 commented Dec 16, 2021

Thanks for opening the issue @YulNaumenko! The requirements for this may change as we evaluate whether we can leverage the QA environment in ML for automated performance testing. I will update this issue as we learn more, but for now, no immediate action is needed.

@tylersmalley
Contributor

Thanks for opening the issue. A few questions:

  • Is there a timeline you're looking for to have this up and running?
  • Is the expectation that it uses a snapshot of Kibana in the Cloud-First Testing region?

@pmuellr
Member

pmuellr commented Dec 16, 2021

So many QA environments to (potentially) choose from! :-) :elastic-heart:

Some thoughts on this, since I haven't given it a whole lot of thought before.

If we want to use kbn-alert-load, or something similar to it, then what we'll actually want is just a plain old node runtime env. We could even pull the source for whatever we're going to run, do the yarn install there, and then run the node app. That node app would use ecctl (or make the equivalent http calls to the cloud api endpoint) to deploy the stacks for testing.
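For the "equivalent http calls" variant, here is a minimal sketch, assuming Node 18+ (global fetch), an API key in EC_API_KEY, and the public Cloud API at https://api.elastic-cloud.com/api/v1; endpoint paths and payload shape should be confirmed against the Cloud API docs before relying on this.

```ts
// Assumes Node 18+ (global fetch) and an API key in EC_API_KEY.
const API = 'https://api.elastic-cloud.com/api/v1';
const headers = {
  Authorization: `ApiKey ${process.env.EC_API_KEY}`,
  'Content-Type': 'application/json',
};

// Create a deployment from a prepared deployment definition and return its id.
async function createDeployment(definition: unknown): Promise<string> {
  const res = await fetch(`${API}/deployments`, {
    method: 'POST',
    headers,
    body: JSON.stringify(definition),
  });
  if (!res.ok) throw new Error(`deployment create failed: ${res.status}`);
  return (await res.json()).id;
}

// Shut the test deployment down when the run is finished.
async function shutdownDeployment(id: string): Promise<void> {
  const res = await fetch(`${API}/deployments/${id}/_shutdown`, { method: 'POST', headers });
  if (!res.ok) throw new Error(`deployment shutdown failed: ${res.status}`);
}
```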

In order to double-check that we don't leave orphan deployments (if the node app dies), I think we'd want a job that just checks for old deployments and complains somewhere we can see it, so we can delete them manually for now. That job could run maybe once an hour.
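Something like the sketch below could serve as that hourly sweeper. It assumes test deployments follow a naming convention (the alert-load- prefix here is hypothetical) and that `ecctl deployment list --output json` returns a `deployments` array with `id` and `name` fields; both assumptions should be verified.

```ts
import { execFileSync } from 'child_process';

const TEST_PREFIX = 'alert-load-'; // hypothetical naming convention for test deployments

// List deployments that look like they were created by the load-testing tool.
function listTestDeployments(): Array<{ id: string; name: string }> {
  const raw = execFileSync('ecctl', ['deployment', 'list', '--output', 'json'], {
    encoding: 'utf8',
  });
  const { deployments = [] } = JSON.parse(raw);
  return deployments.filter((d: { name: string }) => d.name.startsWith(TEST_PREFIX));
}

// Run on a schedule: complain somewhere visible if anything is left over,
// and let a human delete it manually for now.
const leftovers = listTestDeployments();
if (leftovers.length > 0) {
  console.error(
    'possible orphan deployments:',
    leftovers.map((d) => `${d.name} (${d.id})`).join(', ')
  );
  process.exitCode = 1;
}
```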

We would certainly want to be testing against a snapshot of some kind. In recent history I've only tested BC versions in the Belgium GCP region, and I've never had good luck testing snapshots, but maybe that was just bad timing. And we'll also be testing against older versions.

If we can work with snapshots, what's the frequency of those?

@pmuellr
Member

pmuellr commented Dec 16, 2021

> If we want to use kbn-alert-load, or something similar to it, then what we'll actually want is just a plain old node runtime env. We could even pull the source for whatever we're going to run, do the yarn install there, and then run the node app. That node app would use ecctl (or make the equivalent http calls to the cloud api endpoint) to deploy the stacks for testing.

Since we're talking about where to move kbn-alert-load to (right now it's here: https://github.com/pmuellr/kbn-alert-load), I happened to think that it may be more cumbersome to have this code in the Kibana repo, compared to having it in its own small repo. Especially for purposes like this, where we'd have to have a Kibana build or check Kibana out of git just to run the tool :-)

@tylersmalley
Contributor

Thanks for the additional information.

Currently, @jbudz is working on rolling out manual Cloud-First Testing to PRs, which I believe we could leverage pieces of for this. With that, adding the label ci:deploy-cloud will build the Cloud Docker image and deploy it to the gcp-us-west2 region. Subsequent pushes to the branch will rebuild and update the existing deployment. The deployment will be deleted after a specified period with no updates (~7-14 days), or when the PR is merged/closed.

This is what I am thinking for the daily pipeline based on my understanding of your needs:

  • Ensure there is no cluster named alert-load-testing; if one exists, delete it (a sketch of this check follows the list).
  • Build HEAD of main and deploy it to gcp-us-west2 with the name alert-load-testing.
  • Download/check out the kbn-alert-load tool and run it against the Cloud cluster (location of the tool subject to change).
  • Post any failures to a Slack channel (yet to be determined).
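For the first step, here is a rough sketch of the "delete it if it already exists" check, again assuming `ecctl deployment list --output json` returns a `deployments` array with `id`/`name` fields and that a --force flag skips the shutdown confirmation (both to be verified):

```ts
import { execFileSync } from 'child_process';

const DEPLOYMENT_NAME = 'alert-load-testing';

// Remove any leftover deployment from a previous run before deploying HEAD of main.
const { deployments = [] } = JSON.parse(
  execFileSync('ecctl', ['deployment', 'list', '--output', 'json'], { encoding: 'utf8' })
);
for (const d of deployments as Array<{ id: string; name: string }>) {
  if (d.name === DEPLOYMENT_NAME) {
    console.log(`deleting stale deployment ${d.name} (${d.id})`);
    execFileSync('ecctl', ['deployment', 'shutdown', d.id, '--force']);
  }
}
```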

@ymao1
Contributor

ymao1 commented Dec 17, 2021

@tylersmalley That sounds great and lines up with what we were hoping to have as well! We would want to run different scenarios with different rule configurations and possibly different Kibana configurations. Do you think it would make more sense to have a separate daily pipeline for each scenario (each deploying its own cluster), or to try to serialize the scenarios into one pipeline?

@tylersmalley
Contributor

tylersmalley commented Dec 17, 2021

@ymao1 do you have an idea of how long the tests would run for each configuration? If it's not long, I would say try to serialize them, as there is overhead to bringing up the cluster (~10 minutes). But if each is maybe more than 5-10 minutes, I think creating a separate pipeline would make sense. It's really not difficult to do in Buildkite. The discussion is mostly around resources and overall job time.

@kobelb added the needs-team label Jan 31, 2022
@botelastic bot removed the needs-team label Jan 31, 2022
@exalate-issue-sync bot added the impact:needs-assessment and loe:small labels Feb 16, 2022
@tylersmalley removed the loe:small and impact:needs-assessment labels Mar 16, 2022
@tylersmalley
Contributor

tylersmalley commented Mar 22, 2022

With Cloud-First Testing now live, where do you stand on your need for this? Do you need assistance creating a Buildkite pipeline?

@exalate-issue-sync bot added the impact:needs-assessment label Mar 22, 2022
@pmuellr
Member

pmuellr commented Mar 22, 2022

@EricDavisX - this is a slightly old issue we opened to look at automating some kbn-alert-load runs. I'm thinking the requirements outlined in the top comment are being / will be handled by the work you're currently doing? So we can probably close this?

@EricDavisX
Contributor

Hello - yes, I think we can close this. Here's an update on where we are tracking the pending work to support Rules-oriented performance tests, and where the jobs are now:

  • We've enhanced the Jenkins run to always delete the ecctl deployments; note that the standalone kbn-alert-load tool still suffers from that ailment.
  • Other enhancements are on the way and are being tracked in #119845.
  • There is an internal 'qa' issue as well; if anyone wants it, I can DM it.

@pmuellr
Member

pmuellr commented Mar 23, 2022

> We've enhanced the Jenkins run to always delete the ecctl deployments; note that the standalone kbn-alert-load tool still suffers from that ailment.

Probably not a bad idea to bake this into kbn-alert-load at this point. I think we'd want a signal handler to catch ctrl-c and similar terminations, which would then delete the deployments on an "unclean" exit as well (or at least most "unclean" exits).

I think the reason I had this in there was so I could debug the deployments if there were some issues, while I was developing this.

We could think of adding this as an option (--keep-deployments-on-error, or such), to support cases like that in the future, but probably YAGNI.
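A minimal sketch of what baking this into kbn-alert-load could look like, assuming the tool tracks the ids of the deployments it created (deleteDeployment here just shells out to ecctl; the exact command and flags should be verified):

```ts
import { execFile } from 'child_process';
import { promisify } from 'util';

const execFileAsync = promisify(execFile);

// Deployment ids created during this run, registered as they are created.
const createdDeployments = new Set<string>();

async function deleteDeployment(id: string): Promise<void> {
  // `deployment shutdown` plus a --force flag to skip confirmation (to verify).
  await execFileAsync('ecctl', ['deployment', 'shutdown', id, '--force']);
}

// Best-effort cleanup; a future --keep-deployments-on-error option could skip it.
async function cleanup(keepDeployments = false): Promise<void> {
  if (keepDeployments) return;
  for (const id of createdDeployments) {
    try {
      await deleteDeployment(id);
    } catch (err) {
      console.error(`failed to delete deployment ${id}`, err);
    }
  }
}

// Catch ctrl-c and similar terminations so "unclean" exits also clean up.
for (const signal of ['SIGINT', 'SIGTERM'] as const) {
  process.on(signal, () => {
    void cleanup().finally(() => process.exit(1));
  });
}
```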

@EricDavisX
Contributor

It's a good idea - I put a ticket into the kbn-alert-load project for it, Patrick. Thanks.
