
ci(k8s): disable weekly triggers #9928

Open
wants to merge 1 commit into master

Conversation

vponomaryov
Contributor

@vponomaryov vponomaryov commented Jan 28, 2025

Testing

  • [ ]

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@vponomaryov vponomaryov added the backport/none Backport is not required label Jan 28, 2025
@vponomaryov
Contributor Author

The K8S CI jobs have been failing for months already.
See https://argus.scylladb.com/dashboard/operator-master
They aren't getting any attention from anyone.

So, disable the weekly K8S triggers to stop wasting resources and money.
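
For reference, disabling a weekly trigger usually just means dropping the cron schedule from the job definition. A minimal, hypothetical Jenkins declarative pipeline fragment (not the actual scylla-cluster-tests job definition) to illustrate the shape of the change:

```groovy
// Hypothetical Jenkinsfile fragment; the stage name and schedule are
// assumptions, not the real SCT K8S pipeline.
pipeline {
    agent any

    // Before: the job carried a weekly schedule, e.g.
    // triggers {
    //     cron('H H * * 0')   // once a week
    // }
    // After: the triggers section is removed, so the job only runs
    // when started manually.

    stages {
        stage('k8s-functional') {   // placeholder stage name
            steps {
                echo 'run the SCT K8S test suite here'
            }
        }
    }
}
```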

@ylebi

ylebi commented Jan 28, 2025

@mflendrich please advise if you want to stop those for now.

@ylebi ylebi requested a review from mflendrich January 28, 2025 15:46
@mflendrich

@grzywin your opinion is needed here.

As I discussed with @zimnx today, these tests have coverage (e.g. AWS) that we don't have elsewhere -- by leaving those tests either red or disabled we are missing important coverage.

There are two options:

  1. unbreak this job right away if it's a small effort and close this PR
  2. merge this PR - disable this job (because there's no point in burning resources on red tests over and over) - and create an issue for a more elaborate fix.

@grzywin - can you look into this and recommend whether we do (1) or (2)?

@pehala
Contributor

pehala commented Jan 29, 2025

unbreak this job right away if it's a small effort and close this PR

Currently, I am not aware of anyone taking care of that job. Even if we unbreak it, we need someone to monitor it frequently.

@mflendrich

we need someone to monitor it frequently

Is it possible to get failure alerts to #team-operator-ci on Slack? That fits the team's existing workflow and would require zero proactive monitoring on top of what we do already.
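
One way to wire that up, assuming the Jenkins Slack Notification plugin is available and the job is a declarative pipeline (a sketch, not the current job configuration):

```groovy
// Hypothetical post section for the weekly K8S job; the channel name and
// plugin availability are assumptions.
pipeline {
    agent any
    stages {
        stage('k8s-functional') {
            steps {
                echo 'run the SCT K8S test suite here'
            }
        }
    }
    post {
        failure {
            // slackSend is provided by the Slack Notification plugin
            slackSend(
                channel: '#team-operator-ci',
                color: 'danger',
                message: "Weekly K8S job failed: ${env.JOB_NAME} #${env.BUILD_NUMBER} ${env.BUILD_URL}"
            )
        }
    }
}
```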

@zimnx

zimnx commented Jan 29, 2025

We won't be able to monitor these failures. The QA team doesn't version their jobs and libraries, so any merged PR can break them.

@mflendrich

We won't be able to monitor these failures. The QA team doesn't version their jobs and libraries, so any merged PR can break them.

Thanks for bringing this up, sounds concerning to me. Let's see what it takes to unbreak the red job and have the monitoring discussion separately.

@grzywin
Contributor

grzywin commented Jan 30, 2025

@grzywin your opinion is needed here.

As I discussed with @zimnx today, these tests have coverage (e.g. AWS) that we don't have elsewhere -- by leaving those tests either red or disabled we are missing important coverage.

There are two options:

  1. unbreak this job right away if it's a small effort and close this PR
  2. merge this PR - disable this job (because there's no point in burning resources on red tests over and over) - and create an issue for a more elaborate fix.

@grzywin - can you look into this and recommend whether we do (1) or (2)?

@mflendrich In the past, we already had a discussion about the benefits of having tests in two different places (the SCT framework and the Scylla Operator project itself). The link to the meeting and the Excel file describing what we have in SCT can be found here.

The conclusion was that many of the tests in SCT are already covered in Scylla Operator, and we don't want to maintain both. The important ones that are not covered in Scylla Operator are the Upgrade tests (which I trigger manually before each Operator release), and as you already mentioned, SCT also runs tests in AWS, which we don’t have in Scylla Operator.

Taking the above into account, and considering the decision that I will be working on tests directly in the Scylla Operator project, I opt for option 2. To ensure AWS testing is not overlooked, after closing this issue:

  • I might work on adding AWS Upgrade tests to the weekly triggers so that we can monitor both AWS and upgrade tests regularly and simultaneously
  • We can create a task (if there isn’t one already) in Scylla Operator to add E2E tests in AWS and work on this in the near future.

Either way, I am approving this PR.

Contributor

@grzywin grzywin left a comment


lgtm

@vponomaryov
Contributor Author

We won't be able to monitor these failures. The QA team doesn't version their jobs and libraries, so any merged PR can break them.

Yes, an actively developed project has the side effect that new changes may break things.
Jobs are separated by SCT branches.
Scylla-operator didn't have its own branch because it was easier to keep master working than to bother with backports.
It is not a problem to create scylla-operator branch(es) if needed (see the sketch below).
Sure, the maintenance takes some time, but you won't be ignored if you need to do something in SCT.
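
A hypothetical sketch of pinning an operator job to a dedicated SCT branch (the branch name below is made up):

```groovy
// Hypothetical checkout step; 'branch-operator-1.x' is an assumed name,
// not an existing scylla-cluster-tests branch.
checkout([
    $class: 'GitSCM',
    branches: [[name: 'refs/heads/branch-operator-1.x']],
    userRemoteConfigs: [[url: 'https://github.com/scylladb/scylla-cluster-tests.git']]
])
```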

JFYI: the people who test scylla-manager do use the SCT test coverage; they didn't drop it after the ownership switch.

@grzywin your opinion is needed here.
As I discussed with @zimnx today, these tests have coverage (e.g. AWS) that we don't have elsewhere -- by leaving those tests either red or disabled we are missing important coverage.
There are two options:

  1. unbreak this job right away if it's a small effort and close this PR
  2. merge this PR - disable this job (because there's no point in burning resources on red tests over and over) - and create an issue for a more elaborate fix.

@grzywin - can you look into this and recommend whether we do (1) or (2)?

@mflendrich In the past, we already had a discussion about the benefits of having tests in two different places (the SCT framework and the Scylla Operator project itself). The link to the meeting and the Excel file describing what we have in SCT can be found here.

It doesn't say anything about the value and the complexity of redoing that coverage elsewhere.

The conclusion was that many of the tests in SCT are already covered in Scylla Operator, and we don't want to maintain both.

High load?
High load plus controlled disruptions?
Multi-dc setup with high load and controlled disruptions?
Multi-tenant setups with controlled disruptions?
Performance?
All of the above on AWS/EKS and GCE/GKE platforms?

Are these covered in operator tests?

The important ones that are not covered in Scylla Operator are the Upgrade tests (which I trigger manually before each Operator release), and as you already mentioned, SCT also runs tests in AWS, which we don’t have in Scylla Operator.

The upgrade tests SCT has are of the following three types:

  • K8S platform upgrade
  • Scylla upgrade
  • Scylla operator upgrade

It is not enough to trigger them only for releases; we need to keep a finger on the pulse of their workability.

Taking the above into account, and considering the decision that I will be working on tests directly in the Scylla Operator project, I opt for option 2. To ensure AWS testing is not overlooked, after closing this issue:

  • I might work on adding AWS Upgrade tests to the weekly triggers so that we can monitor both AWS and upgrade tests regularly and simultaneously
  • We can create a task (if there isn’t one already) in Scylla Operator to add E2E tests in AWS and work on this in the near future.

Either way, I am approving this PR.

By dropping the SCT tests, the existing scylla-operator test coverage gets cut significantly from a quality perspective.
It is a step back of about 2-3 years.

This particular PR is only about what makes sense right now: do not waste money on something that is not being used.
I still believe it is a mistake to drop the SCT test coverage, which is unique.


@mflendrich mflendrich left a comment


OK, based on the context above and operating under the assumption that these tests provide substantial value that we shouldn't forgo: I think we should stick with the original plan of keeping this coverage as-is, with the only question being whether we unbreak it now (1) or later (2).

@grzywin can you assess the amount of work required to unbreak the job (as we discussed yesterday)?

@grzywin
Contributor

grzywin commented Jan 30, 2025

@mflendrich I will take a look at those failing jobs today and get back to you.

@ylebi As a side note, we should keep in mind that I am the only tester working on the Operator right now. As a result, in the future it might be difficult to keep the SCT tests stable at all times while simultaneously working on tests in the Go/scylla-operator repo.

@mflendrich

@ylebi As a side note, we should keep in mind that I am the only tester working on the Operator right now. As a result, in the future it might be difficult to keep the SCT tests stable at all times while simultaneously working on tests in the Go/scylla-operator repo.

We should have this conversation next week, as per my earlier comment, to figure out how to spend our resources best.

To me it looks like a planning exercise we'll need to conduct (whether building new coverage should come at the cost of sacrificing existing coverage in the SCTs).

Let's focus for now on the technicalities of what it takes to fix these jobs. After the weekend I'll start a discussion and we'll decide which existing tests we want to keep investing in, what we want to drop, and where we'll be adding new coverage. I hope this approach can bring some clarity and address everyone's concerns here.

Labels
backport/none Backport is not required

6 participants