
ci(k8s): disable weekly triggers #9928

Open
wants to merge 1 commit into master

Conversation

vponomaryov
Contributor

@vponomaryov vponomaryov commented Jan 28, 2025

Testing

  • [ ]

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@vponomaryov vponomaryov added the backport/none Backport is not required label Jan 28, 2025
@vponomaryov
Contributor Author

The K8S CI jobs have been failing for months already.
See https://argus.scylladb.com/dashboard/operator-master
They aren't getting any attention from anyone.

So, disable the weekly K8S triggers to stop wasting resources and money.
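
For reference, disabling a weekly trigger usually just means dropping the cron schedule from the job definition. A minimal, hypothetical Jenkins declarative pipeline fragment (not the actual scylla-cluster-tests job definition) to illustrate the shape of the change:

```groovy
// Hypothetical Jenkinsfile fragment; the stage name and schedule are
// assumptions, not the real SCT K8S pipeline.
pipeline {
    agent any

    // Before: the job carried a weekly schedule, e.g.
    // triggers {
    //     cron('H H * * 0')   // once a week
    // }
    // After: the triggers section is removed, so the job only runs
    // when started manually.

    stages {
        stage('k8s-functional') {   // placeholder stage name
            steps {
                echo 'run the SCT K8S test suite here'
            }
        }
    }
}
```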

@ylebi

ylebi commented Jan 28, 2025

@mflendrich please advise if you want to stop those for now.

@ylebi ylebi requested a review from mflendrich January 28, 2025 15:46
@mflendrich

@grzywin your opinion is needed here.

As I discussed with @zimnx today, these tests have coverage (e.g. AWS) that we don't have elsewhere -- by leaving those tests either red or disabled we are missing important coverage.

There are two options:

  1. unbreak this job right away if it's a small effort and close this PR
  2. merge this PR - disable this job (because there's no point in burning resources on red tests over and over) - and create an issue for a more elaborate fix.

@grzywin - can you look into this and recommend whether we do (1) or (2)?

@pehala
Contributor

pehala commented Jan 29, 2025

unbreak this job right away if it's a small effort and close this PR

Currently, I am not aware of anyone taking care of that job. Even if we unbreak it, we need someone to monitor it frequently.

@mflendrich

we need someone to monitor it frequently

Is it possible to get failure alerts to #team-operator-ci on Slack? That fits the team's existing workflow and would require zero proactive monitoring on top of what we do already.
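
One way to wire that up, assuming the Jenkins Slack Notification plugin is available and the job is a declarative pipeline (a sketch, not the current job configuration):

```groovy
// Hypothetical post section for the weekly K8S job; the channel name and
// plugin availability are assumptions.
pipeline {
    agent any
    stages {
        stage('k8s-functional') {
            steps {
                echo 'run the SCT K8S test suite here'
            }
        }
    }
    post {
        failure {
            // slackSend is provided by the Slack Notification plugin
            slackSend(
                channel: '#team-operator-ci',
                color: 'danger',
                message: "Weekly K8S job failed: ${env.JOB_NAME} #${env.BUILD_NUMBER} ${env.BUILD_URL}"
            )
        }
    }
}
```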

@zimnx

zimnx commented Jan 29, 2025

We won't be able to monitor these failures. The QA team doesn't version their jobs and libraries, so any merged PR can break them.

@mflendrich

We won't be able to monitor these failures. The QA team doesn't version their jobs and libraries, so any merged PR can break them.

Thanks for bringing this up, sounds concerning to me. Let's see what it takes to unbreak the red job and have the monitoring discussion separately.

@grzywin
Contributor

grzywin commented Jan 30, 2025

@grzywin your opinion is needed here.

As I discussed with @zimnx today, these tests have coverage (e.g. AWS) that we don't have elsewhere -- by leaving those tests either red or disabled we are missing important coverage.

There are two options:

  1. unbreak this job right away if it's a small effort and close this PR
  2. merge this PR - disable this job (because there's no point in burning resources on red tests over and over) - and create an issue for a more elaborate fix.

@grzywin - can you look into this and recommend whether we do (1) or (2)?

@mflendrich In the past, we already had a discussion about the benefits of having tests in two different places (the SCT framework and the Scylla Operator project itself). The link to the meeting and the Excel file describing what we have in SCT can be found here.

The conclusion was that many of the tests in SCT are already covered in Scylla Operator, and we don't want to maintain both. The important ones that are not covered in Scylla Operator are the Upgrade tests (which I trigger manually before each Operator release), and as you already mentioned, SCT also runs tests in AWS, which we don’t have in Scylla Operator.

Taking the above into account, and considering the decision that I will be working on tests directly in the Scylla Operator project, I opt for option 2. To ensure AWS testing is not overlooked, after closing this issue:

  • I might work on adding AWS Upgrade tests to the weekly triggers so that we can monitor both AWS and upgrade tests regularly and simultaneously
  • We can create a task (if there isn’t one already) in Scylla Operator to add E2E tests in AWS and work on this in the near future.

Either way, I am approving this PR.

Contributor

@grzywin grzywin left a comment


lgtm

@vponomaryov
Contributor Author

We won't be able to monitor these failures. The QA team doesn't version their jobs and libraries, so any merged PR can break them.

Yes, an actively developed project has the side effect that new changes may break things.
Jobs are separated by SCT branches.
Scylla-operator didn't have its own branch because it was easier to keep master working than to bother with backports.
It is not a problem to create scylla-operator branch(es) if needed (see the sketch below).
Sure, the maintenance takes some time, but you won't be ignored if you need to do something in SCT.
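
A hypothetical sketch of pinning an operator job to a dedicated SCT branch (the branch name below is made up):

```groovy
// Hypothetical checkout step; 'branch-operator-1.x' is an assumed name,
// not an existing scylla-cluster-tests branch.
checkout([
    $class: 'GitSCM',
    branches: [[name: 'refs/heads/branch-operator-1.x']],
    userRemoteConfigs: [[url: 'https://github.com/scylladb/scylla-cluster-tests.git']]
])
```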

JFYI: the people who test scylla-manager do use the SCT test coverage; they didn't drop it after the ownership switch.

@grzywin your opinion is needed here.
As I discussed with @zimnx today, these tests have coverage (e.g. AWS) that we don't have elsewhere -- by leaving those tests either red or disabled we are missing important coverage.
There are two options:

  1. unbreak this job right away if it's a small effort and close this PR
  2. merge this PR - disable this job (because there's no point in burning resources on red tests over and over) - and create an issue for a more elaborate fix.

@grzywin - can you look into this and recommend whether we do (1) or (2)?

@mflendrich In the past, we already had a discussion about the benefits of having tests in two different places (the SCT framework and the Scylla Operator project itself). The link to the meeting and the Excel file describing what we have in SCT can be found here.

It doesn't say anything about the value and the complexity of redoing that coverage elsewhere.

The conclusion was that many of the tests in SCT are already covered in Scylla Operator, and we don't want to maintain both.

High load?
High load plus controlled disruptions?
Multi-dc setup with high load and controlled disruptions?
Multi-tenant setups with controlled disruptions?
Performance?
All of the above on AWS/EKS and GCE/GKE platforms?

Are these covered in operator tests?

The important ones that are not covered in Scylla Operator are the Upgrade tests (which I trigger manually before each Operator release), and as you already mentioned, SCT also runs tests in AWS, which we don’t have in Scylla Operator.

The upgrade tests SCT has are of the following three types:

  • K8S platform upgrade
  • Scylla upgrade
  • Scylla operator upgrade

It is not enough to trigger them only for releases; we need to keep a finger on the pulse of their workability.

Taking the above into account, and considering the decision that I will be working on tests directly in the Scylla Operator project, I opt for option 2. To ensure AWS testing is not overlooked, after closing this issue:

  • I might work on adding AWS Upgrade tests to the weekly triggers so that we can monitor both AWS and upgrade tests regularly and simultaneously
  • We can create a task (if there isn’t one already) in Scylla Operator to add E2E tests in AWS and work on this in the near future.

Either way, I am approving this PR.

By dropping the SCT tests, the existing scylla-operator test coverage gets cut significantly from a quality perspective.
It is a step back of about 2-3 years.

This particular PR is only about what makes sense right now: do not waste money on something that is not being used.
I still believe it is a mistake to drop the SCT test coverage, which is unique.


@mflendrich mflendrich left a comment


OK, based on the context above and operating under the assumption that these tests provide substantial value that we shouldn't forgo: I think we should stick with the original plan of keeping this coverage as-is, with the only question being whether we unbreak it now (1) or later (2).

@grzywin can you assess the amount of work required to unbreak the job (as we discussed yesterday)?

@grzywin
Contributor

grzywin commented Jan 30, 2025

@mflendrich I will take a look at those failing jobs today and get back to you.

@ylebi As a side note, we should keep in mind that I am the only tester working on the Operator right now. As a result, in the future it might be difficult to keep the SCT tests stable at all times while simultaneously working on tests in the Go/scylla-operator repo.

@mflendrich

@ylebi As a side note, we should keep in mind that I am the only tester working on the Operator right now. As a result, in the future it might be difficult to keep the SCT tests stable at all times while simultaneously working on tests in the Go/scylla-operator repo.

We should have this conversation next week, as per my earlier comment, to figure out how to spend our resources best.

To me it looks like a planning exercise we'll need to conduct (whether building new coverage should come at the cost of sacrificing existing coverage in the SCTs).

Let's focus for now on the technicalities of what it takes to fix these jobs. After the weekend I'll start a discussion and we'll decide which existing tests we want to keep investing in, what we want to drop, and where we'll be adding new coverage. I hope this approach can bring some clarity and address everyone's concerns here.

Labels
backport/none Backport is not required

6 participants