
KEP: (sig-testing) continuously deploy k8s prow #2540

Merged
merged 2 commits into kubernetes:master on Mar 3, 2021

Conversation

chaodaiG
Contributor

As suggested at the sig-testing meeting today, I am creating this KEP for open discussion. It is based on the design doc presented at sig-testing: https://docs.google.com/document/d/1pBouf_tgJJ2Gga9Xa5-xObLjN7PiVXiBzL3RYTy56ng.

This proposal aims to automate the deployment of k8s prow to ease prow oncall workloads.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Feb 23, 2021
@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 23, 2021
@chaodaiG
Contributor Author

/assign @spiffxp
as sig-testing chair
/assign @ameukam
as sig-release chair
/assign @alvaroaleman
as prow approver

Please feel free to assign other approvers as you see fit.

@ameukam
Member

ameukam commented Feb 23, 2021

@chaodaiG I'm not a SIG Release chair (you may have misunderstood what Aaron said. :-))
Instead:
/assign @kubernetes/sig-release-leads

@chaodaiG
Contributor Author

> @chaodaiG I'm not a SIG Release chair (you may have misunderstood what Aaron said. :-))
> Instead:
> /assign @kubernetes/sig-release-leads

Thank you Arnaud!

@chaodaiG
Contributor Author

Can't repro this test failure locally on mac

@chaodaiG
Contributor Author

Repro'd on Linux; the root cause was that hack/update-toc.sh failed silently on Mac

chaodaiG added a commit to chaodaiG/test-infra that referenced this pull request Feb 26, 2021
As proposed in kubernetes/enhancements#2540, k8s prow will be bumped more frequently than once per work day, adding this label to the job so that it could help associate failures with specific prow versions
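For illustration, the version label could look something like this on a periodic job; the label key, value format, and job definition below are assumptions for the sketch, not the actual test-infra change:

```yaml
# Hypothetical sketch: tag a job with the prow version it ran under so that
# failures can be correlated with a specific bump. The label key is assumed.
periodics:
- name: ci-example-prow-smoke-test      # illustrative job name
  labels:
    prow.k8s.io/deployed-version: v20210226-abc1234  # updated by bump tooling
  interval: 1h
  decorate: true
  spec:
    containers:
    - image: alpine:3.13
      command: ["echo", "prow smoke test"]
```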
chaodaiG added a commit to chaodaiG/test-infra that referenced this pull request Mar 3, 2021
Related kubernetes/enhancements#2540

Report prow deployment job status on Slack instead of oncall manually posting
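For context, Prow can already report job status to Slack through crier's Slack reporter, configured in the prow config; a minimal sketch follows, where the channel name and template are assumptions rather than the actual prow.k8s.io settings:

```yaml
# Sketch of crier's Slack reporter configuration; channel and template values
# are illustrative, not prow.k8s.io's real config.
slack_reporter_configs:
  "*":
    channel: testing-ops              # assumed channel name
    job_types_to_report:
    - periodic
    job_states_to_report:
    - failure
    - error
    report_template: "Job {{.Spec.Job}} ended with state {{.Status.State}}: {{.Status.URL}}"
```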
@spiffxp (Member) left a comment

/approve
/lgtm
I ask that @kubernetes/sig-release-leads and other interested parties please add comments here on what they want to see to get this to implementable.


## Proposal

Prow autobump PRs are automatically merged every hour, but only during working hours on working days.
Member

We discussed in the sig-testing meeting starting at a much longer interval, given that some jobs don't complete for O(2h), and having a slower update rate may make for easier troubleshooting.

Contributor Author

I'll start with 3 hours for now.
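For illustration, a 3-hour working-hours cadence maps naturally onto a cron schedule on the autobump periodic; in this sketch the job name, image tag, and flags are assumptions:

```yaml
# Sketch: run the prow autobump job every 3 hours during working hours, Mon-Fri.
# Job name, image tag, and flags are illustrative assumptions.
periodics:
- name: periodic-prow-autobump
  cron: "0 9-17/3 * * 1-5"   # 09:00, 12:00, 15:00 cluster time, Monday-Friday
  decorate: true
  spec:
    containers:
    - image: gcr.io/k8s-prow/generic-autobumper:latest
      command:
      - generic-autobumper
      args:
      - --config=prow/autobump-config.yaml
```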


#### Automated Merging of Prow Autobump PRs

- The prow autobump job is already configured to run on work days only; change it to run at least one hour apart, so that it doesn't bump more frequently than once per hour.
Member

We should decouple prow auto bump from job image auto bump (I have an open issue for this in test-infra)

Member

This is a requirement, IMO, since we're going to alter the frequency and they're currently coupled.

Member

kubernetes/test-infra#21137 is the issue for it
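Once decoupled, one way the auto-merge itself could be wired up is a dedicated Tide query that merges bot-authored bump PRs when they carry the usual approval labels; the bot account below is an assumption:

```yaml
# Sketch of a Tide query auto-merging autobump PRs. The author account is
# hypothetical; the labels follow the usual kubernetes merge requirements.
tide:
  queries:
  - repos:
    - kubernetes/test-infra
    author: k8s-infra-autobump-bot     # hypothetical bot account
    labels:
    - lgtm
    - approved
    missingLabels:
    - do-not-merge/hold                # lets oncall pause deploys with /hold
```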

participating-sigs:
- sig-testing
- sig-release
status: provisional
Member

Address the review comments and I am happy to merge a followup that bumps this to implementable

- Manually apply the changes from rollback PR.

```
<<[UNRESOLVED]>>
Member

I would like to see a communication plan as part of this, e.g. how and when you will announce this change:

  • downstream
  • to the users of prow.k8s.io

We should include the plank version in testgrid.k8s.io jobs (cc @MushuEE)

I think this shouldn't be subject to v1.21 release freeze dates, but we should have a plan for how to be respectful toward the test freeze -> release phase.

This should have alpha/beta/GA phases called out: low frequency as part of alpha, testgrid plumbed as part of beta, high frequency and decoupled jobs as part of GA.

Contributor Author

Not super familiar with the k8s release timeline. From the latest announcement I see that Test Freeze is Wednesday, March 24, and the deadline to submit an entry is Thursday, March 25, EOD Pacific Time. So the official release date is probably some time after March 25; should we wait until the official release of 1.21.0?

Contributor Author

I have updated #2553 with:

  • announcement
  • alpha/beta/GA phases

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 3, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, chaodaiG, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 3, 2021
@k8s-ci-robot k8s-ci-robot merged commit fd30cf6 into kubernetes:master Mar 3, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Mar 3, 2021
@BenTheElder (Member) left a comment

first pass: there's a lot I'd like to see before implementable.

## Implementation History


## Alternatives
Member

this does not seem very detailed currently.

Contributor Author

My bad; I've added much more detail.


When prow stops functioning after a bump, prow oncall should:
- Stop auto-deploying by commenting `/hold` on the latest autobump PR.
- Manually create a rollback PR for rolling back to a known good version.
Member

how?

Contributor Author

Updated with more details on how to find a good version, mostly based on what @alvaroaleman mentioned in the sig-testing meeting, since OpenShift has been doing this for quite a while.

When prow stops functioning after a bump, prow oncall should:
- Stop auto-deploying by commenting `/hold` on the latest autobump PR.
- Manually create a rollback PR for rolling back to a known good version.
- Manually apply the changes from the rollback PR.
Member

how?

Contributor Author

Added



#### Breaking Changes in Prow
Breaking changes in prow will require manual intervention. Currently prow isn't able to handle these intelligently: it was not designed around versioned APIs, so a kubernetes conversion webhook cannot help cope with breaking changes across major API revisions.
One possible way of dealing with breaking changes is:
- Prow oncall inspects prow logs and breaking-change announcements once per week, and takes action based on deprecation warnings from prow logs and breaking changes from ANNOUNCEMENTS.md.
Member

is this sufficient if we're merging hourly?

Contributor Author

Good point; as mentioned in https://github.com/kubernetes/enhancements/pull/2540/files#r586723114, we'll start with 3 hours, and we plan to rely on alerts rather than oncall log-scanning for discovering prow errors.

### Notes/Constraints/Caveats (Optional)

#### Breaking Changes in Prow
Breaking changes in prow will require manual intervention. Currently prow isn't able to handle these intelligently: it was not designed around versioned APIs, so a kubernetes conversion webhook cannot help cope with breaking changes across major API revisions.
Member

perhaps this should change?

Contributor Author

Totally agree with you, but I think this should be a separate discussion, considering the scope.


#### Prow Users

Shouldn't see any change; prow breakage should be discovered by the prow monitoring system and a rollback performed. The chance of prow breaking is almost identical to what we have today (assuming there is no more than a single breaking change per day).
Member

Our most recent outage (the webhook handler panicking and dropping events) was discovered by non-oncall humans, even at the current update frequency.
How do we plan to address this?
kubernetes/test-infra#21090

Member

kubernetes/test-infra#21090 (comment) let's noodle over here

Contributor Author

kubernetes/test-infra#21090 (comment) pointed out that this could be discovered by crash-loop detection; does this work?

My general idea is that this is an ongoing process: we need to make sure that each new type of prow error discovered by non-oncall humans becomes discoverable by an alert in the future.

Contributor Author

Cross-posting: kubernetes/test-infra#21090 (comment), the issue in #21090 will be caught by prometheus alerting in the future
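For context, crash-loop detection like this is commonly a kube-state-metrics based Prometheus alert; a minimal sketch follows, assuming standard metric names and illustrative thresholds rather than prow.k8s.io's actual rules:

```yaml
# Sketch of a Prometheus alerting rule for crash-looping prow components.
# Namespace, window, and threshold are illustrative assumptions.
groups:
- name: prow-component-health
  rules:
  - alert: ProwComponentCrashLooping
    expr: increase(kube_pod_container_status_restarts_total{namespace="default"}[30m]) > 3
    for: 5m
    labels:
      severity: critical
    annotations:
      description: 'Pod {{ $labels.pod }} is restarting repeatedly; check the latest prow bump.'
```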

- What’s Not Changed
- React to prow alerts and take actions.
- What’s Changed
- No more manually inspecting prow healthiness.
Member

In favor of what?
One of the things manually inspected right now is the logs, where we see failures that are not caught by automated monitoring today.

Contributor Author

I can see your point. This is an overstatement: it doesn't mean we will never inspect prow logs; I think it makes more sense for prow oncall to inspect them once per week as a sanity check. As pointed out above, in the future we would like to discover prow errors from passive alerts instead of actively inspecting logs.

@chaodaiG chaodaiG deleted the kep-prow-cd branch March 3, 2021 22:46
chaodaiG added a commit to chaodaiG/enhancements that referenced this pull request Mar 3, 2021
@chaodaiG (Contributor Author) left a comment

Responded to most of the comments here; the mentioned improvements are all included in #2553.

There is one comment, https://github.com/kubernetes/enhancements/pull/2540/files#r586729058, that is not covered yet; I will need to think a bit about how to write that up.


chaodaiG added a commit to chaodaiG/enhancements that referenced this pull request Mar 3, 2021
chaodaiG added a commit to chaodaiG/enhancements that referenced this pull request Mar 3, 2021
chaodaiG added a commit to chaodaiG/enhancements that referenced this pull request Mar 4, 2021
chaodaiG added a commit to chaodaiG/enhancements that referenced this pull request Mar 4, 2021
@spiffxp
Member

spiffxp commented Mar 5, 2021

ref: #2539

chaodaiG added a commit to chaodaiG/enhancements that referenced this pull request Mar 10, 2021
chaodaiG added a commit to chaodaiG/enhancements that referenced this pull request Mar 10, 2021
chaodaiG added a commit to chaodaiG/enhancements that referenced this pull request Mar 31, 2021
chaodaiG added a commit to chaodaiG/enhancements that referenced this pull request Mar 31, 2021
chaodaiG added a commit to chaodaiG/enhancements that referenced this pull request Mar 31, 2021
k8s-ci-robot added a commit that referenced this pull request Mar 31, 2021