KEP: (sig-testing) continuously deploy k8s prow #2540
Conversation
/assign @spiffxp Please feel free to assign other approvers as you see fit.
@chaodaiG I'm not a SIG Release chair (you may have misunderstood what Aaron said. :-))
Thank you Arnaud!
Force-pushed from 30faaa1 to 6f2a674.
Can't repro this test failure locally on Mac.
Force-pushed from 6f2a674 to 4cd4462.
Repro on Linux, the root cause was
As proposed in kubernetes/enhancements#2540, k8s prow will be bumped more frequently than once per work day, adding this label to the job so that it could help associate failures with specific prow versions
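For illustration, prow job configs support a `labels` map that is applied to the resulting ProwJob and pod, which is one way such a version label could be carried. A minimal sketch, assuming a hypothetical job name, label key, and tag value (this is not the actual test-infra configuration):

```yaml
periodics:
  - name: ci-example-periodic                 # hypothetical job name
    interval: 1h
    labels:
      # Hypothetical label bumped together with the prow images, so a failure
      # can be associated with the prow version that was deployed at the time.
      prow.k8s.io/deployed-version: v20210303-abcdef12345
    spec:
      containers:
        - image: gcr.io/k8s-example/test-runner:latest   # hypothetical image
          command: ["./run-tests.sh"]
```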
Related kubernetes/enhancements#2540 Report prow deployment job status on Slack instead of oncall manually posting
/approve
/lgtm
I ask that @kubernetes/sig-release-leads and other interested parties please add comments here on what they want to see to get this to implementable
## Proposal

Prow autobump PRs are automatically merged every hour, only during working hours on working days.
We discussed in the SIG Testing meeting starting at a much longer interval, given that some jobs take O(2h) to complete, and a slower update rate may make for easier troubleshooting.
I'll start with 3 hours for now.
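For context on what a 3-hour cadence could look like: prow periodic jobs accept a `cron` field, so limiting the bump/deploy job to working hours on weekdays is a matter of the cron expression. A sketch under assumed names (the job name, image, and command are placeholders, not the real test-infra job):

```yaml
periodics:
  - name: ci-example-prow-autobump            # hypothetical job name
    # Minute 0 of hours 9, 12, and 15, Monday through Friday:
    # i.e. every 3 hours during working hours on working days.
    cron: "0 9-17/3 * * 1-5"
    spec:
      containers:
        - image: gcr.io/k8s-example/autobumper:latest    # hypothetical image
          command: ["/autobumper"]
```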
#### Automated Merging of Prow Autobump PRs

- Prow autobump job is already configured to run on work days only; change it to run at least one hour apart, so that it doesn't bump more frequently than once per hour.
We should decouple prow auto bump from job image auto bump (I have an open issue for this in test-infra)
This is a requirement, IMO, since we're going to alter the frequency and they're currently coupled.
kubernetes/test-infra#21137 is the issue for it.
participating-sigs:
  - sig-testing
  - sig-release
status: provisional
Address the review comments and I am happy to merge a followup that bumps this to implementable
- Manually apply the changes from rollback PR.

<<[UNRESOLVED]>>
I would like to see a communication plan as part of this, e.g. how and when you will announce this change:
- downstream
- to the users of prow.k8s.io
We should include the plank version in testgrid.k8s.io jobs (cc @MushuEE)
I think this shouldn't be subject to v1.21 release freeze dates, but we should have a plan for how to be respectful toward the test freeze -> release phase.
This should have alpha/beta/ga phases called out: low frequency as part of alpha, testgrid plumbed as part of beta, high frequency and decoupled jobs as part of GA.
Not super familiar with the k8s release timeline. I see "Test Freeze - Wednesday, March 24" and "The deadline to submit an entry is Thursday, March 25, EOD Pacific Time" from the latest announcement. So the official release date is probably some time after March 25; should we wait until the official release of 1.21.0?
Have updated #2553 with:
- announcement
- alpha/beta/ga phases
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: alvaroaleman, chaodaiG, spiffxp. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
first pass: there's a lot I'd like to see before implementable.
## Implementation History

## Alternatives
this does not seem very detailed currently.
My bad, added much more details
When prow stops functioning after a bump, prow oncall should:
- Stop auto-deploying by commenting `/hold` on latest autobump PR.
- Manually create rollback PR for rolling back to known good version.
how?
Updated with more details on how to find a good version, mostly from what @alvaroaleman mentioned in the sig-testing meeting, since OpenShift has been doing this for quite a while.
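To make the rollback step concrete: a rollback PR essentially pins the prow component images back to the last known-good tag. A hypothetical excerpt of such a change to a component deployment manifest (the tags below are placeholders, and `plank` is used only as an example component):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: plank
spec:
  template:
    spec:
      containers:
        - name: plank
          # was: gcr.io/k8s-prow/plank:v20210305-broken0000  (bad bump, placeholder tag)
          image: gcr.io/k8s-prow/plank:v20210301-known0good  # known-good placeholder tag
```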
When prow stops functioning after a bump, prow oncall should:
- Stop auto-deploying by commenting `/hold` on latest autobump PR.
- Manually create rollback PR for rolling back to known good version.
- Manually apply the changes from rollback PR.
how?
Added
#### Breaking Changes in Prow
Breaking changes in prow will require manual intervention. Currently prow isn't able to handle these intelligently, as it was not designed with API versions in mind, and thus the kubernetes conversion webhook cannot help cope with breaking changes across major APIs.
One possible way of dealing with breaking changes is:
- Prow oncall inspects prow logs and breaking-change announcements once per week, and takes action based on deprecation warnings from prow logs and breaking changes from ANNOUNCEMENTS.md.
is this sufficient if we're merging hourly?
Good point. As mentioned in https://github.com/kubernetes/enhancements/pull/2540/files#r586723114, we'll start with 3 hours, and we plan to rely on alerts instead of oncall scanning logs for discovering prow errors.
### Notes/Constraints/Caveats (Optional)

#### Breaking Changes in Prow
Breaking changes in prow will require manual intervention. Currently prow isn't able to handle these intelligently, as it was not designed with API versions in mind, and thus the kubernetes conversion webhook cannot help cope with breaking changes across major APIs.
perhaps this should change?
Totally agree with you, but thinking about the scope, I think this should be a separate discussion.
#### Prow Users

Shouldn't see any change; prow breakage should be discovered by the prow monitoring system and a rollback will be performed. The chance of prow being broken is almost identical to what we have today (assuming there is not more than a single breaking change every day).
our most recent outage (the webhook handler panic-ing and dropping events) was discovered by non-oncall humans, even at the current update frequency.
how do we plan to address this?
kubernetes/test-infra#21090
kubernetes/test-infra#21090 (comment) let's noodle over here
kubernetes/test-infra#21090 (comment) pointed out that this could be discovered by crash-looping detection, does this work?
My general idea is that this is an ongoing process; we need to make sure each new type of prow error currently discovered by non-oncall humans becomes discoverable by alerts in the future.
Cross-posting: kubernetes/test-infra#21090 (comment), the issue in #21090 will be caught by prometheus alerting in the future
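As a rough sketch of the crash-loop alerting being referred to (the rule name, namespace, and thresholds here are assumptions, not the actual test-infra monitoring configuration), a Prometheus rule built on the standard kube-state-metrics restart counter could look like:

```yaml
groups:
  - name: prow-component-health                # hypothetical rule group
    rules:
      - alert: ProwComponentCrashLooping
        # Fire when any container in the (assumed) prow namespace restarted
        # more than 3 times within 30 minutes, e.g. a component panicking
        # right after an automated bump.
        expr: increase(kube_pod_container_status_restarts_total{namespace="default"}[30m]) > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prow component {{ $labels.pod }} is crash-looping"
```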
- What's Not Changed
  - React to prow alerts and take actions.
- What's Changed
  - No more manually inspecting prow healthiness.
in favor of?
one of the things manually inspected right now is the logs, in which we have failures that are not caught by automated monitoring today.
I can see your point. This is an overstatement; it's not actually meant as "never" inspecting prow logs. I think it makes more sense for prow oncall to inspect once per week as a sanity check. As pointed out above, in the future we would like to discover prow errors from passive alerts instead of actively inspecting logs.
Responded to most of the comments here, and the mentioned improvements are all included in #2553.
There is one comment, https://github.com/kubernetes/enhancements/pull/2540/files#r586729058, not covered yet; will need to think a little bit about how to write that up.
ref: #2539
KEP-2539: Addressing comments from #2540
As suggested in the sig-testing meeting today, creating this KEP for open discussion. This is based on the design doc presented at sig-testing: https://docs.google.com/document/d/1pBouf_tgJJ2Gga9Xa5-xObLjN7PiVXiBzL3RYTy56ng.
This proposal aims to automate the deployment of k8s prow to ease prow oncall workloads.