Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-3017: Promote unhealthyPodEvictionPolicy for PDBs to Beta #3777

Merged
merged 1 commit into from
Feb 8, 2023

Conversation

atiratree
Copy link
Member

  • One-line PR description: Promote unhealthyPodEvictionPolicy for PDBs to Beta
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 24, 2023
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 24, 2023
@atiratree atiratree force-pushed the update-healthy-policy branch from 7a21d77 to 2cde840 Compare January 24, 2023 13:48
@atiratree
Copy link
Member Author

/assign @ravisantoshgudimetla
/assign @soltysh
/assign @liggitt

Copy link
Member

@liggitt liggitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm

keps/prod-readiness/sig-apps/3017.yaml Outdated Show resolved Hide resolved
Comment on lines 310 to 313
- As `IfHealthyBudget` is generally the less desirable option, it would be good to inform users that they are using it.
We can achieve this by defaulting `spec.unhealthyPodEvictionPolicy` to a `IfHealthyBudget` value.
This might prompt users to re-evaluate the value and set it to `AlwaysAllow` which is preferable for most of the applications.
The default value should preserve original behavior and handle null values. This should be tested and documented.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence whether to default the field or not... I'm not sure how much it would actually inform/drive user awareness. @soltysh, thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if defaulting will help with the adoption or rise awareness, @wojtek-t also pointed out that here that it might cause issues for existing CI/CD flows where users have predefined PDBs with nil values for those fields which would cause them being pushed to the server. I'm inclined to leave it as is, for now.

Do we have a precedence before for changing defaults in similar situations? If we decide to do that, I'd probably go the path of adding a warning for nil values and setting the default in the next release, to give users one release to adopt, maybe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's plenty of precedence for defaulting of newly introduced fields, conditional on their feature being enabled, which is what is being considered here:

git grep 'Gate.Enabled' -- 'pkg/apis/*defaults.go'
pkg/apis/apps/v1/defaults.go:           if utilfeature.DefaultFeatureGate.Enabled(features.MaxUnavailableStatefulSet) {
pkg/apis/apps/v1/defaults.go:   if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) {
pkg/apis/apps/v1beta1/defaults.go:      if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) {
pkg/apis/apps/v1beta1/defaults.go:              if utilfeature.DefaultFeatureGate.Enabled(features.MaxUnavailableStatefulSet) {
pkg/apis/apps/v1beta2/defaults.go:              if utilfeature.DefaultFeatureGate.Enabled(features.MaxUnavailableStatefulSet) {
pkg/apis/apps/v1beta2/defaults.go:      if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) {
pkg/apis/storage/v1/defaults.go:        if obj.Spec.SELinuxMount == nil && utilfeature.DefaultFeatureGate.Enabled(features.SELinuxMountReadWriteOncePod) {
pkg/apis/storage/v1beta1/defaults.go:   if obj.Spec.SELinuxMount == nil && utilfeature.DefaultFeatureGate.Enabled(features.SELinuxMountReadWriteOncePod) {

that said, there's also evidence that some clients don't handle defaulted fields correctly / gracefully in general (including the kubelet choking on new defaulted fields inside podspec - kubernetes/kubernetes#63814, or clients doing diffs being unhappy with defaulted fields - argoproj/argo-cd#11143), so unless there's a strong reason to default, I'd probably leave the field nil by default and document which explicit option that corresponds to

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so unless there's a strong reason to default, I'd probably leave the field nil by default and document which explicit option that corresponds to

+1 (unless there are other reasons not mentioned here).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for nil, in that case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack, updated the KEP to keep the field nil by default

@atiratree atiratree force-pushed the update-healthy-policy branch from 2cde840 to 458d468 Compare January 24, 2023 17:46
@@ -530,6 +532,7 @@ For each of them, fill in the following information by copying the below templat
- 2021-10-24: Proposed KEP for adding the new behavior in alpha status in 1.24.
- 2022-11-11: Initial alpha implementation merged into 1.26
- 2022-12-07: KEP rewritten to match the implementation (PodHealthyPolicy was renamed to UnhealthyPodEvictionPolicy)
- 2023-01-24: Update for beta promotion
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not change the default behavior. Users will have to specify a policy on the PDB for behavior to be affected.

What happens if a policy is specified and then the feature is disabled?

@@ -530,6 +532,7 @@ For each of them, fill in the following information by copying the below templat
- 2021-10-24: Proposed KEP for adding the new behavior in alpha status in 1.24.
- 2022-11-11: Initial alpha implementation merged into 1.26
- 2022-12-07: KEP rewritten to match the implementation (PodHealthyPolicy was renamed to UnhealthyPodEvictionPolicy)
- 2023-01-24: Update for beta promotion
keps/sig-apps/3017-pod-healthy-policy-for-pdb/README.md Outdated Show resolved Hide resolved
@@ -277,6 +277,7 @@ https://storage.googleapis.com/k8s-triage/index.html
- <test>: <link to test coverage>
-->

[TestEvictionWithUnhealthyPodEvictionPolicy](https://github.com/kubernetes/kubernetes/blob/c8010537913422cc221cdd784936ff99817f621c/test/integration/evictions/evictions_test.go#L417): https://storage.googleapis.com/k8s-triage/index.html?test=UnhealthyPodEvictionPolicy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more comments (on top of what @logicalhan already pointed below too):

L364: "No, but they will be added for alpha"

Alpha has happened - can you update with the test that were added?

L370: "It does not change the default behavior. Users will have to specify a policy
on the PDB for behavior to be affected."

That's not really answer to the question. The question is not what changes, but how can a rollout fail? (e.g. apiserver can start crashing, etc.)

L375: Please specify the exact metric that reports that

L378: was this test run? If so, please update. If not, please ensure this will happen prior promoting feature gate to beta.

L400: the whole monitoring section is not filled-in

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this test run? If so, please update. If not, please ensure this will happen prior promoting feature gate to beta.

I will try to run this test soon

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

following up here #3777 (review)

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atiratree try addressing the comments from Han and Wojtek asap

@atiratree atiratree force-pushed the update-healthy-policy branch from 458d468 to 43b197b Compare February 6, 2023 15:48
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 6, 2023
@atiratree
Copy link
Member Author

Thanks for reviews everyone. I have tried to resolve all the feedback. Can you please take a look again @logicalhan @wojtek-t @soltysh @liggitt


#### GA

- Every bug report is fixed.
- Introduce E2E tests for this field and confirm their stability.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it enough to have the E2E tests before GA or should it be sooner?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, having a test run at beta, when we enable the feature gate as the default is usually sufficient.

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
from sig-apps pov


#### GA

- Every bug report is fixed.
- Introduce E2E tests for this field and confirm their stability.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, having a test run at beta, when we enable the feature gate as the default is usually sufficient.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 6, 2023
Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two more minor comments - other than that LGTM from PRR perspective.

@@ -359,18 +361,25 @@ The eviction API will again start using the `unhealthyPodEvictionPolicy` if prov

###### Are there any tests for feature enablement/disablement?

No, but they will be added for alpha.
- [TestPodDisruptionBudgetStrategy](https://github.com/kubernetes/kubernetes/blob/06914bdaf51fc1b91501c332bd69d439cd370581/pkg/registry/policy/poddisruptionbudget/strategy_test.go#L46)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are all great and needed tests, but they are more of a feature tests.

Enablement/disablement tests are tests that verify what happens if we change the FG in the middle of the test.

Wlll you be able to add a test like this one described in the template comment:
https://github.com/kubernetes/enhancements/blame/master/keps/NNNN-kep-template/README.md#L505-L509
[and add a TODO for it here?]

Copy link
Member Author

@atiratree atiratree Feb 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, there are 2 tests of these that do that https://github.com/kubernetes/kubernetes/blob/06914bdaf51fc1b91501c332bd69d439cd370581/pkg/registry/policy/poddisruptionbudget/strategy_test.go#L96-L114. Nevertheless I have added a TODO to add more comprehensive tests for disablement/enablement,

@atiratree atiratree force-pushed the update-healthy-policy branch from 43b197b to e63c0c6 Compare February 7, 2023 17:37
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 7, 2023
8. Verify that eviction of pods for Deployment A uses the `UnhealthyPodEvictionPolicy` UnhealthyPodEvictionPolicy and eviction of pods for
StatefulSet B uses the default behavior.
8. Verify that eviction of pods for Deployment A and StatefulSet B use the default behavior.
Verify that the `AlwaysAllow` UnhealthyPodEvictionPolicy can be set again to a PDB of Deployment A and test the eviction behavior
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done the manual upgrade->downgrade->upgrade test, but the outcome is not the one as originally described by @mortent.

That is, when we downgrade to 1.25 the unhealthyPodEvictionPolicy field is preserved in etcd, until an update happens to the PDB. Eg by dirsuption controller when updating status, which is triggered for example by eviction that we do in step 5. The field is then removed from etcd and will not be present when we upgrade to 1.26.

Only hacky way to preserve it is to ensure there is no update to the PDB.

I am not sure if this is expected behaviour for new fields. I have found something relevant mentioned here https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md

It must be possible to rollback to a previous version of API server that does not include your change and have no impact on API objects which do not use your change. API objects that use your change will be impacted in case of a rollback.

so it seems we should expect impact on API objects in case of a rollback, so the current behaviour would be correct?

I have updated the KEP to reflect that, but I am not sure if I am not missing something. Can @liggitt or anybody else confirm that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - this was rollback to the version that doesn't recognize this field, so if something was touching this object, the fact that this field is dropped isn't unexpected.

I think it would be useful to make this test again but switching from 1.27 -> 1.26 -> 1.27, in which case you would be downgrading to a version that already supports that and see that (an alternative is to just disable FG within 1.26 version).

If you could add a TODO for doing that, it would be great.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for confirming.

Good idea, I have added a TODO to test the 1.27 upgrade path

Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atiratree - I just have one super minor comment.
I'm approving it so that you don't have to wait for me, but please apply before merging.

/approve PRR
/hold

No, but they will be added for alpha.
- [TestPodDisruptionBudgetStrategy](https://github.com/kubernetes/kubernetes/blob/06914bdaf51fc1b91501c332bd69d439cd370581/pkg/registry/policy/poddisruptionbudget/strategy_test.go#L96-L114)

TODO: add more comprehensive enablement/disablement tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing me at the tests above - I missed that somehow when looking into all those tests initially.

This is already great - if you have cycles adding more is of course appreciated, but the ones that you mention above actually make me already happy :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, in that case I have removed the TODO for more tests :)

8. Verify that eviction of pods for Deployment A uses the `UnhealthyPodEvictionPolicy` UnhealthyPodEvictionPolicy and eviction of pods for
StatefulSet B uses the default behavior.
8. Verify that eviction of pods for Deployment A and StatefulSet B use the default behavior.
Verify that the `AlwaysAllow` UnhealthyPodEvictionPolicy can be set again to a PDB of Deployment A and test the eviction behavior
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - this was rollback to the version that doesn't recognize this field, so if something was touching this object, the fact that this field is dropped isn't unexpected.

I think it would be useful to make this test again but switching from 1.27 -> 1.26 -> 1.27, in which case you would be downgrading to a version that already supports that and see that (an alternative is to just disable FG within 1.26 version).

If you could add a TODO for doing that, it would be great.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 8, 2023
@atiratree atiratree force-pushed the update-healthy-policy branch from e63c0c6 to f2b84af Compare February 8, 2023 11:10
Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the comments from Wojtek got addressed
/hold cancel
/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 8, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: atiratree, soltysh, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 74bd630 into kubernetes:master Feb 8, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.27 milestone Feb 8, 2023
@wojtek-t
Copy link
Member

wojtek-t commented Feb 8, 2023

For posterity - this LGTM - thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants