From d188a6481ef45cb08e3a788fc56f8ffc38dbad38 Mon Sep 17 00:00:00 2001 From: Ben Swartzlander Date: Wed, 3 Mar 2021 15:05:48 -0500 Subject: [PATCH] Split Volume Populators KEP Create a new KEP to address to long-standing bug in the PVC admission controller. This KEP is related to the Volume Populators KEP but either KEP can stand alone. The primary reasons to have a separate KEP are to allow discussion of the specific bugfix, and to ensure that any fix clearly communicated in the release notes. --- keps/prod-readiness/sig-storage/2546.yaml | 3 + .../1495-volume-populators/README.md | 4 + .../README.md | 577 ++++++++++++++++++ .../kep.yaml | 21 + 4 files changed, 605 insertions(+) create mode 100644 keps/prod-readiness/sig-storage/2546.yaml create mode 100644 keps/sig-storage/2546-reject-invalid-pvc-datasource/README.md create mode 100644 keps/sig-storage/2546-reject-invalid-pvc-datasource/kep.yaml diff --git a/keps/prod-readiness/sig-storage/2546.yaml b/keps/prod-readiness/sig-storage/2546.yaml new file mode 100644 index 000000000000..a859749d17c6 --- /dev/null +++ b/keps/prod-readiness/sig-storage/2546.yaml @@ -0,0 +1,3 @@ +kep-number: 2546 +beta: + approver: "@deads2k" diff --git a/keps/sig-storage/1495-volume-populators/README.md b/keps/sig-storage/1495-volume-populators/README.md index dbf5a9691bea..23d797166f9a 100644 --- a/keps/sig-storage/1495-volume-populators/README.md +++ b/keps/sig-storage/1495-volume-populators/README.md @@ -62,6 +62,8 @@ Check these off as they are completed for the Release Team to track. These check ## Summary +[Reject Invalid PVC DataSource KEP]: ../2546-reject-invalid-pvc-datasource/README.md + In Kubernetes 1.12, we added the `DataSource` field to the PVC spec. The field was implemented as a `TypedLocalObjectReference` to give flexibility in the future about what objects could be data sources for new volumes. @@ -86,6 +88,8 @@ perform the validation on those data sources and provide feedback to users. We w a new CRD to register valid datasource kinds, and individual volume populators will create a CR for each kind of data source that it understands. +See [Reject Invalid PVC DataSource KEP] for changes to `DataSource` field handling. + ## Motivation ### Goals diff --git a/keps/sig-storage/2546-reject-invalid-pvc-datasource/README.md b/keps/sig-storage/2546-reject-invalid-pvc-datasource/README.md new file mode 100644 index 000000000000..1511bf12ef92 --- /dev/null +++ b/keps/sig-storage/2546-reject-invalid-pvc-datasource/README.md @@ -0,0 +1,577 @@ +# Reject Invalid PVC DataSource + +## Table of Contents + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Risks and Mitigations](#risks-and-mitigations-1) +- [Design Details](#design-details-1) + - [Test Plan](#test-plan-1) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + + +## Release Signoff Checklist + +**ACTION REQUIRED:** In order to merge code into a release, there must be an issue in [kubernetes/enhancements] referencing this KEP and targeting a release milestone **before [Enhancement Freeze](https://github.com/kubernetes/sig-release/tree/master/releases) +of the targeted release**. + +For enhancements that make changes to code or processes/procedures in core Kubernetes i.e., [kubernetes/kubernetes], we require the following Release Signoff checklist to be completed. + +Check these off as they are completed for the Release Team to track. These checklist items _must_ be updated for the enhancement to be released. + +- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR) +- [ ] KEP approvers have set the KEP status to `implementable` +- [ ] Design details are appropriately documented +- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input +- [ ] Graduation criteria is in place +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + +**Note:** Any PRs to move a KEP to `implementable` or significant changes once it is marked `implementable` should be approved by each of the KEP approvers. If any of those approvers is no longer appropriate than changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross cutting KEPs). + +**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone. + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://github.com/kubernetes/enhancements/issues +[kubernetes/kubernetes]: https://github.com/kubernetes/kubernetes +[kubernetes/website]: https://github.com/kubernetes/website + +## Summary + +[Volume Populators KEP]: ../1495-volume-populators/README.md + +While developing the [Volume Populators KEP] it became obvious that the way the admission +controller handles the `DataSource` field of PVC specs is contrary to everyone's intentions. + +The existing behavior, which has been in place since 1.12, is to clear the field if the user +supplies a value that is not one of the supported values. This results in Kubernetes behaving +as if the user requested an empty volume, and indeed the user is given an empty volume in +that case (absent any other errors). + +Developers never intended the `DataSource` field to work this way, and it's clearly not good +practice to ignore user input. Most likely this was an accidental behavior resulting from the +need to support disabling feature gates for specific kinds of data sources. + +Users cannot possibly depend on this behavior, because for every invalid request that we are +currently incorrectly treating as valid, there is an actual valid request that results in the +exact same result -- specifically a PVC with an empty data source. + +## Motivation + +### Goals + +- Fix the buggy admission controller behavior +- Generate a clearly visible release note for users in the unlikely event that anyone is + affected. + +### Non-Goals + +- This change makes sense regardless of the future of the Volume Populators KEP. As such + is it not a goal of this KEP to advance the Volume Populators feature, although both + KEPs together to make good sense. + +## Proposal + +We propose to correct this bad behavior by making the admission controller reject invalid +data sources instead of silently changing them. This would cause the admission controller +to reject PVCs that specify core object other than PVCs as data sources when the +AnyVolumeDataSource feature gate is enabled, and to reject PVCs that specify anything +other than the two supported data sources (PVCs and VolumeSnapshots) when the feature gate +is disabled. + +No existing workflows would be affected by this change, unless they are accidentally +specifying a data source when they wanted an empty volume. Correcting those workflows would +be trivial -- users would simply need to remove the data source if they wanted an empty +volume. Valid data sources would be completely unaffected. + +### Risks and Mitigations + +The main risk is if a user had some preconfigured workflow that involved creation of +PVCs with the `DataSource` field set to an invalid value. The workflow would be yielding +empty volumes today, and if that was acceptable (most likely because the user forgot +they ever put contents in the `DataSource` field) then it could go on unnoticed. Fixing +this bug will cause that user's workflow to suddenly break. + +The workaround for the user would be trivial, they would just need to clear the +`DataSource` field in their requests to obtain the old behavior. One can imagine +situations where changing the workflow is non-trivial, but it's not unheard of for users +to have to take actions upon Kubernetes upgrades so that we can correct bad decisions +from our past. + +The real question is the liklihood of users making the particular mistake outlined above. +Given that the only supported contents of that field was other PVCs and VolumeSnapshots, +and both of those types of data sources are GA today, it seems impossible that users +who wanted to clone a PVC or a VolumeSnapshot would be doing so incorrectly, because the +fact that they were receiving an empty volume would make them fix their workflow. It's +almost impossible to imagine a case where a user put something else in that field and +expected it to work, except for the handful of developers working on the volume populators +feature. The only remaining possibility is users who were intentionally trying out +something they expected not to work, and were surprised that it did work, and for some +reason continued to use the malformed workflow. This seems surpassinly unlikely. + +## Design Details + +### Test Plan + +* Add test case that creates a PVC with a `DataSource` pointing to a core object and expect +creation to fail. +* Add test case that creates a PVC with a `DataSource` pointing to a definitely-invalid +object and expect creation to fail. (This test case becomes obsolete after volume populator +modifies this.) + + +### Risks and Mitigations + + + +## Design Details + + + +### Test Plan + + + +### Graduation Criteria + + + +### Upgrade / Downgrade Strategy + + + +### Version Skew Strategy + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [ ] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: + - Components depending on the feature gate: +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled). + +###### Does enabling the feature change any default behavior? + + + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +###### What happens if we reenable the feature if it was previously rolled back? + +###### Are there any tests for feature enablement/disablement? + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs? + + + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +###### Will enabling / using this feature result in introducing new API types? + + + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + + +## Alternatives + +The only viable alternative that I can see is to forever enshrine the existing +behavior and accept that it's gross. \ No newline at end of file diff --git a/keps/sig-storage/2546-reject-invalid-pvc-datasource/kep.yaml b/keps/sig-storage/2546-reject-invalid-pvc-datasource/kep.yaml new file mode 100644 index 000000000000..6c5db5f49c8f --- /dev/null +++ b/keps/sig-storage/2546-reject-invalid-pvc-datasource/kep.yaml @@ -0,0 +1,21 @@ +title: Reject Invalid PVC DataSource +kep-number: 2546 +authors: + - "@bswartz" +owning-sig: sig-storage +participating-sigs: + - sig-api-machinery +status: implementable +creation-date: 2021-03-02 +last-updated: 2021-03-02 +reviewers: + - "@thockin" + - "@saad-ali" +approvers: + - "@thockin" + - "@saad-ali" +stage: stable +latest-milestone: "v1.21" +milestone: + stable: "v1.21" +disable-supported: false