Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split Volume Populators KEP #2552

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions keps/prod-readiness/sig-storage/2546.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
kep-number: 2546
stable:
approver: "@deads2k"
4 changes: 4 additions & 0 deletions keps/sig-storage/1495-volume-populators/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,8 @@ Check these off as they are completed for the Release Team to track. These check

## Summary

[Reject Invalid PVC DataSource KEP]: ../2546-reject-invalid-pvc-datasource/README.md

In Kubernetes 1.12, we added the `DataSource` field to the PVC spec. The field was implemented
as a `TypedLocalObjectReference` to give flexibility in the future about what objects could be
data sources for new volumes.
Expand All @@ -86,6 +88,8 @@ perform the validation on those data sources and provide feedback to users. We w
a new CRD to register valid datasource kinds, and individual volume populators will
create a CR for each kind of data source that it understands.

See [Reject Invalid PVC DataSource KEP] for changes to `DataSource` field handling.

## Motivation

### Goals
Expand Down
286 changes: 286 additions & 0 deletions keps/sig-storage/2546-reject-invalid-pvc-datasource/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,286 @@
# Reject Invalid PVC DataSource

## Table of Contents

<!-- toc -->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be missing PRR...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix this.

- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Test Plan](#test-plan)
- [Alternatives](#alternatives)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
- [Monitoring Requirements](#monitoring-requirements)
- [Dependencies](#dependencies)
- [Scalability](#scalability)
- [Troubleshooting](#troubleshooting)
<!-- /toc -->

## Release Signoff Checklist

**ACTION REQUIRED:** In order to merge code into a release, there must be an issue in [kubernetes/enhancements] referencing this KEP and targeting a release milestone **before [Enhancement Freeze](https://github.com/kubernetes/sig-release/tree/master/releases)
of the targeted release**.

For enhancements that make changes to code or processes/procedures in core Kubernetes i.e., [kubernetes/kubernetes], we require the following Release Signoff checklist to be completed.

Check these off as they are completed for the Release Team to track. These checklist items _must_ be updated for the enhancement to be released.

- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- [ ] KEP approvers have set the KEP status to `implementable`
- [ ] Design details are appropriately documented
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] Graduation criteria is in place
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

**Note:** Any PRs to move a KEP to `implementable` or significant changes once it is marked `implementable` should be approved by each of the KEP approvers. If any of those approvers is no longer appropriate than changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross cutting KEPs).

**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://github.com/kubernetes/enhancements/issues
[kubernetes/kubernetes]: https://github.com/kubernetes/kubernetes
[kubernetes/website]: https://github.com/kubernetes/website

## Summary

[Volume Populators KEP]: ../1495-volume-populators/README.md

While developing the [Volume Populators KEP] it became obvious that the way the admission
controller handles the `DataSource` field of PVC specs is contrary to everyone's intentions.

The existing behavior, which has been in place since 1.12, is to clear the field if the user
supplies a value that is not one of the supported values. This results in Kubernetes behaving
as if the user requested an empty volume, and indeed the user is given an empty volume in
that case (absent any other errors).

Developers never intended the `DataSource` field to work this way, and it's clearly not good
practice to ignore user input. Most likely this was an accidental behavior resulting from the
need to support disabling feature gates for specific kinds of data sources.

Users cannot possibly depend on this behavior, because for every invalid request that we are
currently incorrectly treating as valid, there is an actual valid request that results in the
exact same result -- specifically a PVC with an empty data source.

## Motivation

### Goals

- Fix the buggy admission controller behavior
- Generate a clearly visible release note for users in the unlikely event that anyone is
affected.

### Non-Goals

- This change makes sense regardless of the future of the Volume Populators KEP. As such
is it not a goal of this KEP to advance the Volume Populators feature, although both
KEPs together to make good sense.

## Proposal

We propose to correct this bad behavior by making the admission controller reject invalid
data sources instead of silently changing them. This would cause the admission controller
to reject PVCs that specify core object other than PVCs as data sources when the
AnyVolumeDataSource feature gate is enabled, and to reject PVCs that specify anything
other than the two supported data sources (PVCs and VolumeSnapshots) when the feature gate
is disabled.

No existing workflows would be affected by this change, unless they are accidentally
specifying a data source when they wanted an empty volume. Correcting those workflows would
be trivial -- users would simply need to remove the data source if they wanted an empty
volume. Valid data sources would be completely unaffected.

### Risks and Mitigations

The main risk is if a user had some preconfigured workflow that involved creation of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, the new failure mode is:

before:

  • Does not have alpha AnyVolumeDataSource enabled
  • Creates PVC with invalid source
  • Source gets wiped
  • Success

after this change:

  • Does not have alpha AnyVolumeDataSource enabled
  • Creates PVC with invalid source
  • Error

So it is, strictly speaking, a very breaking change. I agree that is seems INCREDIBLY unlikely that anyone is depending on this, and yet it makes me anxious.

@liggitt is this an appropriate use for API warnings? Like, should we issue warnings for a release or two?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your characterization is completely accurate. The only thing I'll add is that in the extremely unlike nightmare scenario you describe, the fix for an end-user is mind-blowingly simple -- just stop sending illegal data sources when creating PVCs. It's even harder to imagine the corner-case of a corner-case where someone both stumbles across this behavior change AND it takes then more than 10 seconds to work around the problem by updating their YAML.

PVCs with the `DataSource` field set to an invalid value. The workflow would be yielding
empty volumes today, and if that was acceptable (most likely because the user forgot
they ever put contents in the `DataSource` field) then it could go on unnoticed. Fixing
this bug will cause that user's workflow to suddenly break.

The workaround for the user would be trivial, they would just need to clear the
Copy link
Member

@liggitt liggitt May 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on where the request is coming from, it might be:

  • trivial (manifest under the user's control)
  • possible (programmatic client that needs to be updated)
  • difficult (programmatic client not under the user's control that needs to be updated)
  • impossible (already-persisted statefulset with a spec.volumeClaimTemplates with an invalid data source)

What is the recourse for users on the more difficult end of the scale? How much time do we give them to move? Do we need to make changes to things like statefulset update validation to allow fixing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even without this change, when the AnyVolumeDataSource feature goes beta, they will have this problem if they're using non-core objects (other than snapshots) as datasources. Those PVCs will be allowed to create but they will no longer bind. Those users would have to switch the feature gate off to get out of trouble if they want to continue to get empty volumes despite the non-empty data source field.

This PR extends the net to also catch PVCs with core object data sources, and to outright reject those because there isn't a future where they will become valid (here we presume that a failure to create a PVC is actually a more friendly error than a successfully created PVC that never binds). The only alternative I can see is perhaps to wrap up this behavior behind the same feature gate, but then we can't actually remove the DropDisabledFields() logic yet.

`DataSource` field in their requests to obtain the old behavior. One can imagine
situations where changing the workflow is non-trivial, but it's not unheard of for users
to have to take actions upon Kubernetes upgrades so that we can correct bad decisions
from our past.

The real question is the liklihood of users making the particular mistake outlined above.
Given that the only supported contents of that field was other PVCs and VolumeSnapshots,
and both of those types of data sources are GA today, it seems impossible that users
who wanted to clone a PVC or a VolumeSnapshot would be doing so incorrectly, because the
fact that they were receiving an empty volume would make them fix their workflow. It's
Comment on lines +116 to +118
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the clone was intended to prime data as an optimization, and the app handled an empty volume gracefully by doing work from scratch, it seems very possible for this to be overlooked accidentally, and still be functional.

almost impossible to imagine a case where a user put something else in that field and
expected it to work, except for the handful of developers working on the volume populators
feature. The only remaining possibility is users who were intentionally trying out
something they expected not to work, and were surprised that it did work, and for some
reason continued to use the malformed workflow. This seems surpassinly unlikely.

## Design Details

### Test Plan

* Add test case that creates a PVC with a `DataSource` pointing to a core object and expect
creation to fail.
* Add test case that creates a PVC with a `DataSource` pointing to a definitely-invalid
object and expect creation to fail. (This test case becomes obsolete after volume populator
modifies this.)

## Alternatives

The only viable alternative that I can see is to forever enshrine the existing
behavior and accept that it's gross.

## Production Readiness Review Questionnaire

<!--

Production readiness reviews are intended to ensure that features merging into
Kubernetes are observable, scalable and supportable; can be safely operated in
production environments, and can be disabled or rolled back in the event they
cause increased failures in production. See more in the PRR KEP at
https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.

The production readiness review questionnaire must be completed and approved
for the KEP to move to `implementable` status and be included in the release.

In some cases, the questions below should also have answers in `kep.yaml`. This
is to enable automation to verify the presence of the review, and to reduce review
burden and latency.

The KEP must have a approver from the
[`prod-readiness-approvers`](http://git.k8s.io/enhancements/OWNERS_ALIASES)
team. Please reach out on the
[#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
you need any help or guidance.
-->

### Feature Enablement and Rollback

###### How can this feature be enabled / disabled in a live cluster?

- [ ] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name:
- Components depending on the feature gate:
- [X] Other
- Describe the mechanism: Cannot be disabled
- Will enabling / disabling the feature require downtime of the control
plane?
- Will enabling / disabling the feature require downtime or reprovisioning
of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).

###### Does enabling the feature change any default behavior?

Use of invalid datasources, which would previously have been ignored and yielded
an empty volume, will now properly result in an error.

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

No

###### What happens if we reenable the feature if it was previously rolled back?

n/a

###### Are there any tests for feature enablement/disablement?

No, the change will always be on.

### Rollout, Upgrade and Rollback Planning

###### How can a rollout fail? Can it impact already running workloads?

As this is effectively a bug fix, the rollout cannot fail. Also, because the change
alters PVC creation, already-running workloads won't be affected.

###### What specific metrics should inform a rollback?

Rollback is not possible.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

n/a

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No

### Monitoring Requirements

###### How can an operator determine if the feature is in use by workloads?

This is a bug fix, so not applicable.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [ ] Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- [X] Other (treat as last resort)
- Details: It's not a service, and doesn't have a health status per se. One could track
failed PVC creations to notice if users were relying on the previously buggy behavior.

###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

The expectation is that nobody would rely on this particular bug, because if they wanted
an empty volume, it's much easier to leave the data source field empty. So we don't
anticipate the error case ever occurring.

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

Failed PVC creations.

### Dependencies

###### Does this feature depend on any specific services running in the cluster?

No, it's a change to kube-api-server.

### Scalability

###### Will enabling / using this feature result in any new API calls?

No

###### Will enabling / using this feature result in introducing new API types?

No

###### Will enabling / using this feature result in any new calls to the cloud provider?

No

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

No

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No

### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?

No impact, because the change is in API server itself.

###### What are other known failure modes?

The only way this change can "fail" is if a user was relying on the old buggy behavior.
The best remediation in that case is for the user to stop filling in the `DataSource`
field of the PVC with an invalid value and leave it empty isntead.

###### What steps should be taken if SLOs are not being met to determine the problem?

Fix the user's PVC definitions.
21 changes: 21 additions & 0 deletions keps/sig-storage/2546-reject-invalid-pvc-datasource/kep.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
title: Reject Invalid PVC DataSource
kep-number: 2546
authors:
- "@bswartz"
owning-sig: sig-storage
participating-sigs:
- sig-api-machinery
status: implementable
creation-date: 2021-03-02
last-updated: 2021-05-11
reviewers:
- "@thockin"
- "@saad-ali"
approvers:
- "@thockin"
- "@saad-ali"
stage: stable
latest-milestone: "v1.22"
milestone:
stable: "v1.22"
disable-supported: false