Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Versioned validation of referenced Pipelines/Tasks #6616

Closed
lbernick opened this issue May 3, 2023 · 16 comments · Fixed by #6701
Closed

Versioned validation of referenced Pipelines/Tasks #6616

lbernick opened this issue May 3, 2023 · 16 comments · Fixed by #6701
Assignees
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@lbernick
Copy link
Member

lbernick commented May 3, 2023

We currently allow users to mix API versions of TaskRuns and the Tasks they reference, and likewise for PipelineRuns and the Pipelines they reference. We convert any referenced Tasks and Pipelines into the storage version of the API.

Later, after setting defaults and performing other operations, we validate the referenced object's spec.

For Pipelines, we do this by calling pipelineSpec.Validate, which validates the referenced Pipeline as if its API version is the storage version.

For Tasks, we do not call taskSpec.Validate; instead, we have customized validation functions that aren't defined in pkg/apis; e.g. ValidateResolvedTask, validateTaskSpecRequestResources, and a few others. The only validation function used from the apis package is taskSpec.ValidateParamArrayIndex, which again, validates the referenced Task as if its API version is the storage version. This is the cause of #6607.

This presents a problem when swapping the storage version of the API to v1. If a user previously defined a v1beta1 referenced Task or Pipeline, it gets converted to v1 and validated as if it were a v1 object. However, if the referenced object used beta features, and "enable-api-fields" is set to "stable", validation fails. This means that swapping the storage version to v1 would cause PipelineRuns and TaskRuns that previously worked to fail. Example logs from #6608.

We should perform validation before we convert the object into the storage version.

Other options:

  • set "enable-api-fields" to "beta" by default. This addresses the specific problem for users who accept the default behavior but doesn't address this problem for users who set "enable-api-fields" to "stable". (I'm not sure why someone using v1beta1 would actively choose to set "stable" if the default is "beta" but it still seems like a bug.) This is the easiest option.
  • don't convert referenced tasks/pipelines to the storage version: this would involve significantly reworking the codebase
@dibyom
Copy link
Member

dibyom commented May 3, 2023

We should perform validation before we convert the object into the storage version.

I'm not sure if this is always feasible - we validate results/params with values once they have been resolved

@lbernick
Copy link
Member Author

lbernick commented May 3, 2023

We should perform validation before we convert the object into the storage version.

I'm not sure if this is always feasible - we validate results/params with values once they have been resolved

I think this should be fine; I should clarify that I don't think we need to do all validation before converting; only validation of the remote Task/Pipeline spec.

@lbernick
Copy link
Member Author

lbernick commented May 3, 2023

I think a bigger problem might be trusted resources: we can only validate trusted resources before defaults are set, since the task/pipeline signature is computed before defaults are set. So the order of ops ends up needing to be 1. verification 2. set defaults 3. validate 4. convert to storage api version.
FYI @Yongxuanzhang

@lbernick
Copy link
Member Author

@dibyom on second thought you might be right that we need to address propagated param logic first. I've created #6647 to track this.

@lbernick
Copy link
Member Author

lbernick commented May 16, 2023

Update: I realized a new issue with local PipelineRefs/TaskRefs: A user can create a perfectly valid v1beta1 Pipeline/Task with enable-api-fields = stable, using beta features. This gets converted into the storage version of the API, and we don't know what API version it was originally created with. When swapping to storage version v1, this can break existing local PipelineRefs/TaskRefs.

There are multiple ways to address this problem, each with tradeoffs.

  1. The approach taken in Validate specs of remote Pipelines before conversion #6645: Perform validation of remote pipelines before conversion to the storage version. Skip validation of local Pipelines, since they were validated on creation by the webhook.
  • Pros: can swap storage version without breaking any Pipelines
  • Cons: May end up reconciling invalid Pipelines for local PipelineRefs. Treats local Pipelines differently from remote Pipelines.
  1. Perform validation of remote pipelines before conversion to the storage version. Perform validation of local Pipelines using the storage version of the API.
  • Pros: fail fast. Do not reconcile invalid Pipelines
  • Cons: can break v1beta1 local pipelines using beta features with "enable-api-fields" set to stable when we swap the storage version of the API. Local Pipelines treated differently than remote pipelines.
  1. an idea from @wlynch (thank you!): Add an annotation to conversion logic signifying the value of "enable-api-fields" to use, based on the original version of the local Pipeline. If a Pipeline has this annotation set, use this value when validating beta features.
  • Pros: fail fast; do not reconcile invalid Pipelines. Can swap storage version without breaking any Pipelines.
  • Cons: enable-api-fields is set to whatever version was used when the Pipeline was first created. Changes by the cluster operator have no effect for existing Pipelines.
  1. Separate enable-api-fields validation from the rest of our validation. Apply it only when a Pipeline is created or a remote pipeline is fetched, not when a local Pipeline is fetched.
  • Pros: fail fast; do not reconcile invalid Pipelines. Can swap storage version without breaking any Pipelines. May be easier to understand than solution 3.
  • Cons: enable-api-fields is effectively set to whatever version was used when the Pipeline was first created. Changes by the cluster operator have no effect for existing Pipelines. Requires significant refactoring.
  1. Use per-feature flags (Per Feature Feature Flags #5632). We plan on doing this anyway, but not before swapping the storage version to v1.
  • Pros: Addresses the problem without the drawbacks associated with the other solutions. We can decide whether per-feature flags should gate implementation as well as existence of fields.
  • Cons: Pushes v1 timeline back significantly (this still needs design work). Need to make these changes for every beta feature to unblock v1.
  1. Set enable-api-fields to "beta" by default (Decoupling API versioning and Feature versioning for features turned on by default #6592). We plan on doing this anyway.
  • Pros: super easy, reduces unexpected surprises when users swap their CRDs to v1
  • Cons: still breaks v1beta1 pipelines when we swap the storage version to v1 for users who have chosen to set "enable-api-fields" to "stable"

Would love to get some opinions on people's preferred approaches!

@lbernick
Copy link
Member Author

Chatted with @dibyom and @JeromeJu. We believe option 6 is the best, and hope to mitigate the impact with clear release notes.

Options 1 and 2 treat local Pipelines differently from remote pipelines. I also don't think we should skip validation for local Pipelines, as in option 1.

I think options 3 and 4 will be very difficult to explain to cluster operators. It'll be hard to understand what value of "enable-api-fields" is being applied and what happens when you change the cluster settings.

We should still pursue and prioritize option 5. It's worth noting our only beta features right now are resolution, object params/results, and array results/array indexing. I think we can get 3 per feature flags done reasonably quickly.

Still would appreciate more feedback!

@wlynch
Copy link
Member

wlynch commented May 19, 2023

I’m a bit skeptical of 6 - moving users from stable to beta feature flags by default feels like it may end up causing more problems. 😅 I don't think we want to make stable opt-in, and as you pointed out if you set the stable flag you’re still stuck not able to migrate. I think we should generally prioritize compatibility of stable over beta, though I understand this is a bit of a special case since it sounds like most (all?) of our usage for v1beta1 is using beta today.

For existing v1beta1 resources already under this behavior it seems fine and needed to ensure compatibility, but I’m not sure it makes sense to do for all new resources as well.
The idea for 3 was to carry over the feature flag values during conversion so that they only apply to existing resources that may have relied on the behavior. A local annotation overriding a global default is a common enough pattern that I'm not particularly worried about being confusing, and also gives control to authors to opt-out of the behavior as well. If cluster-wide control is a strict requirement, perhaps there can be an additional feature flag to disable the behavior? 🤷

@dibyom
Copy link
Member

dibyom commented May 19, 2023

How does 3. work for existing resources that have already been created?

@lbernick
Copy link
Member Author

@wlynch I just want to clarify that we plan to move off enable-api-fields in favor of per-feature flags to avoid these problems-- it's just not clear how long this will take.

I’m a bit skeptical of 6 - moving users from stable to beta feature flags by default feels like it may end up causing more problems.

There's some discussion of this on #6592. The general consensus seems to be that setting beta flags by default for right now will help address confusion with migration to v1, and that the best solution is to move to per-feature flags as soon as we can. Can you elaborate on the problems you're anticipating?

as you pointed out if you set the stable flag you’re still stuck not able to migrate

This problem doesn't affect users who use enable-api-fields=stable and are using only stable features; i.e. they will not experience problems upgrading or when we swap to v1 as our storage version. This problem only affects users who are setting "enable-api-fields" to stable, but also using beta features in beta CRDs. If they are already using beta features, it's likely that they'd want to continue doing so, so we think setting the default value to beta will help mitigate the impact on these users.

I think we should generally prioritize compatibility of stable over beta, though I understand this is a bit of a special case since it sounds like most (all?) of our usage for v1beta1 is using beta today.

I'm not sure exactly what you mean here. Is the concern that we are swapping from disabling beta features by default in GA CRDs to enabling them by default? I anticipate that most people are using beta CRDs with enable-api-fields set to stable.

A local annotation overriding a global default is a common enough pattern that I'm not particularly worried about being confusing, and also gives control to authors to opt-out of the behavior as well.

I disagree for a few reasons:

  • Instead of the user setting an annotation for a certain resource where they'd like to override defaults, we'd be automatically applying this annotation to all CRDs. If a cluster operator wanted to change the value of enable-api-fields, they'd also have to update all existing CRDs.
  • Currently, the cluster operator is the person who controls whether beta features are allowed. This gives pipeline and task authors the ability to override the cluster operator's decisions.
  • The way we'd actually perform this validation at runtime is confusing. For example, if a CRD was originally created as beta with "beta" fields, and a cluster operator has since changed the default to "stable" fields, we'd want to allow this. But if it was originally created as stable with "beta" fields, and the cluster operator has since changed the default to "stable" fields, that wouldn't be allowed. enable-api-fields has already proved to be quite confusing, and I think trying to explain this to our users will be hard.

If I'm misunderstanding your idea for part 3 please correct me!

@wlynch
Copy link
Member

wlynch commented May 19, 2023

How does 3. work for existing resources that have already been created?

The thought was only apply it on conversion from v1beta1 -> v1. (I may be misunderstanding how CRD conversion works though, so lmk if this isn't feasible).

Is the concern that we are swapping from disabling beta features by default in GA CRDs to enabling them by default?

Yes. More specifically:

  1. opting people into a beta track which users might not realize/know the affects of or may not want (I operate under the assumption that most users don't read the release notes 😅).
  2. breaking people who thought they were using stable but were actually using beta without realizing - even if they were using beta, we have compatibility for beta features as well.

I think what I'd like to see is if we can scope the change down to only the resources it affects. 🤔

Currently, the cluster operator is the person who controls whether beta features are allowed. This gives pipeline and task authors the ability to override the cluster operator's decisions.

👍 This makes sense, so SGTM to not proceed with this option as-is.

@JeromeJu
Copy link
Member

breaking people who thought they were using stable but were actually using beta without realizing - even if they were using beta, we have compatibility for beta features as well.

I would like to ➕ on the aim of users not to be affected nor realized for the version swap and also clarify for option #6, that the goal was also to not to have AIs for end users. Since we are now defaulting stable in v1beta1 as stored version at

enable-api-fields: "stable"
, we are assuming that moving to beta in v1 after the migration should not lead to any breaking changes as default.

For beta features ie. resolvers/resolutions, as they are defaulted to be apparent in v1beta1 as stable, we have planned to move forward for defaulting the enable-api-fields to beta in WG to not to require users to patch feature flags to have features opt-in as default before the migration.

@lbernick
Copy link
Member Author

I brought up one more idea to address this in today's API WG, which I think is the best option found so far. It is similar to option 1, but separates validation for beta features from the rest of pipeline/task spec validation. This is implemented in #6701.

@afrittoli
Copy link
Member

It seems to me that the problem comes from implicitly allowing beta features for v1beta1 resources.
Since the version stored in only one through, that does not work.
The current state might be a consequence of the fact that we did not have a stable API for a long time and we didn't have stable value for feature flags either, which meant that when a feature reached beta it became available by default.

I think we should stop doing that and only rely on the config map setting. If the config map is set to stable, beta features won't be available, it doesn't matter the version with which a resource was originally defined.
After all, even if the naming is similar, the version of an API is a different thing from the version of a feature.

The implication of this may affect users. Once we start storing V1, users that want to continue to use beta features will have to set the config map to beta.
We can make that very visible in the release notes.

@lbernick
Copy link
Member Author

@afrittoli just to make sure I understand correctly, it sounds like you're suggesting updating how we validate enable-api-fields, so that beta features cannot be used in beta CRDs when "enable-api-fields" is set to "stable"? That would solve this problem although I'm concerned about the breaking changes.

There was some interest in option #3 from #6616 (comment) in API WG this week. I've opened #6716 as a draft PR demonstrating what this could look like. However, I still strongly believe that #6701 is a better approach, for reasons detailed in the PR description of #6716.

@afrittoli
Copy link
Member

@afrittoli just to make sure I understand correctly, it sounds like you're suggesting updating how we validate enable-api-fields, so that beta features cannot be used in beta CRDs when "enable-api-fields" is set to "stable"? That would solve this problem although I'm concerned about the breaking changes.

I think this is the most sustainable option moving forward.
If we start working on a v2alpha1 in future, would we want alpha features to be enabled there by default?
A v2alpha1 would be stored as v1 anyways, so we would have the same problem again unless we change the design of the enable-api-fields flag.

Perhaps we could combine it with some strategy to avoid breaking existing users, e.g.:

  • deprecate the current behaviour of the flag, and ask users to start setting the config to beta if they want to use beta features, regardless of the API version we are using
  • change the conversion logic to attach an annotation to the resource with the original API version
  • alter the validation logic to allow beta feature when the annotation is set to v1beta1 (only), regardless of the value of enable-api-fields
  • we can keep this logic around for 6 months, or as long as v1beta1 exists, but at least we won't run into this issue for future API versions

There was some interest in option #3 from #6616 (comment) in API WG this week. I've opened #6716 as a draft PR demonstrating what this could look like. However, I still strongly believe that #6701 is a better approach, for reasons detailed in the PR description of #6716.

About #6701, validation of beta features for remote resources is something we need for sure, it sounds like a bug if we are not doing that today - perhaps that could be a PR on its own.
Apart from that, it's not 100% clear to me why that works - I understand that on creation the original version is still there, but on update wouldn't we get the stored version already? ...unless the convertion webhook gives us back the original version already?

If #6701 works, I would still consider decoupling the API version from the feature version, and we could do that for future versions only, like I proposed above, to avoid breaking users.

@lbernick
Copy link
Member Author

@afrittoli thanks for your response!

@afrittoli just to make sure I understand correctly, it sounds like you're suggesting updating how we validate enable-api-fields, so that beta features cannot be used in beta CRDs when "enable-api-fields" is set to "stable"? That would solve this problem although I'm concerned about the breaking changes.

I think this is the most sustainable option moving forward. If we start working on a v2alpha1 in future, would we want alpha features to be enabled there by default? A v2alpha1 would be stored as v1 anyways, so we would have the same problem again unless we change the design of the enable-api-fields flag.

There's some discussion about this in #6592. There seems to be consensus that our current approach with enable-api-fields is confusing/causing problems, and we should decouple feature versioning from api versioning, as you're suggesting. I'm hoping to keep the discussion of how we do that contained to that issue, but the approach that has the most buy-in at the moment is 1. short term, set "enable-api-fields" to "beta" by default, and 2. move to per-feature flags.

#6701 is just looking for a short term way to make progress on our v1 swap without creating breaking changes. I absolutely want to move to per-feature flags or another solution that decouples api versioning from feature versioning-- I just don't want that to block our v1 storage swap, and I don't think it should have to, since the storage swap shouldn't include user-facing changes.

Perhaps we could combine it with some strategy to avoid breaking existing users, e.g.:

  • deprecate the current behaviour of the flag, and ask users to start setting the config to beta if they want to use beta features, regardless of the API version we are using
  • change the conversion logic to attach an annotation to the resource with the original API version
  • alter the validation logic to allow beta feature when the annotation is set to v1beta1 (only), regardless of the value of enable-api-fields
  • we can keep this logic around for 6 months, or as long as v1beta1 exists, but at least we won't run into this issue for future API versions

It might be better to discuss this type of strategy on #6592. #6701 aims to address the issue of referenced tasks/pipelines when swapping to v1 without creating any user facing breakages, but #6592 is more geared towards the goal of decoupling feature versioning from API versioning which it seems there is consensus on.

About #6701, validation of beta features for remote resources is something we need for sure, it sounds like a bug if we are not doing that today - perhaps that could be a PR on its own.

👍 -- PTAL at #6719, which is a refactoring PR that's the first step here.

Apart from that, it's not 100% clear to me why that works - I understand that on creation the original version is still there, but on update wouldn't we get the stored version already? ...unless the convertion webhook gives us back the original version already?

I'm not sure I understand your question exactly. The problem is that in the reconciler, we're calling TaskSpec.Validate after converting to the storage version, but once our storage version is v1, existing definitions of beta resources will fail this call to TaskSpec.Validate. #6701 moves beta feature validation out of TaskSpec.Validate, and applies it only when a v1 resource is created on the cluster or referenced as a remote pipeline, not after conversion.

If #6701 works, I would still consider decoupling the API version from the feature version, and we could do that for future versions only, like I proposed above, to avoid breaking users.

100% agree!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

6 participants