Versioned validation of referenced Pipelines/Tasks #6616

lbernick · 2023-05-03T14:40:56Z

We currently allow users to mix API versions of TaskRuns and the Tasks they reference, and likewise for PipelineRuns and the Pipelines they reference. We convert any referenced Tasks and Pipelines into the storage version of the API.

Later, after setting defaults and performing other operations, we validate the referenced object's spec.

For Pipelines, we do this by calling pipelineSpec.Validate, which validates the referenced Pipeline as if its API version is the storage version.

For Tasks, we do not call taskSpec.Validate; instead, we have customized validation functions that aren't defined in pkg/apis; e.g. ValidateResolvedTask, validateTaskSpecRequestResources, and a few others. The only validation function used from the apis package is taskSpec.ValidateParamArrayIndex, which again, validates the referenced Task as if its API version is the storage version. This is the cause of #6607.

This presents a problem when swapping the storage version of the API to v1. If a user previously defined a v1beta1 referenced Task or Pipeline, it gets converted to v1 and validated as if it were a v1 object. However, if the referenced object used beta features, and "enable-api-fields" is set to "stable", validation fails. This means that swapping the storage version to v1 would cause PipelineRuns and TaskRuns that previously worked to fail. Example logs from #6608.

We should perform validation before we convert the object into the storage version.

Other options:

set "enable-api-fields" to "beta" by default. This addresses the specific problem for users who accept the default behavior but doesn't address this problem for users who set "enable-api-fields" to "stable". (I'm not sure why someone using v1beta1 would actively choose to set "stable" if the default is "beta" but it still seems like a bug.) This is the easiest option.
don't convert referenced tasks/pipelines to the storage version: this would involve significantly reworking the codebase

The text was updated successfully, but these errors were encountered:

dibyom · 2023-05-03T15:21:46Z

We should perform validation before we convert the object into the storage version.

I'm not sure if this is always feasible - we validate results/params with values once they have been resolved

lbernick · 2023-05-03T19:38:59Z

We should perform validation before we convert the object into the storage version.

I'm not sure if this is always feasible - we validate results/params with values once they have been resolved

I think this should be fine; I should clarify that I don't think we need to do all validation before converting; only validation of the remote Task/Pipeline spec.

lbernick · 2023-05-03T19:51:19Z

I think a bigger problem might be trusted resources: we can only validate trusted resources before defaults are set, since the task/pipeline signature is computed before defaults are set. So the order of ops ends up needing to be 1. verification 2. set defaults 3. validate 4. convert to storage api version.
FYI @Yongxuanzhang

lbernick · 2023-05-11T14:45:34Z

@dibyom on second thought you might be right that we need to address propagated param logic first. I've created #6647 to track this.

lbernick · 2023-05-16T16:03:10Z

Update: I realized a new issue with local PipelineRefs/TaskRefs: A user can create a perfectly valid v1beta1 Pipeline/Task with enable-api-fields = stable, using beta features. This gets converted into the storage version of the API, and we don't know what API version it was originally created with. When swapping to storage version v1, this can break existing local PipelineRefs/TaskRefs.

There are multiple ways to address this problem, each with tradeoffs.

The approach taken in Validate specs of remote Pipelines before conversion #6645: Perform validation of remote pipelines before conversion to the storage version. Skip validation of local Pipelines, since they were validated on creation by the webhook.

Pros: can swap storage version without breaking any Pipelines
Cons: May end up reconciling invalid Pipelines for local PipelineRefs. Treats local Pipelines differently from remote Pipelines.

Perform validation of remote pipelines before conversion to the storage version. Perform validation of local Pipelines using the storage version of the API.

Pros: fail fast. Do not reconcile invalid Pipelines
Cons: can break v1beta1 local pipelines using beta features with "enable-api-fields" set to stable when we swap the storage version of the API. Local Pipelines treated differently than remote pipelines.

an idea from @wlynch (thank you!): Add an annotation to conversion logic signifying the value of "enable-api-fields" to use, based on the original version of the local Pipeline. If a Pipeline has this annotation set, use this value when validating beta features.

Pros: fail fast; do not reconcile invalid Pipelines. Can swap storage version without breaking any Pipelines.
Cons: enable-api-fields is set to whatever version was used when the Pipeline was first created. Changes by the cluster operator have no effect for existing Pipelines.

Separate enable-api-fields validation from the rest of our validation. Apply it only when a Pipeline is created or a remote pipeline is fetched, not when a local Pipeline is fetched.

Pros: fail fast; do not reconcile invalid Pipelines. Can swap storage version without breaking any Pipelines. May be easier to understand than solution 3.
Cons: enable-api-fields is effectively set to whatever version was used when the Pipeline was first created. Changes by the cluster operator have no effect for existing Pipelines. Requires significant refactoring.

Use per-feature flags (Per Feature Feature Flags #5632). We plan on doing this anyway, but not before swapping the storage version to v1.

Pros: Addresses the problem without the drawbacks associated with the other solutions. We can decide whether per-feature flags should gate implementation as well as existence of fields.
Cons: Pushes v1 timeline back significantly (this still needs design work). Need to make these changes for every beta feature to unblock v1.

Set enable-api-fields to "beta" by default (Decoupling API versioning and Feature versioning for features turned on by default #6592). We plan on doing this anyway.

Pros: super easy, reduces unexpected surprises when users swap their CRDs to v1
Cons: still breaks v1beta1 pipelines when we swap the storage version to v1 for users who have chosen to set "enable-api-fields" to "stable"

Would love to get some opinions on people's preferred approaches!

lbernick · 2023-05-18T18:51:08Z

Chatted with @dibyom and @JeromeJu. We believe option 6 is the best, and hope to mitigate the impact with clear release notes.

Options 1 and 2 treat local Pipelines differently from remote pipelines. I also don't think we should skip validation for local Pipelines, as in option 1.

I think options 3 and 4 will be very difficult to explain to cluster operators. It'll be hard to understand what value of "enable-api-fields" is being applied and what happens when you change the cluster settings.

We should still pursue and prioritize option 5. It's worth noting our only beta features right now are resolution, object params/results, and array results/array indexing. I think we can get 3 per feature flags done reasonably quickly.

Still would appreciate more feedback!

wlynch · 2023-05-19T13:32:12Z

I’m a bit skeptical of 6 - moving users from stable to beta feature flags by default feels like it may end up causing more problems. 😅 I don't think we want to make stable opt-in, and as you pointed out if you set the stable flag you’re still stuck not able to migrate. I think we should generally prioritize compatibility of stable over beta, though I understand this is a bit of a special case since it sounds like most (all?) of our usage for v1beta1 is using beta today.

For existing v1beta1 resources already under this behavior it seems fine and needed to ensure compatibility, but I’m not sure it makes sense to do for all new resources as well.
The idea for 3 was to carry over the feature flag values during conversion so that they only apply to existing resources that may have relied on the behavior. A local annotation overriding a global default is a common enough pattern that I'm not particularly worried about being confusing, and also gives control to authors to opt-out of the behavior as well. If cluster-wide control is a strict requirement, perhaps there can be an additional feature flag to disable the behavior? 🤷

dibyom · 2023-05-19T14:58:09Z

How does 3. work for existing resources that have already been created?

lbernick · 2023-05-19T15:28:24Z

@wlynch I just want to clarify that we plan to move off enable-api-fields in favor of per-feature flags to avoid these problems-- it's just not clear how long this will take.

I’m a bit skeptical of 6 - moving users from stable to beta feature flags by default feels like it may end up causing more problems.

There's some discussion of this on #6592. The general consensus seems to be that setting beta flags by default for right now will help address confusion with migration to v1, and that the best solution is to move to per-feature flags as soon as we can. Can you elaborate on the problems you're anticipating?

as you pointed out if you set the stable flag you’re still stuck not able to migrate

This problem doesn't affect users who use enable-api-fields=stable and are using only stable features; i.e. they will not experience problems upgrading or when we swap to v1 as our storage version. This problem only affects users who are setting "enable-api-fields" to stable, but also using beta features in beta CRDs. If they are already using beta features, it's likely that they'd want to continue doing so, so we think setting the default value to beta will help mitigate the impact on these users.

I think we should generally prioritize compatibility of stable over beta, though I understand this is a bit of a special case since it sounds like most (all?) of our usage for v1beta1 is using beta today.

I'm not sure exactly what you mean here. Is the concern that we are swapping from disabling beta features by default in GA CRDs to enabling them by default? I anticipate that most people are using beta CRDs with enable-api-fields set to stable.

A local annotation overriding a global default is a common enough pattern that I'm not particularly worried about being confusing, and also gives control to authors to opt-out of the behavior as well.

I disagree for a few reasons:

Instead of the user setting an annotation for a certain resource where they'd like to override defaults, we'd be automatically applying this annotation to all CRDs. If a cluster operator wanted to change the value of enable-api-fields, they'd also have to update all existing CRDs.
Currently, the cluster operator is the person who controls whether beta features are allowed. This gives pipeline and task authors the ability to override the cluster operator's decisions.
The way we'd actually perform this validation at runtime is confusing. For example, if a CRD was originally created as beta with "beta" fields, and a cluster operator has since changed the default to "stable" fields, we'd want to allow this. But if it was originally created as stable with "beta" fields, and the cluster operator has since changed the default to "stable" fields, that wouldn't be allowed. enable-api-fields has already proved to be quite confusing, and I think trying to explain this to our users will be hard.

If I'm misunderstanding your idea for part 3 please correct me!

wlynch · 2023-05-19T16:38:47Z

How does 3. work for existing resources that have already been created?

The thought was only apply it on conversion from v1beta1 -> v1. (I may be misunderstanding how CRD conversion works though, so lmk if this isn't feasible).

Is the concern that we are swapping from disabling beta features by default in GA CRDs to enabling them by default?

Yes. More specifically:

opting people into a beta track which users might not realize/know the affects of or may not want (I operate under the assumption that most users don't read the release notes 😅).
breaking people who thought they were using stable but were actually using beta without realizing - even if they were using beta, we have compatibility for beta features as well.

I think what I'd like to see is if we can scope the change down to only the resources it affects. 🤔

Currently, the cluster operator is the person who controls whether beta features are allowed. This gives pipeline and task authors the ability to override the cluster operator's decisions.

👍 This makes sense, so SGTM to not proceed with this option as-is.

JeromeJu · 2023-05-19T16:56:47Z

breaking people who thought they were using stable but were actually using beta without realizing - even if they were using beta, we have compatibility for beta features as well.

I would like to ➕ on the aim of users not to be affected nor realized for the version swap and also clarify for option #6, that the goal was also to not to have AIs for end users. Since we are now defaulting stable in v1beta1 as stored version at

pipeline/config/config-feature-flags.yaml

Line 75 in 96212a1

enable-api-fields: "stable"

, we are assuming that moving to beta in v1 after the migration should not lead to any breaking changes as default.

For beta features ie. resolvers/resolutions, as they are defaulted to be apparent in v1beta1 as stable, we have planned to move forward for defaulting the enable-api-fields to beta in WG to not to require users to patch feature flags to have features opt-in as default before the migration.

lbernick · 2023-05-22T19:43:14Z

I brought up one more idea to address this in today's API WG, which I think is the best option found so far. It is similar to option 1, but separates validation for beta features from the rest of pipeline/task spec validation. This is implemented in #6701.

afrittoli · 2023-05-24T22:12:45Z

It seems to me that the problem comes from implicitly allowing beta features for v1beta1 resources.
Since the version stored in only one through, that does not work.
The current state might be a consequence of the fact that we did not have a stable API for a long time and we didn't have stable value for feature flags either, which meant that when a feature reached beta it became available by default.

I think we should stop doing that and only rely on the config map setting. If the config map is set to stable, beta features won't be available, it doesn't matter the version with which a resource was originally defined.
After all, even if the naming is similar, the version of an API is a different thing from the version of a feature.

The implication of this may affect users. Once we start storing V1, users that want to continue to use beta features will have to set the config map to beta.
We can make that very visible in the release notes.

lbernick · 2023-05-25T14:44:43Z

@afrittoli just to make sure I understand correctly, it sounds like you're suggesting updating how we validate enable-api-fields, so that beta features cannot be used in beta CRDs when "enable-api-fields" is set to "stable"? That would solve this problem although I'm concerned about the breaking changes.

There was some interest in option #3 from #6616 (comment) in API WG this week. I've opened #6716 as a draft PR demonstrating what this could look like. However, I still strongly believe that #6701 is a better approach, for reasons detailed in the PR description of #6716.

afrittoli · 2023-05-25T17:34:46Z

@afrittoli just to make sure I understand correctly, it sounds like you're suggesting updating how we validate enable-api-fields, so that beta features cannot be used in beta CRDs when "enable-api-fields" is set to "stable"? That would solve this problem although I'm concerned about the breaking changes.

I think this is the most sustainable option moving forward.
If we start working on a v2alpha1 in future, would we want alpha features to be enabled there by default?
A v2alpha1 would be stored as v1 anyways, so we would have the same problem again unless we change the design of the enable-api-fields flag.

Perhaps we could combine it with some strategy to avoid breaking existing users, e.g.:

deprecate the current behaviour of the flag, and ask users to start setting the config to beta if they want to use beta features, regardless of the API version we are using
change the conversion logic to attach an annotation to the resource with the original API version
alter the validation logic to allow beta feature when the annotation is set to v1beta1 (only), regardless of the value of enable-api-fields
we can keep this logic around for 6 months, or as long as v1beta1 exists, but at least we won't run into this issue for future API versions

There was some interest in option #3 from #6616 (comment) in API WG this week. I've opened #6716 as a draft PR demonstrating what this could look like. However, I still strongly believe that #6701 is a better approach, for reasons detailed in the PR description of #6716.

About #6701, validation of beta features for remote resources is something we need for sure, it sounds like a bug if we are not doing that today - perhaps that could be a PR on its own.
Apart from that, it's not 100% clear to me why that works - I understand that on creation the original version is still there, but on update wouldn't we get the stored version already? ...unless the convertion webhook gives us back the original version already?

If #6701 works, I would still consider decoupling the API version from the feature version, and we could do that for future versions only, like I proposed above, to avoid breaking users.

lbernick · 2023-05-25T17:55:14Z

@afrittoli thanks for your response!

@afrittoli just to make sure I understand correctly, it sounds like you're suggesting updating how we validate enable-api-fields, so that beta features cannot be used in beta CRDs when "enable-api-fields" is set to "stable"? That would solve this problem although I'm concerned about the breaking changes.

I think this is the most sustainable option moving forward. If we start working on a v2alpha1 in future, would we want alpha features to be enabled there by default? A v2alpha1 would be stored as v1 anyways, so we would have the same problem again unless we change the design of the enable-api-fields flag.

There's some discussion about this in #6592. There seems to be consensus that our current approach with enable-api-fields is confusing/causing problems, and we should decouple feature versioning from api versioning, as you're suggesting. I'm hoping to keep the discussion of how we do that contained to that issue, but the approach that has the most buy-in at the moment is 1. short term, set "enable-api-fields" to "beta" by default, and 2. move to per-feature flags.

#6701 is just looking for a short term way to make progress on our v1 swap without creating breaking changes. I absolutely want to move to per-feature flags or another solution that decouples api versioning from feature versioning-- I just don't want that to block our v1 storage swap, and I don't think it should have to, since the storage swap shouldn't include user-facing changes.

Perhaps we could combine it with some strategy to avoid breaking existing users, e.g.:

deprecate the current behaviour of the flag, and ask users to start setting the config to beta if they want to use beta features, regardless of the API version we are using

change the conversion logic to attach an annotation to the resource with the original API version

alter the validation logic to allow beta feature when the annotation is set to v1beta1 (only), regardless of the value of enable-api-fields

we can keep this logic around for 6 months, or as long as v1beta1 exists, but at least we won't run into this issue for future API versions

It might be better to discuss this type of strategy on #6592. #6701 aims to address the issue of referenced tasks/pipelines when swapping to v1 without creating any user facing breakages, but #6592 is more geared towards the goal of decoupling feature versioning from API versioning which it seems there is consensus on.

About #6701, validation of beta features for remote resources is something we need for sure, it sounds like a bug if we are not doing that today - perhaps that could be a PR on its own.

👍 -- PTAL at #6719, which is a refactoring PR that's the first step here.

Apart from that, it's not 100% clear to me why that works - I understand that on creation the original version is still there, but on update wouldn't we get the stored version already? ...unless the convertion webhook gives us back the original version already?

I'm not sure I understand your question exactly. The problem is that in the reconciler, we're calling TaskSpec.Validate after converting to the storage version, but once our storage version is v1, existing definitions of beta resources will fail this call to TaskSpec.Validate. #6701 moves beta feature validation out of TaskSpec.Validate, and applies it only when a v1 resource is created on the cluster or referenced as a remote pipeline, not after conversion.

If #6701 works, I would still consider decoupling the API version from the feature version, and we could do that for future versions only, like I proposed above, to avoid breaking users.

100% agree!

lbernick added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label May 3, 2023

lbernick added this to Pipelines V1 May 3, 2023

lbernick moved this to Todo in Pipelines V1 May 3, 2023

This was referenced May 3, 2023

Decoupling API versioning and Feature versioning for features turned on by default #6592

Closed

Release v1 versions of Task, TaskRun, Pipeline, and PipelineRun CRDs #5541

Closed

lbernick self-assigned this May 3, 2023

lbernick mentioned this issue May 3, 2023

ValidateParamArrayIndex for taskSpec should check beta flag. #6607

Closed

lbernick mentioned this issue May 3, 2023

Cleanup: Move array indexing validation out of apis package #6617

Merged

4 tasks

lbernick mentioned this issue May 3, 2023

Trusted resources works only for storage version of the API #6618

Closed

lbernick added this to the Pipelines v0.48 milestone May 8, 2023

This was referenced May 8, 2023

Per Feature Feature Flags #5632

Closed

Validate specs of remote Pipelines before conversion #6645

Closed

Clean up propagated parameter/workspace validation #6647

Closed

lbernick mentioned this issue May 12, 2023

Split array param indexing validation between reconciler and webhook #6652

Merged

5 tasks

lbernick mentioned this issue May 22, 2023

Validate beta features only when v1 Tasks and Pipelines are defined #6701

Merged

4 tasks

jerop modified the milestones: Pipelines v0.48, Pipelines v0.49 May 23, 2023

lbernick mentioned this issue May 25, 2023

Consolidate validation for Task/Pipeline beta features #6719

Merged

4 tasks

This was referenced May 25, 2023

Add validation for beta features in v1 remote Tasks/Pipelines #6725

Merged

Enable beta features by default #6732

Merged

tekton-robot closed this as completed in #6701 May 30, 2023

github-project-automation bot moved this from Todo to Done in Pipelines V1 May 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Versioned validation of referenced Pipelines/Tasks #6616

Versioned validation of referenced Pipelines/Tasks #6616

lbernick commented May 3, 2023 •

edited

Loading

dibyom commented May 3, 2023

lbernick commented May 3, 2023

lbernick commented May 3, 2023

lbernick commented May 11, 2023

lbernick commented May 16, 2023 •

edited

Loading

lbernick commented May 18, 2023

wlynch commented May 19, 2023

dibyom commented May 19, 2023

lbernick commented May 19, 2023

wlynch commented May 19, 2023

JeromeJu commented May 19, 2023

lbernick commented May 22, 2023

afrittoli commented May 24, 2023

lbernick commented May 25, 2023

afrittoli commented May 25, 2023

lbernick commented May 25, 2023

Versioned validation of referenced Pipelines/Tasks #6616

Versioned validation of referenced Pipelines/Tasks #6616

Comments

lbernick commented May 3, 2023 • edited Loading

dibyom commented May 3, 2023

lbernick commented May 3, 2023

lbernick commented May 3, 2023

lbernick commented May 11, 2023

lbernick commented May 16, 2023 • edited Loading

lbernick commented May 18, 2023

wlynch commented May 19, 2023

dibyom commented May 19, 2023

lbernick commented May 19, 2023

wlynch commented May 19, 2023

JeromeJu commented May 19, 2023

lbernick commented May 22, 2023

afrittoli commented May 24, 2023

lbernick commented May 25, 2023

afrittoli commented May 25, 2023

lbernick commented May 25, 2023

lbernick commented May 3, 2023 •

edited

Loading

lbernick commented May 16, 2023 •

edited

Loading