enhancements/update/targeted-update-edge-blocking: Propose a new enhancement #821

wking · 2021-07-02T04:02:18Z

This enhancement proposes a mechanism for blocking edges for the subset of clusters considered vulnerable to known issues with a particular update or target release.

This is the third iteration of this proposal, replacing #426 (update-service-side decisions based on Telemetry queries) and #737 (update-service-side decisions based on client-supplied query parameters). I'm now recommending cluster-side decisions, with the update service declaring conditional edges and telling the cluster-side tools how to decide if the cluster is vulnerable to the known bugs impacting its update decisions.

enhancements/update/targeted-update-edge-blocking.md

sdodson · 2021-07-07T17:51:56Z

@wking I'm unsure if we should add this to alternatives section or where you'd like to address it, however I think it'd be great to think through how we can enable future cluster agnostic graph viewers. I'm a bit worried that the suggested implementation relies a bit too much on client side merging of multiple graphs which would complicate our ability to visualize edges into Blocked, Conditionally Blocked, In Progress (assuming phased rollouts) buckets. Such viewers wouldn't need to be capable of evaluating any conditional blocks but would be expected to present the conditions.

enhancements/update/targeted-update-edge-blocking.md

sdodson

Still a few open threads with questions but I think this is probably ready to have WIP removed. Would like to see affirmative support from other CVO maintainers that the level of complexity in the CVO is necessary.

enhancements/update/targeted-update-edge-blocking.md

jottofar · 2021-07-26T20:54:37Z

Still a few open threads with questions but I think this is probably ready to have WIP removed. Would like to see affirmative support from other CVO maintainers that the level of complexity in the CVO is necessary.

I don't find the suggested CVO changes overly complex. Seems to me a reasonable outgrowth of what we already have in CVO.

enhancements/update/targeted-update-edge-blocking.md

LalatenduMohanty · 2021-08-27T20:28:07Z

We need a section for what it means for a disconnected cluster (as they might not have telemetry), when we release a cincinnati-graph-data image for OSUS operator setup.

Implementing targeted edge blocking as described in [1]. Except: * Some properties which are optional in the enhancement are required here. I think this is strictly worse handling for the cluster-version operator, because we are just getting this data from the upstream update service's Cincinnati JSON, and if the upstream says "here's a risk, but I am not linking a URI", and the CVO subsequently rejects the entire update payload as invalid, then downstream consumers are not hearing about the reason/message/matchingRules that the upstream may have been passing along. But I'm not an API approver, and David is asking for the +required in [2,3,4,5]. I dunno why he isn't finding my argument's convincing, but I'm out of time to continue arguing my position. * Similarly, the ClusterCondition.Type property has an enum, because David requires it [6]. This means that the CVO will need to self-censor if there are unrecognized types (because we don't want our ClusterVersion writes rejected on API-server validation failures). That means that instead of saying "... and there was another rule $UNRECOGNIZED_TYPE that I don't know how to evaluate", the CVO will have to silently drop that rule. And if that's the only rule, the CVO will have to silently drop that risk and conditional update. David is fine with that, even if I'm not on board, and he's the API approver, so that's what we'll do. Swagger, deepcopy, and CRD changes were generated with: $ make update [1]: openshift/enhancements#821 [2]: openshift#1011 (comment) [3]: openshift#1011 (comment) [4]: openshift#1011 (comment) [5]: openshift#1011 (comment) [6]: openshift#1011 (comment)

Implementing targeted edge blocking as described in [1]. Except: * Some properties which are optional in the enhancement are required here. I think this is strictly worse handling for the cluster-version operator, because we are just getting this data from the upstream update service's Cincinnati JSON, and if the upstream says "here's a risk, but I am not linking a URI", and the CVO subsequently rejects the entire update payload as invalid, then downstream consumers are not hearing about the reason/message/matchingRules that the upstream may have been passing along. But I'm not an API approver, and David is asking for the +required in [2,3,4,5]. I dunno why he isn't finding my argument's convincing, but I'm out of time to continue arguing my position. * Similarly, the ClusterCondition.Type property has an enum, because David requires it [6]. This means that the CVO will need to self-censor if there are unrecognized types (because we don't want our ClusterVersion writes rejected on API-server validation failures). That means that instead of saying "... and there was another rule $UNRECOGNIZED_TYPE that I don't know how to evaluate", the CVO will have to silently drop that rule. And if that's the only rule, the CVO will have to silently drop that risk and conditional update. David is fine with that, even if I'm not on board, and he's the API approver, so that's what we'll do. I've also upcased the types to Always and PromQL; from David: Always and PromQL. We use CamelCase for enumerated values Swagger, deepcopy, and CRD changes were generated with: $ make update [1]: openshift/enhancements#821 [2]: openshift#1011 (comment) [3]: openshift#1011 (comment) [4]: openshift#1011 (comment) [5]: openshift#1011 (comment) [6]: openshift#1011 (comment)

Implementing targeted edge blocking as described in [1]. Except: * Some properties which are optional in the enhancement are required here. I think this is strictly worse handling for the cluster-version operator, because we are just getting this data from the upstream update service's Cincinnati JSON, and if the upstream says "here's a risk, but I am not linking a URI", and the CVO subsequently rejects the entire update payload as invalid, then downstream consumers are not hearing about the reason/message/matchingRules that the upstream may have been passing along. But I'm not an API approver, and David is asking for the +required in [2,3,4,5]. I dunno why he isn't finding my argument's convincing, but I'm out of time to continue arguing my position. * Similarly, the ClusterCondition.Type property has an enum, because David requires it [6]. This means that the CVO will need to self-censor if there are unrecognized types (because we don't want our ClusterVersion writes rejected on API-server validation failures). That means that instead of saying "... and there was another rule $UNRECOGNIZED_TYPE that I don't know how to evaluate", the CVO will have to silently drop that rule. And if that's the only rule, the CVO will have to silently drop that risk and conditional update. David is fine with that, even if I'm not on board, and he's the API approver, so that's what we'll do. I've also upcased the types to Always and PromQL; from David: Always and PromQL. We use CamelCase for enumerated values * history[].overrides became history[].acceptedRisks to satisfy David's concerns about misinterpretation [7]. Swagger, deepcopy, and CRD changes were generated with: $ make update [1]: openshift/enhancements#821 [2]: openshift#1011 (comment) [3]: openshift#1011 (comment) [4]: openshift#1011 (comment) [5]: openshift#1011 (comment) [6]: openshift#1011 (comment) [7]: openshift#1011 (comment)

Implementing targeted edge blocking as described in [1]. Except: * Some properties which are optional in the enhancement are required here. I think this is strictly worse handling for the cluster-version operator, because we are just getting this data from the upstream update service's Cincinnati JSON, and if the upstream says "here's a risk, but I am not linking a URI", and the CVO subsequently rejects the entire update payload as invalid, then downstream consumers are not hearing about the reason/message/matchingRules that the upstream may have been passing along. But I'm not an API approver, and David is asking for the +required in [2,3,4,5]. I dunno why he isn't finding my argument's convincing, but I'm out of time to continue arguing my position. * Similarly, the ClusterCondition.Type property has an enum, because David requires it [6]. This means that the CVO will need to self-censor if there are unrecognized types (because we don't want our ClusterVersion writes rejected on API-server validation failures). That means that instead of saying "... and there was another rule $UNRECOGNIZED_TYPE that I don't know how to evaluate", the CVO will have to silently drop that rule. And if that's the only rule, the CVO will have to silently drop that risk and conditional update. David is fine with that, even if I'm not on board, and he's the API approver, so that's what we'll do. I've also upcased the types to Always and PromQL; from David: Always and PromQL. We use CamelCase for enumerated values * history[].overrides became history[].acceptedRisks to satisfy David's concerns about misinterpretation [7]. * The server-side apply annotations documented in [8]: listType, listMapKey, patchMergeStrategy, and patchMergeKey. Swagger, deepcopy, and CRD changes were generated with: $ make update [1]: openshift/enhancements#821 [2]: openshift#1011 (comment) [3]: openshift#1011 (comment) [4]: openshift#1011 (comment) [5]: openshift#1011 (comment) [6]: openshift#1011 (comment) [7]: openshift#1011 (comment) [8]: https://kubernetes.io/docs/reference/using-api/server-side-apply/

…anded Following up on openshift#821 with the changes made in order to land openshift/api#1011.

…anded Following up on openshift#821 with the changes made in order to land openshift/api#1011. The graph-data properties remain optional, because graph-data has been used for unconditional blocks that do not declare risks today. The update service will continue to read those blocks, but will implement them by pruning 'edges'. So things that end up in 'conditionalEdges' will still need all the metadata populated to keep David happy ;).

So we can explain why we're blocking the different edges [1] (the promql -> PromQL type change is in flight with [2]). The zz in the filename for the vSphere hostname block ensures that one sorts last alphabetically, because it's the broadest block, and legacy Cincinnati services will prefer the final regular expression they load for a given 'to' target. This is basically a second attempt at my earlier 39bc2fb (blocked-edges/4.7.4*: Targeted edge blocking and version 1.1.0, 2021-09-01, openshift#1056), which ended up getting reverted in da1254a (Revert "blocked-edges/4.7.4*: Targeted edge blocking and version 1.1.0", 2021-09-21, openshift#1078) because the production Cincinnati was mad about the 1.1.0 version string. [3] taught Cincinnati to relax, and now that's live (and we never shipped any versions that would be mad about 1.1.0 to customers, the 4.6.0 Update Service operator went out before [4]). [1]: openshift/enhancements#821 [2]: openshift/enhancements#910 [3]: openshift/cincinnati#538 [4]: openshift/cincinnati#314

To keep us from forgetting to set 'url', etc. [1,2]. We'll probably want to have PromQL parser validation later [3], but I write Python a lot faster than I write Go. This should avoid obvious mistakes while the PromQL parser gets wired up to a validator. [1]: openshift/enhancements#821 [2]: openshift/enhancements#910 [3]: https://github.com/openshift/enhancements/blob/4412b51182ba6e007f1308a73e290e81fd66d092/enhancements/update/targeted-update-edge-blocking.md#promql-validation

…anded Following up on openshift#821 with the changes made in order to land openshift/api#1011. The graph-data properties remain optional, because graph-data has been used for unconditional blocks that do not declare risks today. The update service will continue to read those blocks, but will implement them by pruning 'edges'. So things that end up in 'conditionalEdges' will still need all the metadata populated to keep David happy ;).

So we can explain why we're blocking the different edges [1] (the promql -> PromQL type change is in flight with [2]). The zz in the filename for the vSphere hostname block ensures that one sorts last alphabetically, because it's the broadest block, and legacy Cincinnati services will prefer the final regular expression they load for a given 'to' target. This is basically a second attempt at my earlier 39bc2fb (blocked-edges/4.7.4*: Targeted edge blocking and version 1.1.0, 2021-09-01, openshift#1056), which ended up getting reverted in da1254a (Revert "blocked-edges/4.7.4*: Targeted edge blocking and version 1.1.0", 2021-09-21, openshift#1078) because the production Cincinnati was mad about the 1.1.0 version string. [3] taught Cincinnati to relax, and now that's live (and we never shipped any versions that would be mad about 1.1.0 to customers, the 4.6.0 Update Service operator went out before [4]). [1]: openshift/enhancements#821 [2]: openshift/enhancements#910 [3]: openshift/cincinnati#538 [4]: openshift/cincinnati#314

Implementing targeted edge blocking as described in [1]. Except: * Some properties which are optional in the enhancement are required here. I think this is strictly worse handling for the cluster-version operator, because we are just getting this data from the upstream update service's Cincinnati JSON, and if the upstream says "here's a risk, but I am not linking a URI", and the CVO subsequently rejects the entire update payload as invalid, then downstream consumers are not hearing about the reason/message/matchingRules that the upstream may have been passing along. But I'm not an API approver, and David is asking for the +required in [2,3,4,5]. I dunno why he isn't finding my argument's convincing, but I'm out of time to continue arguing my position. * Similarly, the ClusterCondition.Type property has an enum, because David requires it [6]. This means that the CVO will need to self-censor if there are unrecognized types (because we don't want our ClusterVersion writes rejected on API-server validation failures). That means that instead of saying "... and there was another rule $UNRECOGNIZED_TYPE that I don't know how to evaluate", the CVO will have to silently drop that rule. And if that's the only rule, the CVO will have to silently drop that risk and conditional update. David is fine with that, even if I'm not on board, and he's the API approver, so that's what we'll do. I've also upcased the types to Always and PromQL; from David: Always and PromQL. We use CamelCase for enumerated values * history[].overrides became history[].acceptedRisks to satisfy David's concerns about misinterpretation [7]. * The server-side apply annotations documented in [8]: listType, listMapKey, patchMergeStrategy, and patchMergeKey. Swagger, deepcopy, and CRD changes were generated with: $ make update [1]: openshift/enhancements#821 [2]: openshift#1011 (comment) [3]: openshift#1011 (comment) [4]: openshift#1011 (comment) [5]: openshift#1011 (comment) [6]: openshift#1011 (comment) [7]: openshift#1011 (comment) [8]: https://kubernetes.io/docs/reference/using-api/server-side-apply/

…endations" Users are occasionally confused by the "blocked edges" wording, which is internal shorthand. The fact that update recommendations are tracked as a directed graph (hence "edges") isn't all that relevant for most users, so talk about "updates" instead of "edges". While we occasionally drop update recommendations to reduce the chances that users update into serious issues, from [1]: When an update recommendation is supported, it remains supported for the life of 4.10, even if the update recommendation is later dropped or made conditional. Updating is still possible, we just don't think it's a good idea (unless maybe you're elbow deep in a support case, and a support tech decides to recommend an update based on context that is not available to the update service). So replace "blocked" with "not recommended", to convey the "still possible" aspect. Also ask for some more specifics in the impact statement template, as we gear up to support conditional updates fff9d3c (enhancements/update/targeted-update-edge-blocking: Propose a new enhancement, 2020-08-07, openshift#821). I haven't gone as far as "tell us what PromQL to use in our conditional recommendation", but I'm easing things in that direction. [1]: https://docs.openshift.com/container-platform/4.10/updating/understanding-upgrade-channels-release.html#upgrade-version-paths_understanding-upgrade-channels-releases

Because he approved the original pull request creating the directory [1]. [1]: openshift#821 (comment)

…endations" Users are occasionally confused by the "blocked edges" wording, which is internal shorthand. The fact that update recommendations are tracked as a directed graph (hence "edges") isn't all that relevant for most users, so talk about "updates" instead of "edges". While we occasionally drop update recommendations to reduce the chances that users update into serious issues, from [1]: When an update recommendation is supported, it remains supported for the life of 4.10, even if the update recommendation is later dropped or made conditional. Updating is still possible, we just don't think it's a good idea (unless maybe you're elbow deep in a support case, and a support tech decides to recommend an update based on context that is not available to the update service). So replace "blocked" with "not recommended", to convey the "still possible" aspect. Also ask for some more specifics in the impact statement template, as we gear up to support conditional updates fff9d3c (enhancements/update/targeted-update-edge-blocking: Propose a new enhancement, 2020-08-07, openshift#821). I haven't gone as far as "tell us what PromQL to use in our conditional recommendation", but I'm easing things in that direction. [1]: https://docs.openshift.com/container-platform/4.10/updating/understanding-upgrade-channels-release.html#upgrade-version-paths_understanding-upgrade-channels-releases

…endations" Users are occasionally confused by the "blocked edges" wording, which is internal shorthand. The fact that update recommendations are tracked as a directed graph (hence "edges") isn't all that relevant for most users, so talk about "updates" instead of "edges". While we occasionally drop update recommendations to reduce the chances that users update into serious issues, from [1]: When an update recommendation is supported, it remains supported for the life of 4.10, even if the update recommendation is later dropped or made conditional. Updating is still possible, we just don't think it's a good idea (unless maybe you're elbow deep in a support case, and a support tech decides to recommend an update based on context that is not available to the update service). So replace "blocked" with "not recommended", to convey the "still possible" aspect. Also ask for some more specifics in the impact statement template, as we gear up to support conditional updates fff9d3c (enhancements/update/targeted-update-edge-blocking: Propose a new enhancement, 2020-08-07, openshift#821). [1]: https://docs.openshift.com/container-platform/4.10/updating/understanding-upgrade-channels-release.html#upgrade-version-paths_understanding-upgrade-channels-releases

openshift-ci bot requested review from celebdor and cooktheryan July 2, 2021 04:02

This was referenced Jul 2, 2021

enhancements/update/targeted-update-edge-blocking: Propose a new enhancement #426

Closed

enhancements/update/targeted-update-edge-blocking: Client-side platform parameter #737

Closed

wking force-pushed the targeted-edge-blocking-cluster-side-prom branch 2 times, most recently from 95efba9 to e664c70 Compare July 2, 2021 04:39

wking changed the title ~~enhancements/update/targeted-update-edge-blocking: Propose a new enhancement~~ WIP: enhancements/update/targeted-update-edge-blocking: Propose a new enhancement Jul 2, 2021

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 2, 2021

wking force-pushed the targeted-edge-blocking-cluster-side-prom branch from e664c70 to 61c6c45 Compare July 2, 2021 05:14

sdodson reviewed Jul 2, 2021

View reviewed changes

wking force-pushed the targeted-edge-blocking-cluster-side-prom branch from 61c6c45 to a034a69 Compare July 2, 2021 18:28

bparees reviewed Jul 7, 2021

View reviewed changes