Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce a formal policy for maintaining cloudproviders #5198

Merged
merged 1 commit into from
Dec 14, 2022

Conversation

MaciekPytel
Copy link
Contributor

This is an initial proposal for a policy on cloudprovider maintenance discussed in a recent sig-meeting.

The policy largely codifies what we've already been doing for years (including the requirements we've already imposed on new providers). It also introduces a new 'cloudprovider maintenance request' mechanism, following the general idea previously discussed in the sig-meeting.

Which component this PR applies to?

cluster-autoscaler

What type of PR is this?

/kind documentation

Does this PR introduce a user-facing change?

NONE

This policy is only relevant for CA / cloudprovider developers and not its users.

@k8s-ci-robot k8s-ci-robot added kind/documentation Categorizes issue or PR as related to documentation. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 19, 2022
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 19, 2022
@MaciekPytel
Copy link
Contributor Author

/hold
Holding to avoid any accidental merge before the discussion is done.

sig-leads for approval: @gjtempleton @mwielgus

cc: @ericrrath @jackfrancis @piepmatz @gajicdev @7fELF @ringtail @jtherin @elmiko @Andrius521 @nilo19 @ellistarn @gjtempleton @davidjumani @birdiesanders @DMajrekar @kevin-wangzefeng @dbonfigli @ddymko @thirdeyenick @displague @andrewsykim @zalmarge @schegi @enxebre @d-mo @alphajc @pawelkuc @RealHarshThakur @towca @jaypipes @drmorr0 @avorima @detiber @Sh4d1 @feiskyer @vishalanarase @PhilippeChepy @Jeffwan @OriHoch @louisportay @hardikdr @jlamillan @hello2mao @pierre-emmanuelJ @x13n @mrajashree @tghartland @remyleone @shysank @RainbowMango @LKaemmerling @ctrox @cprivitere @4ND3R50N @v-pap @ArturasRa @deitch @arunmk @aleksandra-malinowska @MorrisLaw @marwanad

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 19, 2022
@MaciekPytel
Copy link
Contributor Author

At least the github UI doesn't seem to be rendering some of the logins in the comment above correctly (is there maximum logins per line limit?) and I'm not sure if the notifications went out. Just in case:

cc: @RainbowMango @LKaemmerling @ctrox @cprivitere @4ND3R50N @v-pap @ArturasRa @deitch @arunmk @aleksandra-malinowska @MorrisLaw @marwanad

@deitch
Copy link
Contributor

deitch commented Sep 19, 2022

I don't think I got the alert for this, so thanks for @ cc'ing after.

This all looks good, glad to see it somewhat formalized.

The out-of-tree provider appears to be almost an after-thought, or a fallback. Is there any desire to move it entirely to out-of-tree, the same way, e.g. CCM is done?

@MaciekPytel
Copy link
Contributor Author

Thanks for confirming my suspicions on notification not working @deitch, I'll make sure to split it into multiple lines next time.

Re: external provider:

  • No plans to remove in-tree providers. There are things you can do in-tree that you can't do over grpc such as implementing custom processors to be used alongside your cloudprovider. Also, as far as I know, no one tried to run externalgrpc provider in a cluster with thousands of nodes and I doubt it would work. For comparison, it took us years to get GCE provider to reliably work at 5k node scale. This is purely a guess, but I'd imagine the proto would need to be updated to allow batching requests and the CA side of implementation would likely need much more caching and optimization. Given that all the very large clusters I know of run on existing in-tree providers, I don't know if anyone will be willing to invest that much effort into optimizing externalgrpc.
  • Based on the above, my recommendation to anyone building a new provider would be to start with externalgrpc if they just want to get it working with minimal effort. However, if you want some advanced, customized functionality and/or thousands of nodes scale and your're ready to invest a few months of fulltime effort, you're probably better off with in-tree.
  • I think you're right on externalgrpc sounding like a fallback / secondary option in the doc though, which is not what I intended. I'll update the doc to make external the first option. Thanks for the suggestion!

Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good to me, i have a couple comments but no major changes.

by kubernetes/kubernetes repository and the same version of the library is
used by CA and Kubernetes. This requirement is mainly driven by
the problems with version conflicts in transitive dependencies we've
experienced in the past.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it worth mentioning that cloudproviders are welcome to carry dependencies in their directory if they need?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 84 to 115
## Cloudprovider maintenance requirements

In order to allow code changes to Cluster Autoscaler that would require
non-trivial changes in cloudproviders this policy introduces _Cloudprovider
maintenance request_ (CMR) mechanism.

* CMR will be issued via a github issue tagging all
cloudprovider owners and describing the problem being solved and the changes
requested.
* CMR will clearly state the minor version in which the changes are expected
(ex. 1.26).
* CMR will need to be discussed on sig-autoscaling meeting and approved by
sig leads before being issued. It will also be announced on sig-autoscaling
slack channel and highlited in sig-autoscaling meeting notes.
* A CMR may be issued no later then [enhancements
freeze](https://github.com/kubernetes/sig-release/blob/master/releases/release_phases.md#enhancements-freeze)
of a given Kubernetes minor version.

Cloudprovider owners will be required to address CMR or request an exception via
the CMR github issue. A failure to take any action will result in cloudprovider
being considered abandoned and marking it as deprecated as described below.

### Empty maintenance request

If no CMRs are issued in a given minor release, core maintainers will issue an
_empty CMR_. The purpose of an empty CMR is to verify that cloudprovider owners
are still actively maintaining their integration. The only action required for
an empty CMR is replying on the github issue. Only one owner from each
cloudprovider needs to reply on the issue.

Empty CMR follows the same rules as any other CMR. In particular it needs to be
issued by enhancements freeze.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it would be nice to bring this part on a separate review since it proposes a new process.

Comment on lines 117 to 127
### Cloudprovider deprecation and deletion

If cloudprovider owners fail to take actions described above, the particular
integration will be marked as deprecated in the next CA minor release. A
deprecated cloudprovider will be completely removed after 1 year as per
[Kubernetes deprecation
policy](https://kubernetes.io/docs/reference/using-api/deprecation-policy/#deprecating-a-feature-or-behavior).

A deprecated cloudprovider may become maintained again if the owners become
active again or new owners step up. In order to regain maintained status any
outstanding CMRs will need to be addressed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i might bring this on a separate review as well, but i think it's so useful to have that we should include in the primary pr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a version of it, but moved part specific to CMR to a separate PR.

Copy link
Contributor

@jaypipes jaypipes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @MaciekPytel for this. I know I haven't been able to attend many SIG meetings this year and that the AWS cloud provider has a number of open issues/PRs on it. Will try to do better at filling these gaps.

Proposal looks reasonable and fair to me.

Comment on lines 102 to 104
Cloudprovider owners will be required to address CMR or request an exception via
the CMR github issue. A failure to take any action will result in cloudprovider
being considered abandoned and marking it as deprecated as described below.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 seems fair to me.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jaypipes, MaciekPytel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

built in.

An external cloudprovider implementation doesn't live in this repository and is
not a part of CA image. As such it is also not a subject to this policy.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo. Should read:

As such it is also not subject to this policy.

not

As such it is also not a subject to this policy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@x13n
Copy link
Member

x13n commented Nov 2, 2022

Could we make it a part of the policy to have non-empty approvers AND reviewers sections? I just sent out #5287 to update one of the OWNERS files, but if we require it, it should be here.

The policy largely codifies what we've already been doing for years
(including the requirements we've already imposed on new providers).
@MaciekPytel
Copy link
Contributor Author

@x13n, @deitch - addressed your comments. I've also split the CMR part into a separate PR as per @elmiko comment. This one is now essentially writing down what we've been doing for a while anyway.

Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

thanks @MaciekPytel

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 5, 2022
@x13n
Copy link
Member

x13n commented Dec 9, 2022

Can we cancel the hold on this one now?

@gjtempleton
Copy link
Member

/lgtm

@MaciekPytel
Copy link
Contributor Author

/hold cancel

This has been discussed multiple times on sig meeting now and it's been lgtm-ed by sig lead, so I think it's fine to let it merge

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 14, 2022
@k8s-ci-robot k8s-ci-robot merged commit c296a14 into kubernetes:master Dec 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants