Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AEP] docs: add proposal for kwok provider #5869

Closed

Conversation

vadasambar
Copy link
Member

Signed-off-by: vadasambar surajrbanakar@gmail.com

What type of PR is this?

/kind documentation
/kind feature

What this PR does / why we need it:

This is a proposal for implementing kwok provider. Check the files for more explanation.

Which issue(s) this PR fixes:

It will help fix issues like #5769

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Not sure what should be added ^

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


Signed-off-by: vadasambar <surajrbanakar@gmail.com>
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 19, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vadasambar
Once this PR has been reviewed and has the lgtm label, please assign x13n for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vadasambar vadasambar changed the title [AEP] docs: add wip proposal for kwok provider [AEP] docs: add proposal for kwok provider Jun 19, 2023
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
@vadasambar vadasambar marked this pull request as ready for review June 20, 2023 05:06
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 20, 2023
@k8s-ci-robot k8s-ci-robot requested a review from x13n June 20, 2023 05:07
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
@vadasambar
Copy link
Member Author

Note that the original plan was to create a first draft and then edit the proposal to match the KEP template but after discussion in sig-autoscaling meeting yesterday, I found out that I don't need to write a proposal to add a cloud provider.

I think having a proposal is still valuable (but it doesn't follow standard KEP template). It helps users understand the motivation for adding the cloud provider and it provides a document everyone can review and add comments/suggestions in. I will keep this PR open for reviews/suggestions from users/contributors before merging it in.

I have asked kwok and sig-scalability for review:

* `BuildKwokCloudProvider`:
* Create a lister to watch for nodes with the annotation `kwok.x-k8s.io/node: fake`
* The annotation is used to identify nodes created by the `kwok` provider (`kwok` provider creates nodes with this annotation). Limiting the lister `kwok` created nodes ensure we don't touch actual nodes.
* Use `KWOK_TEMPLATES_PATH` env variable to fetch user defined template nodes. This is so that the users can specify their own custom template nodes. This can be the yaml result of `kubectl get nodes` from a different cluster with some changes. Or, it can be nodes defined by the user from scratch in yaml. The yaml has to be mounted in the CA pod at the path specified in `KWOK_TEMPLATES_PATH` env variable
Copy link
Member

@wzshiming wzshiming Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kubernetes-sigs/kwok#668

I'm also designing this Mock data, if you're interested take look at the API

Copy link
Member Author

@vadasambar vadasambar Jun 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for sharing the link. kubernetes-sigs/kwok#668 looks interesting. 👍

You can see a PoC implementation of rest of the functions at https://github.com/kubernetes/autoscaler/pull/5820/files#diff-44474ffb56eda61e9a6b16c3ca66461cdf8e02a7e89b659f5a45ca32f5fa8588 (only supports template nodes passed via env variable as of writing this)

### Things to note
1. Once the user is done with using the `kwok` provider, `--cloudprovider` flag can be reset to any other cloud provider. Deleting the CA pod will cleanup all the fake nodes created by the `kwok` provider and restore the cluster to its original state. This behavior doesn't seem to be supported as of writing this (based on my tests).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleting the CA pod will cleanup all the fake nodes created by the kwok provider and restore the cluster to its original state. This behavior doesn't seem to be supported as of writing this (based on my tests).

Yes, kwok will not delete any nodes, we need to filter the nodes to be deleted based on the label/annotation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior doesn't seem to be supported as of writing this

Sorry for the confusion. I don't expect kwok to clean up nodes. This here is CA. I need to implement a go interface for this proposal. This interface has a Cleanup() function. My expectation is if I implement this function, CA would call it when the CA pod is terminating. But, this expectation is wrong because Cleanup function is for cleaning go routines when the CA pod is terminating. This is why I said This behavior doesn't seem to be supported as of writing this (based on my tests)..

TODO: change the sentence to make it less confusing

* relevant unit tests have been added
* `kwok` cloud provider is `demo`ed in one of the sig-autoscaling meetings

### FAQ
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question, as I understand it, CA depends on metric-server, right? I'm not sure if kwok nodes can scale up/down properly without /metrics/resource. I'm currently working on supporting metric-server.

kubernetes-sigs/kwok#480

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the link. Fortunately, CA doesn't depend on kubectl top pods or kubectl top nodes metrics. It is based on kube state metrics (it uses pod CPU and memory requests metrics).

That being said, having metrics server would definitely be nice. 😄

@vadasambar
Copy link
Member Author

@wzshiming thank you for taking a look at my proposal 🙏 ❤️

@x13n
Copy link
Member

x13n commented Oct 10, 2023

/assign

This looks quite interesting, I think some scalability issues could indeed be found with such cloud provider. Definitely not all of them, since some of them are cloud provider specific. I'd be happy to marge this, though maybe it would make sense to discuss on the sig meeting first? Or did that happen already? (I see this PR is quite old by now.)

@vadasambar
Copy link
Member Author

vadasambar commented Oct 10, 2023

/assign

This looks quite interesting, I think some scalability issues could indeed be found with such cloud provider. Definitely not all of them, since some of them are cloud provider specific. I'd be happy to marge this, though maybe it would make sense to discuss on the sig meeting first? Or did that happen already? (I see this PR is quite old by now.)

Thank you for showing interest ❤️

I already talked about this proposal in the sig meeting. Based on the discussion, I don't need an AEP to add a new cloudprovider. I have kept this PR so that people can find it easily.

I am working on getting the first release out. I plan to include this proposal as a doc in the PR.

PR: #5820

I think the code can improve (would love your review :) ). I want to get the first release out so that users can try it and work on the feedback.

@x13n
Copy link
Member

x13n commented Oct 13, 2023

Gotcha. I'd close this PR then since you're already working on the other one. Anyone reading this will know where to look next. Does that make sense?

Also, #5820 is currently marked as "DON'T REVIEW". Is it already in reviewable state?

@vadasambar
Copy link
Member Author

Gotcha. I'd close this PR then since you're already working on the other one. Anyone reading this will know where to look next. Does that make sense?

Agreed. I will close this PR.

Also, #5820 is currently marked as "DON'T REVIEW". Is it already in reviewable state?

I am writing tests right now. I would like to get a review on the PR in its current state but I will refrain (don't want to waste your time). Once I am done with the tests, I will mention you in the PR to let you know it is ready.

@vadasambar vadasambar closed this Oct 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants