Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement kwok cloud provider #5820

Merged
merged 1 commit into from
Nov 27, 2023

Conversation

vadasambar
Copy link
Member

@vadasambar vadasambar commented May 30, 2023

Signed-off-by: vadasambar surajrbanakar@gmail.com

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR implements kwok provider (kwok is a more performant alternative to kubemark; it doesn't require creation of a separate master/cluster for testing) for scale testing CA to identify and fix issues like #5769

More more info, check the README

Which issue(s) this PR fixes:

There is no issue for this PR. I can create one if needed.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Added: new kwok cloud provider (check https://github.com/kubernetes/autoscaler/blob/kwok-poc/cluster-autoscaler/cloudprovider/kwok/README.md) for more info

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 30, 2023
@vadasambar vadasambar changed the title [DON'T REVIEW] feat: wip implement CloudProvider interface boilerplate for kwok provider [DON'T REVIEW] feat: implement kwok cloud provider May 30, 2023
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 31, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 7, 2023
@vadasambar
Copy link
Member Author

vadasambar commented Jun 12, 2023

Here's what it looks like in action

image

Here's what the new node looks like:
suraj@suraj:~/sandbox/go-sandbox/node-templates$ k get no ng-2-62588 -oyaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    cluster-autoscaler.kwok.nodegroup/name: ng-2
    kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
    kwok.x-k8s.io/node: fake
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2023-06-07T05:15:19Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: kind-worker
    kubernetes.io/os: linux
  name: ng-2-62588
  resourceVersion: "111083"
  uid: 7bd7ff54-0e0a-48b3-bccc-9b3e75658850
spec:
  podCIDR: 10.244.1.0/24
  podCIDRs:
  - 10.244.1.0/24
  providerID: kwok:ng-2-62588
status:
  addresses:
  - address: 172.18.0.3
    type: InternalIP
  - address: kind-worker
    type: Hostname
  allocatable:
    cpu: "12"
    ephemeral-storage: 959786032Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 32781516Ki
    pods: "110"
  capacity:
    cpu: "12"
    ephemeral-storage: 959786032Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 32781516Ki
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2023-06-07T05:16:24Z"
    lastTransitionTime: "2023-06-07T05:16:24Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready
  - lastHeartbeatTime: "2023-06-07T05:16:24Z"
    lastTransitionTime: "2023-06-07T05:16:24Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2023-06-07T05:16:24Z"
    lastTransitionTime: "2023-06-07T05:16:24Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2023-06-07T05:16:24Z"
    lastTransitionTime: "2023-06-07T05:16:24Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2023-06-07T05:16:24Z"
    lastTransitionTime: "2023-06-07T05:16:24Z"
    message: RouteController created a route
    reason: RouteCreated
    status: "False"
    type: NetworkUnavailable
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10247
  images:
  - names:
    - registry.k8s.io/etcd:3.5.6-0
    sizeBytes: 102542580
  - names:
    - docker.io/library/import-2023-03-30@sha256:ba097b515c8c40689733c0f19de377e9bf8995964b7d7150c2045f3dfd166657
    - registry.k8s.io/kube-apiserver:v1.26.3
    sizeBytes: 80392681
  - names:
    - docker.io/library/import-2023-03-30@sha256:8dbb345de79d1c44f59a7895da702a5f71997ae72aea056609445c397b0c10dc
    - registry.k8s.io/kube-controller-manager:v1.26.3
    sizeBytes: 68538487
  - names:
    - docker.io/library/import-2023-03-30@sha256:44db4d50a5f9c8efbac0d37ea974d1c0419a5928f90748d3d491a041a00c20b5
    - registry.k8s.io/kube-proxy:v1.26.3
    sizeBytes: 67217404
  - names:
    - docker.io/library/import-2023-03-30@sha256:3dd2337f70af979c7362b5e52bbdfcb3a5fd39c78d94d02145150cd2db86ba39
    - registry.k8s.io/kube-scheduler:v1.26.3
    sizeBytes: 57761399
  - names:
    - docker.io/kindest/kindnetd:v20230330-48f316cd@sha256:c19d6362a6a928139820761475a38c24c0cf84d507b9ddf414a078cf627497af
    - docker.io/kindest/kindnetd@sha256:c19d6362a6a928139820761475a38c24c0cf84d507b9ddf414a078cf627497af
    sizeBytes: 27726335
  - names:
    - docker.io/kindest/local-path-provisioner:v0.0.23-kind.0@sha256:f2d0a02831ff3a03cf51343226670d5060623b43a4cfc4808bd0875b2c4b9501
    sizeBytes: 18664669
  - names:
    - registry.k8s.io/coredns/coredns:v1.9.3
    sizeBytes: 14837849
  - names:
    - docker.io/kindest/local-path-helper:v20230330-48f316cd@sha256:135203f2441f916fb13dad1561d27f60a6f11f50ec288b01a7d2ee9947c36270
    sizeBytes: 3052037
  - names:
    - registry.k8s.io/pause:3.7
    sizeBytes: 311278
  nodeInfo:
    architecture: amd64
    bootID: ff1e4c75-5e08-49d6-b47e-808d9b298720
    containerRuntimeVersion: containerd://1.6.19-46-g941215f49
    kernelVersion: 5.15.0-72-generic
    kubeProxyVersion: v1.26.3
    kubeletVersion: v1.26.3
    machineID: 57c8e586bd8b4b2b80d28d156ce6ef03
    operatingSystem: linux
    osImage: Ubuntu 22.04.2 LTS
    systemUUID: Ubuntu 22.04.2 LTS
  phase: Running

Here's what it looks like in Grafana (this is from a different iteration of the test)
image

Dashboard is taken as is from https://grafana.com/grafana/dashboards/3831-autoscaler/ with a small change (I have renamed y axis legend for Pod activity panel)

@elmiko
Copy link
Contributor

elmiko commented Jun 12, 2023

i'm very curious about the comparisons between kwok and kubemark.

also, why not create this as a gRPC based provider?

as a gRPC provider there won't be any limitations/restrictions based on adding code to this repository.

@qianlei90
Copy link
Contributor

qianlei90 commented Jun 13, 2023

I think this provider makes contributors debug or develop CA locally easier than kubemark, and maybe it's a lightweight choice for CI/CD. This provider can also be a startup or a tutorials for those who want contribute to CA but lack of cloud resources.

@vadasambar
Copy link
Member Author

i'm very curious about the comparisons between kwok and kubemark.

Thanks for the comment :D

I am working on a load test for #5769. I will try to post some data on that tomorrow and link here. I am hoping it shows a more concrete difference between the two of them performance wise. I didn't get time to elaborate more on this. I will try to do that in an upcoming comment. Most of it boils down:

  1. I wanted to test issues like CA scale-up delays on clusters with heavy scaling activity #5769 without having to spin up resources in a public cloud provider. kubemark needed 100m CPU per fake node (which would be 1 core per upto 10 nodes). I wanted a more lightweight solution.

    Currently we're running HollowNode with a limit of 0.09 CPU core/pod and 220MB of memory. However, if we also take into account the resources absorbed by default cluster addons and fluentD running on the 'external' cluster, this limit becomes ~0.1 CPU core/pod, thus allowing ~10 HollowNodes to run per core (on an "n1-standard-8" VM node).

    https://github.com/kubernetes/community/blob/master/contributors/devel/sig-scalability/kubemark-guide.md#starting-a-kubemark-cluster

  2. Easier testing like @qianlei90 mentioned

  3. I found it easier to install and try (feel free to disagree here :))

  4. This is a little ambitious but I hope kwok can be used as a dry-run option for different cloud providers

also, why not create this as a gRPC based provider?

as a gRPC provider there won't be any limitations/restrictions based on adding code to this repository.

I wanted to be able to scale-test issues around different cloud providers i.e., 4. Implementing the kwok provider in-tree means we are closer to the actual implementation of our (I think) most-used cloud providers (adding gRPC communication in between would mean an extra delay which is not there in our in-tree cloud providers). I was hoping for something like this:

  1. Implement in-tree provider to cover most of the common use-cases
  2. Implement kwok provider for clusterapi so that we can provision kwok nodes using clusterapi provider
  3. Implement gRPC provider if there is user demand

For 2, Richard from CAPA is already working on it.

@vadasambar
Copy link
Member Author

vadasambar commented Jun 14, 2023

@elmiko you might be interested in #5769 (comment)

If you check the resource consumption for the kwok-controller* pod, you can create close to 400 nodes in well under 100m of CPU as opposed to kubemark where 100m of CPU per node is recommended.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 25, 2023
@k8s-ci-robot k8s-ci-robot added area/helm-charts and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Aug 8, 2023
@@ -120,6 +122,7 @@ rules:
verbs:
- list
- watch
- get
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To get kwok-provider-config and kwok-provider-templates ConfigMap

@@ -42,6 +42,8 @@ rules:
verbs:
- watch
- list
- create
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For creating and deleting nodes.

apiVersion: v1
kind: ConfigMap
metadata:
name: kwok-provider-templates
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a sample kwok-provider-templates which helps users to get started with the kwok provider immediately (kwok-provider-config is in tune with this configmap).

// Refresh is called before every main loop and can be used to dynamically update cloud provider state.
// In particular the list of node groups returned by NodeGroups can change as a result of CloudProvider.Refresh().
// TODO(vadasambar): implement this
func (kwok *KwokCloudProvider) Refresh() error {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not needed for the first release but good to have in the future.

@vadasambar vadasambar changed the title [DON'T REVIEW] feat: implement kwok cloud provider feat: implement kwok cloud provider Nov 16, 2023
@vadasambar vadasambar marked this pull request as ready for review November 16, 2023 17:40
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 16, 2023
@vadasambar
Copy link
Member Author

@x13n mentioning you since you said you were interested in the PR.
The goal is to get the first release of the kwok provider out so that we can get user feedback. Any sort of review/feedback is welcome. I want to address review/feedback comments in a separate PR (I will create separate issue if needed) unless they block the release.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 17, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 17, 2023
@vadasambar
Copy link
Member Author

For people present during the kwok provider demo and interested in why scale-up didn't happen in the end. Scale-up didn't happen because I only had two instance-type values in the template nodes but I tried scaling the sample workload to 5 replicas with inter pod-affinity based on instance-type label. That's all!

image
image
image

@vadasambar
Copy link
Member Author

vadasambar commented Nov 20, 2023

TODOs after demo in the sig meeting:

  • squash all the commits into 1
  • do another pass through of the PR. I had to delete the pod stuck in Pending state to get into Running state even though node was present to schedule it (seems like an underlying kwok issue?)

func buildCloudProvider(opts config.AutoscalingOptions,
do cloudprovider.NodeGroupDiscoveryOptions,
rl *cloudprovider.ResourceLimiter,
allNodesLister listersv1.NodeLister) cloudprovider.CloudProvider {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need a nodeLister to be passed? Can we pass something more generic instead (client)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comment.

The intention was to pass only what is needed. A caveat however is, cloudprovider implementers would have to change this if they need more than just node lister and possibly touch other cloudprovider's code to make it compatible as well.

My initial implementation was based on informer (which is more generic). We can pass it here. I can make that change. Let me know if you have any differing thoughts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing in informers now.

feat: wip implement `CloudProvider` interface boilerplate for `kwok` provider
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add builder for `kwok`
- add logic to scale up and scale down nodes in `kwok` provider
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip parse node templates from file
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add short README
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: implement remaining things
- to get the provider in a somewhat working state
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add in-cluster `kwok` as pre-requisite in the README
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: templates file not correctly marshalling into node list
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: `invalid leading UTF-8 octet` error during template parsing
- remove encoding using `gob`
- not required
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: use lister to get and list
- instead of uncached kube client
- add lister as a field on the provider and nodegroup struct
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: `did not find nodegroup annotation` error
- CA was thinking the annotation is not present even though it is
- fix a bug with parsing annotation
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: CA node recognizing fake nodegroups
- add provider ID to nodes in the format `kwok:<node-name>`
- fix invalid `KwokManagedAnnotation`
- sanitize template nodes (remove `resourceVersion` etc.,)
- not sanitizing the node leads to error during creation of new nodes
- abstract code to get NG name into a separate function `getNGNameFromAnnotation`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: node not getting deleted
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add empty test file
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: add OWNERS file
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip kwok provider config
- add samples for static and dynamic template nodes
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip implement pulling node templates from cluster
- add status field to kwok provider config
- this is to capture how the nodes would be grouped by (can be annotation or label)
- use kwok provider config status to get ng name from the node template
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: syntax error in calling `loadNodeTemplatesFromCluster`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: first draft of dynamic node templates
- this allows node templates to be pulled from the cluster
- instead of having to specify static templates manually
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: syntax error
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: abstract out related code into separate files
- use named constants instead of hardcoded values
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: cleanup kwok nodes when CA is exiting
- so that the user doesn't have to cleanup the fake nodes themselves
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: return `nil` instead of err for `HasInstance`
- because there is no underlying cloud provider (hence no reason to return `cloudprovider.ErrNotImplemented`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: start working on tests for kwok provider config
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add `gpuLabelKey` under `nodes` field in kwok provider config
- fix validation for kwok provider config
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add motivation doc
- update README with more details
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: update kwok provider config example to support pulling gpu labels and types from existing providers
- still needs to be implemented in the code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip update kwok provider config to get gpu label and available types
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip read gpu label and available types from specified provider
- add available gpu types in kwok provider config status
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add validation for gpu fields in kwok provider config
- load gpu related fields in kwok provider config status
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: implement `GetAvailableGPUTypes`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add support to install and uninstall kwok
- add option to disable installation
- add option to manually specify kwok release tag
- add future scope in readme
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add future scope 'evaluate adding support to check if kwok controller already exists'
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: vendor conflict and cyclic import
- remove support to get gpu config from the specified provider (can't be used because leads to cyclic import)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add a TODO 'get gpu config from other providers'
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `file` -> `configmap`
- load config and templates from configmap instead of file
- move `nodes` and `nodegroups` config to top level
- add helper to encode configmap data into `[]bytes`
- add helper to get current pod namespace
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add new options to the kwok provider config
- auto install kwok only if the version is >= v0.4.0
- add test for `GPULabel()`
- use `kubectl apply` way of installing kwok instead of kustomize
- add test for kwok helpers
- add test for kwok config
- inject service account name in CA deployment
- add example configmap for node templates and kwok provider config in CA helm chart
- add permission to create `clusterrolebinding` (so that kwok provider can create a clusterrolebinding with `cluster-admin` role and create/delete upstream manifests)
- update kwok provider sample configs
- update `README`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: update go.mod to use v1.28 packages
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: `go mod tidy` and `go mod vendor` (again)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: kwok installation code
- add functions to create and delete clusterrolebinding to create kwok resources
- refactor kwok install and uninstall fns
- delete manifests in the opposite order of install ]
- add cleaning up left-over kwok installation to future scope
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: nil ptr error
- add `TODO` in README for adding docs around kwok config fields
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove code to automatically install and uninstall `kwok`
- installing/uninstalling requires strong permissions to be granted to `kwok`
- granting strong permissions to `kwok` means granting strong permissions to the entire CA codebase
- this can pose a security risk
- I have removed the code related to install and uninstall for now
- will proceed after discussion with the community
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: run `go mod tidy` and `go mod vendor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: add permission to create nodes
- to fix permissions error for kwok provider
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add more unit tests
- add tests for kwok helpers
- fix and update kwok config tests
- fix a bug where gpu label was getting assigned to `kwokConfig.status.key`
- expose `loadConfigFile` -> `LoadConfigFile`
- throw error if templates configmap does not have `templates` key (value of which is node templates)
- finish test for `GPULabel()`
- add tests for `NodeGroupForNode()`
- expose `loadNodeTemplatesFromConfigMap` -> `LoadNodeTemplatesFromConfigMap`
- fix `KwokCloudProvider`'s kwok config was empty (this caused `GPULabel()` to return empty)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: abstract provider ID code into `getProviderID` fn
- fix provider name in test `kwok` -> `kwok:kind-worker-xxx`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: run `go mod vendor` and `go mod tidy
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs(cloudprovider/kwok): update info on creating nodegroups based on `hostname/label`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor(charts): replace fromLabelKey value `"kubernetes.io/hostname"` -> `"kwok-nodegroup"`
- `"kubernetes.io/hostname"` leads to infinite scale-up
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: support running CA with kwok provider locally
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use global informer factory
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use `fromNodeLabelKey: "kwok-nodegroup"` in test templates
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: `Cleanup()` logic
- clean up only nodes managed by the kwok provider
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix/refactor: nodegroup creation logic
- fix issue where fake node was getting created which caused fatal error
- use ng annotation to keep track of nodegroups
- (when creating nodegroups) don't process nodes which don't have the right ng nabel
- suffix ng name with unix timestamp
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor/test(cloudprovider/kwok): write tests for `BuildKwokProvider` and `Cleanup`
- pass only the required node lister to cloud provider instead of the entire informer factory
- pass the required configmap name to `LoadNodeTemplatesFromConfigMap` instead of passing the entire kwok provider config
- implement fake node lister for testing
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add test case for dynamic templates in `TestNodeGroupForNode`
- remove non-required fields from template node
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add tests for `NodeGroups()`
- add extra node template without ng selector label to add more variability in the test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: write tests for `GetNodeGpuConfig()`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add test for `GetAvailableGPUTypes`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add test for `GetResourceLimiter()`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(cloudprovider/kwok): add tests for nodegroup's `IncreaseSize()`
- abstract error msgs into variables to use them in tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(cloudprovider/kwok): add test for ng `DeleteNodes()` fn
- add check for deleting too many nodes
- rename err msg var names to make them consistent
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(cloudprovider/kwok): add tests for ng `DecreaseTargetSize()`
- abstract error msgs into variables (for easy use in tests)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(cloudprovider/kwok): add test for ng `Nodes()`
- add extra test case for `DecreaseTargetSize()` to check lister error
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(cloudprovider/kwok): add test for ng `TemplateNodeInfo`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(cloudprovider/kwok): improve tests for `BuildKwokProvider()`
- add more test cases
- refactor lister for `TestBuildKwokProvider()` and `TestCleanUp()`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(cloudprovider/kwok): add test for ng `GetOptions`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(cloudprovider/kwok): unset `KWOK_CONFIG_MAP_NAME` at the end of the test
- not doing so leads to failure in other tests
- remove `kwokRelease` field from kwok config (not used anymore) - this was causing the tests to fail
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: bump CA chart version
- this is because of changes made related to kwok
- fix type `everwhere` -> `everywhere`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: fix linting checks
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: address CI lint errors
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: generate helm docs for `kwokConfigMapName`
- remove `KWOK_CONFIG_MAP_KEY` (not being used in the code)
- bump helm chart version
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: revise the outline for README
- add AEP link to the motivation doc
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: wip create an outline for the README
- remove `kwok` field from examples (not needed right now)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add outline for ascii gifs
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename env variable `KWOK_CONFIG_MAP_NAME` -> `KWOK_PROVIDER_CONFIGMAP`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: update README with info around installation and benefits of using kwok provider
- add `Kwok` as a provider in main CA README
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: run `go mod vendor`
- remove TODOs that are not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: finish first draft of README
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: env variable in chart `KWOK_CONFIG_MAP_NAME` -> `KWOK_PROVIDER_CONFIGMAP`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove redundant/deprecated code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: bump chart version `9.30.1` -> `9.30.2`
- because of kwok provider related changes
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: fix typo `offical` -> `official`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: remove debug log msg
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add links for getting help
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: fix type in log `external cluster` -> `cluster`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: add newline in chart.yaml to fix CI lint
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: fix mistake `sig-kwok` -> `sig-scheduling`
- kwok is a part if sig-scheduling (there is no sig-kwok)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: fix type `release"` -> `"release"`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: pass informer instead of lister to cloud provider builder fn
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
Copy link
Contributor

@mwielgus mwielgus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 27, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mwielgus, vadasambar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 27, 2023
@k8s-ci-robot k8s-ci-robot merged commit e23e63a into kubernetes:master Nov 27, 2023
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler area/helm-charts area/provider/externalgrpc Issues or PRs related to the External gRPC provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants