[BUG] - AWS deployment failing due to old auto-scaler helm chart #1302

Closed
tylerpotts opened this issue May 27, 2022 · 3 comments

@tylerpotts
Contributor

tylerpotts commented May 27, 2022

OS system and architecture in which you are running QHub

Ubuntu 20

Expected behavior

Expect qhub to deploy

Actual behavior

QHub deployment fails with this error:

[terraform]: │ Error: unable to build kubernetes objects from release manifest: [unable to recognize "": no matches for kind "ClusterRole" in version "rbac.authorization.k8s.io/v1beta1", unable to recognize "": no matches for kind "ClusterRoleBinding" in version "rbac.authorization.k8s.io/v1beta1", unable to recognize "": no matches for kind "Role" in version "rbac.authorization.k8s.io/v1beta1", unable to recognize "": no matches for kind "RoleBinding" in version "rbac.authorization.k8s.io/v1beta1"]
[terraform]: │ 
[terraform]: │   with module.kubernetes-autoscaling[0].helm_release.autoscaler,
[terraform]: │   on modules/cluster-autoscaler/main.tf line 1, in resource "helm_release" "autoscaler":
[terraform]: │    1: resource "helm_release" "autoscaler" {
[terraform]: │ 
[terraform]: ╵
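
As a quick sanity check (assuming kubectl access to the cluster), you can confirm that the v1beta1 RBAC API group is no longer served:

$ kubectl api-versions | grep rbac
rbac.authorization.k8s.io/v1

On Kubernetes 1.22+ only the v1 group is listed, so any chart that still renders rbac.authorization.k8s.io/v1beta1 manifests fails exactly as above.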

How to Reproduce the problem?

Deploy qhub version 0.4.1

Command output

No response

Versions and dependencies used.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.6-eks-14c7a48", GitCommit:"35f06c94ad99b78216a3d8e55e04734a85da3f7b", GitTreeState:"clean", BuildDate:"2022-04-01T03:18:05Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
$ conda --version
conda 4.11.0
$ qhub --version
0.4.1

Anything else?

Updating qhub/qhub/template/stages/03-kubernetes-initialize/modules/cluster-autoscaler/main.tf with the following worked (the surrounding resource block is the one named in the error above):

  resource "helm_release" "autoscaler" {
    # other attributes left as they were; only the chart source changes
    repository = "https://kubernetes.github.io/autoscaler"
    chart      = "cluster-autoscaler"
    version    = "9.18.1"
  }
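
With that change in place, re-running the deployment picks up the new chart. A typical invocation for 0.4.x (assuming the default config file name) would be:

$ qhub deploy -c qhub-config.yaml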

QHub is currently pinned to version 7.1.0 of the stable autoscaler chart, whose repo tops out at 8.0.0. Version 7.1.0 is still trying to use rbac.authorization.k8s.io/v1beta1, and this beta API has been removed in Kubernetes 1.22. The latest stable release (8.0.0) does fix the problem, but that repo is no longer being updated. I think it's a better idea to use the actively updated chart here: https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler
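
To compare the chart versions locally (assuming helm is installed; the repo alias below matches the helm search output), the new repo can be added with:

$ helm repo add cluster-autoscaler https://kubernetes.github.io/autoscaler
$ helm repo update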

By looking at the two charts, we can see that the current autoscaler helm chart (app version 1.17.1) is quite far behind the default Kubernetes version of 1.22 that QHub uses.

$ helm search repo cluster-autoscaler --versions
NAME                                       	CHART VERSION	APP VERSION	DESCRIPTION                                       
cluster-autoscaler/cluster-autoscaler      	9.18.1       	1.23.0     	Scales Kubernetes worker nodes within autoscali...
...
stable/cluster-autoscaler                  	8.0.0        	1.17.1     	Scales worker nodes within autoscaling groups.    
stable/cluster-autoscaler                  	7.3.4        	1.17.1     	Scales worker nodes within autoscaling groups.    
stable/cluster-autoscaler                  	7.3.3        	1.17.1     	Scales worker nodes within autoscaling groups.    

I propose shifting to this alternate autoscaler chart moving forward.

@tylerpotts added the type: bug 🐛 label on May 27, 2022
@viniciusdc
Contributor

Wonderful, thanks for opening this issue @tylerpotts. Indeed, I came to realize this last Friday when some weird behavior started showing up with AWS. Thanks for the detailed information.

@tylerpotts
Contributor Author

@viniciusdc Happy to help! I did notice when I finished my deployment that the Dask status link is broken. I suspect it's a Traefik routing issue, because when I spin up a cluster and click on the task graph link it gives me the "service unavailable" message, which is a default Traefik error.

I'm not sure whether this new error is related to the new autoscaler or not.

@viniciusdc
Contributor

viniciusdc commented May 30, 2022

Uhm, dunno. I think that's a different one. Could you check which cert is currently showing up for the dashboard page? Is it an autogenerated one from Traefik/Let's Encrypt?
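
If it helps, a quick way to inspect the cert being served (a sketch; replace the placeholder with the actual deployment domain):

$ echo | openssl s_client -connect <qhub-domain>:443 2>/dev/null | openssl x509 -noout -subject -issuer

Traefik's autogenerated default cert shows CN "TRAEFIK DEFAULT CERT", while a Let's Encrypt cert lists Let's Encrypt as the issuer.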
