[BUG] - AWS deployment failing due to old auto-scaler helm chart #1302

Closed
tylerpotts opened this issue May 27, 2022 · 3 comments

@tylerpotts
Contributor

tylerpotts commented May 27, 2022

OS system and architecture in which you are running QHub

Ubuntu 20

Expected behavior

Expect qhub to deploy

Actual behavior

QHub deployment fails with this error:

[terraform]: │ Error: unable to build kubernetes objects from release manifest: [unable to recognize "": no matches for kind "ClusterRole" in version "rbac.authorization.k8s.io/v1beta1", unable to recognize "": no matches for kind "ClusterRoleBinding" in version "rbac.authorization.k8s.io/v1beta1", unable to recognize "": no matches for kind "Role" in version "rbac.authorization.k8s.io/v1beta1", unable to recognize "": no matches for kind "RoleBinding" in version "rbac.authorization.k8s.io/v1beta1"]
[terraform]: │ 
[terraform]: │   with module.kubernetes-autoscaling[0].helm_release.autoscaler,
[terraform]: │   on modules/cluster-autoscaler/main.tf line 1, in resource "helm_release" "autoscaler":
[terraform]: │    1: resource "helm_release" "autoscaler" {
[terraform]: │ 
[terraform]: ╵
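
As a quick sanity check (assuming kubectl access to the cluster), you can confirm that the v1beta1 RBAC API group is no longer served:

$ kubectl api-versions | grep rbac
rbac.authorization.k8s.io/v1

On Kubernetes 1.22+ only the v1 group is listed, so any chart that still renders rbac.authorization.k8s.io/v1beta1 manifests fails exactly as above.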

How to Reproduce the problem?

Deploy qhub version 0.4.1

Command output

No response

Versions and dependencies used.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.6-eks-14c7a48", GitCommit:"35f06c94ad99b78216a3d8e55e04734a85da3f7b", GitTreeState:"clean", BuildDate:"2022-04-01T03:18:05Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
$ conda --version
conda 4.11.0
$ qhub --version
0.4.1

Anything else?

Updating qhub/qhub/template/stages/03-kubernetes-initialize/modules/cluster-autoscaler/main.tf with the following worked (the surrounding resource block is the one named in the error above):

  resource "helm_release" "autoscaler" {
    # other attributes left as they were; only the chart source changes
    repository = "https://kubernetes.github.io/autoscaler"
    chart      = "cluster-autoscaler"
    version    = "9.18.1"
  }
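
With that change in place, re-running the deployment picks up the new chart. A typical invocation for 0.4.x (assuming the default config file name) would be:

$ qhub deploy -c qhub-config.yaml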

QHub is currently pinned to version 7.1.0 of the stable autoscaler chart, whose repo tops out at 8.0.0. Version 7.1.0 is still trying to use rbac.authorization.k8s.io/v1beta1, and this beta API has been removed in Kubernetes 1.22. The latest stable release (8.0.0) does fix the problem, but that repo is no longer being updated. I think it's a better idea to use the actively updated chart here: https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler
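
To compare the chart versions locally (assuming helm is installed; the repo alias below matches the helm search output), the new repo can be added with:

$ helm repo add cluster-autoscaler https://kubernetes.github.io/autoscaler
$ helm repo update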

By looking at the two charts, we can see that the current autoscaler helm chart (app version 1.17.1) is quite far behind the default Kubernetes version of 1.22 that QHub uses.

$ helm search repo cluster-autoscaler --versions
NAME                                       	CHART VERSION	APP VERSION	DESCRIPTION                                       
cluster-autoscaler/cluster-autoscaler      	9.18.1       	1.23.0     	Scales Kubernetes worker nodes within autoscali...
...
stable/cluster-autoscaler                  	8.0.0        	1.17.1     	Scales worker nodes within autoscaling groups.    
stable/cluster-autoscaler                  	7.3.4        	1.17.1     	Scales worker nodes within autoscaling groups.    
stable/cluster-autoscaler                  	7.3.3        	1.17.1     	Scales worker nodes within autoscaling groups.    

I propose shifting to this alternate autoscaler chart moving forward.

@tylerpotts added the type: bug 🐛 label on May 27, 2022
@viniciusdc
Contributor

Wonderful, thanks for opening this issue @tylerpotts. Indeed, I came to realize this last Friday when some weird behavior started showing up with AWS. Thanks for the detailed information.

@tylerpotts
Contributor Author

@viniciusdc Happy to help! I did notice when I finished my deployment that the Dask status link is broken. I suspect it's a Traefik routing issue, because when I spin up a cluster and click on the task graph link it gives me the "service unavailable" message, which is a default Traefik error.

I'm not sure whether this new error is related to the new autoscaler or not.

@viniciusdc
Contributor

viniciusdc commented May 30, 2022

Uhm, dunno. I think that's a different one. Could you check which cert is currently showing up for the dashboard page? Is it an autogenerated one from Traefik/Let's Encrypt?
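
If it helps, a quick way to inspect the cert being served (a sketch; replace the placeholder with the actual deployment domain):

$ echo | openssl s_client -connect <qhub-domain>:443 2>/dev/null | openssl x509 -noout -subject -issuer

Traefik's autogenerated default cert shows CN "TRAEFIK DEFAULT CERT", while a Let's Encrypt cert lists Let's Encrypt as the issuer.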
