Three issues when upgrading from Kubeflow 1.3 to 1.5: Training Operator, Katib and artifacts #382

Closed
pablofiumara-azumo opened this issue Aug 23, 2022 · 10 comments

@pablofiumara-azumo

  1. Training operator: When executing make apply, I get:
Resource: "apiextensions.k8s.io/v1, Resource=customresourcedefinitions", GroupVersionKind: "apiextensions.k8s.io/v1, Kind=CustomResourceDefinition"
Name: "tfjobs.kubeflow.org", Namespace: ""
for: "apps/training-operator/build/apiextensions.k8s.io_v1_customresourcedefinition_tfjobs.kubeflow.org.yaml": CustomResourceDefinition.apiextensions.k8s.io "tfjobs.kubeflow.org" is invalid: spec.preserveUnknownFields: Invalid value: true: must be false in order to use defaults in the schema
make: *** [Makefile:83: apply] Error 1
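
This error typically appears when the CRD already in the cluster still carries `spec.preserveUnknownFields: true` (often a leftover from an older `v1beta1` definition) while the new manifest uses schema defaults. One workaround sometimes used, which nobody in this thread confirmed, is to set the field to `false` on the live CRD before re-running `make apply`. A minimal sketch follows; the function names are hypothetical, and the patch can itself be rejected if the live CRD has no structural schema.

```shell
# Hypothetical helper: emit the merge patch that flips preserveUnknownFields.
crd_patch_json() {
  printf '%s' '{"spec":{"preserveUnknownFields":false}}'
}

# Hypothetical helper: patch one CRD in place (requires cluster access).
# The CRD name is passed as an argument rather than hard-coded.
patch_crd_preserve_unknown_fields() {
  crd_name="$1"
  kubectl patch crd "$crd_name" --type=merge -p "$(crd_patch_json)"
}

# Usage, assuming kubectl points at the affected cluster:
#   patch_crd_preserve_unknown_fields tfjobs.kubeflow.org
#   make apply
```

Other training-operator CRDs may report the same error one at a time; the same call can be repeated with each name the error message reports.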

  2. Katib

If I execute

kubectl logs katib-db-manager -n kubeflow
I get

E0823 23:47:01.897501       1 mysql.go:78] Ping to Katib db failed: dial tcp oneIp:3306: connect: connection refused
E0823 23:47:06.889384       1 mysql.go:78] Ping to Katib db failed: dial tcp oneIp:3306: connect: connection refused
E0823 23:47:11.881359       1 mysql.go:78] Ping to Katib db failed: dial tcp oneIp:3306: connect: connection refused
E0823 23:47:16.873428       1 mysql.go:78] Ping to Katib db failed: dial tcp oneIp:3306: connect: connection refused
E0823 23:47:21.929521       1 mysql.go:78] Ping to Katib db failed: dial tcp oneIp:3306: connect: connection refused
E0823 23:47:26.921382       1 mysql.go:78] Ping to Katib db failed: dial tcp oneIp:3306: connect: connection refused
E0823 23:47:31.913463       1 mysql.go:78] Ping to Katib db failed: dial tcp oneIp:3306: connect: connection refused
E0823 23:47:36.905402       1 mysql.go:78] Ping to Katib db failed: dial tcp oneIp:3306: connect: connection refused
E0823 23:47:41.897387       1 mysql.go:78] Ping to Katib db failed: dial tcp oneIp:3306: connect: connection refused
E0823 23:47:46.889461       1 mysql.go:78] Ping to Katib db failed: dial tcp oneIp:3306: connect: connection refused
E0823 23:47:51.881540       1 mysql.go:78] Ping to Katib db failed: dial tcp oneIp:3306: connect: connection refused
E0823 23:47:56.873399       1 mysql.go:78] Ping to Katib db failed: dial tcp oneIp:3306: connect: connection refused
F0823 23:47:56.873464       1 main.go:99] Failed to open db connection: DB open failed: Timeout waiting for DB conn successfully opened.
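
The log above repeats one failure until the manager gives up. A small filter can condense such output to the number of failed pings and the address that refused the connection. This is only a sketch: the function names are hypothetical, and the patterns assume the exact message wording shown in the log above.

```shell
# Count failed DB pings in katib-db-manager log text read from stdin.
# grep -c exits non-zero when it finds nothing, hence the "|| true".
count_ping_failures() {
  grep -c 'Ping to Katib db failed' || true
}

# Print each distinct host:port that refused the connection.
refused_targets() {
  sed -n 's/.*dial tcp \([^ ]*\): connect: connection refused.*/\1/p' | sort -u
}

# Usage, assuming kubectl points at the affected cluster:
#   kubectl logs katib-db-manager -n kubeflow | count_ping_failures
#   kubectl logs katib-db-manager -n kubeflow | refused_targets
```

If the address printed belongs to the in-cluster `katib-mysql` Service, the failure is a symptom of the MySQL pod not starting rather than a problem in the manager itself.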

If I execute

kubectl describe pod katib-mysql -n kubeflow
I get

  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Normal   Scheduled               33m                default-scheduler        Successfully assigned kubeflow/katib-mysql to gke-kubeflowNameCluster-default-pool
  Normal   SuccessfulAttachVolume  33m                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-aoneId"
  Normal   Pulled                  32m (x4 over 33m)  kubelet                  Container image "mysql:8.0.26" already present on machine
  Normal   Created                 32m (x4 over 33m)  kubelet                  Created container katib-mysql
  Normal   Started                 32m (x4 over 33m)  kubelet                  Started container katib-mysql
  Warning  Unhealthy               32m (x3 over 33m)  kubelet                  Startup probe failed: mysqladmin: [Warning] Using a password on the command line interface can be insecure.
mysqladmin: connect to server at 'localhost' failed
error: 'Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)'
Check that mysqld is running and that the socket: '/var/run/mysqld/mysqld.sock' exists!
  Warning  BackOff  3m10s (x158 over 33m)  kubelet  Back-off restarting failed container
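
The events show the startup probe failing because `mysqld` never creates its socket, after which the container is restarted. The events alone do not say why `mysqld` exits; the logs of the previously crashed container usually do. The sketch below uses a hypothetical helper that lists the distinct warning reasons from `kubectl describe` output, and it assumes the event table layout shown above (type in the first column, reason in the second).

```shell
# List the distinct reasons of Warning events from "kubectl describe pod"
# output read from stdin. Continuation lines of multi-line messages are
# skipped because their first field is not "Warning".
warning_reasons() {
  awk '$1 == "Warning" { print $2 }' | sort -u
}

# Usage, assuming kubectl points at the affected cluster:
#   kubectl describe pod katib-mysql -n kubeflow | warning_reasons
#
# To see why mysqld exited, read the logs of the last crashed container:
#   kubectl logs katib-mysql -n kubeflow --previous
```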

  3. Artifacts: I get the following error

[image: screenshot of the error, not preserved]

@pablofiumara

@zijianjoy Can you share your technical opinion about this, please?

@pablofiumara

@fabito Can you share your technical opinion about this, please?

@chensun
Member

chensun commented Sep 22, 2022

@pablofiumara do you use the Cloud SQL DB or the in-cluster DB?

@pablofiumara

@chensun Thank you very much for your answer. I use Cloud SQL DB

@zijianjoy
Collaborator

@pablofiumara One consideration is that we might want to upgrade the MySQL version to MYSQL_8_0: https://github.com/kubeflow/gcp-blueprints/blob/master/kubeflow/common/managed-storage/cloudsql/sql-instance.yaml#L24. Can you try this and verify whether Katib can connect to the new database?
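
Before changing the manifest, it can help to check which version the existing Cloud SQL instance reports. The sketch below is an assumption-laden illustration, not part of the blueprint: both function names are hypothetical, `INSTANCE_NAME` is a placeholder, and the prefix check assumes version strings of the form `MYSQL_8_0` or `MYSQL_8_0_<minor>`.

```shell
# Succeed (exit 0) when a Cloud SQL databaseVersion string denotes MySQL 8.0.
is_mysql_8_0() {
  case "$1" in
    MYSQL_8_0*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: print the databaseVersion of a named instance.
# Requires an authenticated gcloud with access to the project.
cloudsql_version() {
  gcloud sql instances describe "$1" --format='value(databaseVersion)'
}

# Usage (INSTANCE_NAME is a placeholder):
#   if is_mysql_8_0 "$(cloudsql_version INSTANCE_NAME)"; then
#     echo "instance already runs MySQL 8.0"
#   fi
```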

@zijianjoy
Collaborator

cc @gkcalat for upgrading the MySQL version in a future release.

@pablofiumara

@zijianjoy Thank you very much for your answer. I will try that

@gkcalat gkcalat mentioned this issue Sep 27, 2022
@gkcalat
Contributor

gkcalat commented Oct 3, 2022

@pablofiumara were you able to fix this issue?

@pablofiumara

@gkcalat I have not had time to work on that

@gkcalat
Contributor

gkcalat commented Oct 17, 2022

FYI, @pablofiumara we have upgraded Cloud SQL in Kubeflow v1.6.1. The corresponding KFP v2.0.0-alpha.6 also uses MySQL 8.0.

@gkcalat gkcalat closed this as completed Apr 3, 2023