I'm currently trying to use Volcano for PyTorch training jobs, using the example from the Kubeflow website: https://www.kubeflow.org/docs/components/training/pytorch/#monitoring-a-pytorchjob . However, when the PyTorchJob was submitted, the training operator logged this error:
2022-02-28T09:49:55.707Z ERROR Reconcile PyTorchJob error {"pytorchjob": "default/pytorch-tcp-dist-mnist", "error": "PyTorchJob.kubeflow.org \"pytorch-tcp-dist-mnist\" is invalid: status.replicaStatuses: Invalid value: \"null\": status.replicaStatuses in body must be of type object: \"null\""}
The error started appearing once --enable-gang-scheduling=true was set. Despite the error, the training process still runs.
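For what it's worth, the message reads like a schema type check rejecting a null where an object is expected. Below is a minimal Python sketch of that kind of check, purely to illustrate my reading of the error. The names `check_replica_statuses` and `normalize_status` are hypothetical, not functions from the training operator or Kubernetes, and I have not confirmed that this is the actual cause.

```python
import json


def check_replica_statuses(status):
    """Mimic an 'must be of type object' check on status.replicaStatuses.

    Hypothetical helper: returns an error string when the field is not
    a JSON object (a dict in Python), otherwise returns None.
    """
    value = status.get("replicaStatuses")
    if isinstance(value, dict):
        return None
    return (
        "status.replicaStatuses in body must be of type object: "
        + json.dumps(value)
    )


def normalize_status(status):
    """Return a copy of status where a null/missing replicaStatuses becomes {}.

    Hypothetical helper: shows that an empty object passes the same check
    that a null fails.
    """
    fixed = dict(status)
    if fixed.get("replicaStatuses") is None:
        fixed["replicaStatuses"] = {}
    return fixed


if __name__ == "__main__":
    broken = {"replicaStatuses": None}  # serializes to null in JSON
    print(check_replica_statuses(broken))
    print(check_replica_statuses(normalize_status(broken)))
```

If this reading is right, the status update would pass validation when the field is written as an empty object instead of null, but that is an assumption on my side.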
The training operator version is 6c115f6e00e3f2c979c6aa4bf2d93906a646b99d and the Volcano version is v1.4.0-Beta-10c65af0.
Any hints or help would be appreciated!