When launching the public PyTorch examples for JobSet (doc link), I had to change the MASTER_ADDR value from pytorch-workers-0-0.pytorch-workers to pytorch-workers-0-0.pytorch.train.svc.cluster.local before the pods could connect.
Is this something specific to my installation? Or do the examples need to be updated?
Exact error log from resnet.yaml
[W socket.cpp:558] [c10d] The IPv6 network addresses of (pytorch-workers-0-0.pytorch-workers, 3389) cannot be retrieved (gai error: -2 - Name or service not known).
Setup
I launched the jobset into a specific namespace named train:
kubectl apply -f jobset.yaml -n train --server-side
We are using v0.5.2 of jobset:
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.5.2/manifests.yaml
I have verified that the headless service resource exists on jobset creation. I have also verified that the jobset-controller-manager is in a healthy state.
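To help narrow down whether this is a DNS resolution problem, here is a minimal Python sketch that tries to resolve both hostnames from this report. The helper name check_resolution is hypothetical and not part of jobset or PyTorch. The port 3389 is taken from the error log above. The resolver argument exists only so the lookup can be swapped out for testing.

```python
import socket


def check_resolution(hostnames, port=3389, resolver=socket.getaddrinfo):
    """Try to resolve each hostname.

    Returns a dict mapping each hostname to a sorted list of resolved
    IP addresses, or to an "unresolved: ..." string if the lookup fails.
    """
    results = {}
    for name in hostnames:
        try:
            infos = resolver(name, port, proto=socket.IPPROTO_TCP)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the IP address for both IPv4 and IPv6.
            results[name] = sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            # This is the same class of failure as "gai error: -2" in the log.
            results[name] = f"unresolved: {exc}"
    return results


# Hostnames copied from this issue; adjust them to your own jobset.
CANDIDATES = [
    "pytorch-workers-0-0.pytorch-workers",
    "pytorch-workers-0-0.pytorch.train.svc.cluster.local",
]
```

Calling check_resolution(CANDIDATES) from a Python shell inside one of the worker pods should show which of the two names the cluster DNS can actually resolve. If only the fully qualified name resolves, that would suggest the short name in the example does not match the headless service created in this installation.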