Distributive Gloo PyTorchJob example doesn't work #1358

andreyvelich · 2021-08-14T19:09:54Z

/kind bug

I tried to run this PyTorch distributed Gloo backend example with the new Training Controller.
It gets stuck at this step: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/pytorch-mnist/mnist.py#L132

This is the example YAML:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"

This is the source code for this example: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/pytorch-mnist/mnist.py.

This is the output of kubectl describe pod pytorch-master-0 -n kubeflow:

Name:         pytorch-master-0
Namespace:    kubeflow
Priority:     0
Node:         gke-andrey-k8s-cluster-default-pool-72729d2c-5ao2/10.154.0.2
Start Time:   Sat, 14 Aug 2021 19:59:47 +0100
Labels:       group-name=kubeflow.org
              job-name=pytorch
              job-role=master
              replica-index=0
              replica-type=master
Annotations:  <none>
Status:       Running
IP:           10.40.3.36
IPs:
  IP:           10.40.3.36
Controlled By:  PyTorchJob/pytorch
Containers:
  pytorch:
    Container ID:  docker://8b6cf6717832e4713048c15aedfa6dc46deadf1a159fdf3fbea895257f3bac17
    Image:         docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
    Image ID:      docker-pullable://kubeflowkatib/pytorch-mnist@sha256:5164399299fc6ceebcdfa0df5b303a2d63c05776188f55a336c5d3514a4e3227
    Port:          23456/TCP
    Host Port:     0/TCP
    Command:
      python3
      /opt/pytorch-mnist/mnist.py
      --epochs=1
    State:          Running
      Started:      Sat, 14 Aug 2021 19:59:50 +0100
    Ready:          True
    Restart Count:  0
    Environment:
      MASTER_PORT:       23456
      MASTER_ADDR:       pytorch-master-0
      WORLD_SIZE:        2
      RANK:              1
      PYTHONUNBUFFERED:  0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-sjg9v (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  default-token-sjg9v:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-sjg9v
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  2m8s  default-scheduler  Successfully assigned kubeflow/pytorch-master-0 to gke-andrey-k8s-cluster-default-pool-72729d2c-5ao2
  Normal  Pulling    2m7s  kubelet            Pulling image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727"
  Normal  Pulled     2m5s  kubelet            Successfully pulled image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727" in 1.640364168s
  Normal  Created    2m5s  kubelet            Created container pytorch
  Normal  Started    2m5s  kubelet            Started container pytorch

This is the output of kubectl describe pod pytorch-worker-0 -n kubeflow:

Name:         pytorch-worker-0
Namespace:    kubeflow
Priority:     0
Node:         gke-andrey-k8s-cluster-default-pool-72729d2c-5ao2/10.154.0.2
Start Time:   Sat, 14 Aug 2021 19:59:47 +0100
Labels:       group-name=kubeflow.org
              job-name=pytorch
              replica-index=0
              replica-type=worker
Annotations:  <none>
Status:       Running
IP:           10.40.3.37
IPs:
  IP:           10.40.3.37
Controlled By:  PyTorchJob/pytorch
Containers:
  pytorch:
    Container ID:  docker://aa70715410a75e2f13a3f0b6b29686fffc64b2b4d3a0ed99edbe6c1ce85ce5f8
    Image:         docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
    Image ID:      docker-pullable://kubeflowkatib/pytorch-mnist@sha256:5164399299fc6ceebcdfa0df5b303a2d63c05776188f55a336c5d3514a4e3227
    Port:          <none>
    Host Port:     <none>
    Command:
      python3
      /opt/pytorch-mnist/mnist.py
      --epochs=1
    State:          Running
      Started:      Sat, 14 Aug 2021 19:59:52 +0100
    Ready:          True
    Restart Count:  0
    Environment:
      MASTER_PORT:       23456
      MASTER_ADDR:       pytorch-master-0
      WORLD_SIZE:        2
      RANK:              1
      PYTHONUNBUFFERED:  0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-sjg9v (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  default-token-sjg9v:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-sjg9v
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  2m58s  default-scheduler  Successfully assigned kubeflow/pytorch-worker-0 to gke-andrey-k8s-cluster-default-pool-72729d2c-5ao2
  Normal  Pulling    2m56s  kubelet            Pulling image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727"
  Normal  Pulled     2m53s  kubelet            Successfully pulled image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727" in 3.092993949s
  Normal  Created    2m53s  kubelet            Created container pytorch
  Normal  Started    2m53s  kubelet            Started container pytorch


andreyvelich · 2021-08-14T19:10:04Z

cc @kubeflow/wg-training-leads

Jeffwan · 2021-08-14T20:44:09Z

I can reproduce the issue. I SSHed into the pods, and the env injection looks correct.

HOSTNAME=pytorch-master-0
MASTER_PORT=23456
WORLD_SIZE=2
MASTER_ADDR=pytorch-master-0
RANK=1

Let me double-check the port and service settings.

Update: the rank setting is incorrect. Both the master and the worker pods get RANK=1:

k describe pod pytorch-master-0
Name:         pytorch-master-0
Namespace:    kubeflow
Priority:     0
Node:         docker-desktop/192.168.65.4
Start Time:   Sat, 14 Aug 2021 13:37:24 -0700
Labels:       group-name=kubeflow.org
              job-name=pytorch
              job-role=master
              replica-index=0
              replica-type=master

    Environment:
      MASTER_PORT:       23456
      MASTER_ADDR:       pytorch-master-0
      WORLD_SIZE:        2
      RANK:              1
      PYTHONUNBUFFERED:  0


k describe pod pytorch-worker-0
Name:         pytorch-worker-0
Namespace:    kubeflow
Priority:     0
Node:         docker-desktop/192.168.65.4
Start Time:   Sat, 14 Aug 2021 13:37:24 -0700
Labels:       group-name=kubeflow.org
              job-name=pytorch
              replica-index=0
              replica-type=worker

    Environment:
      MASTER_PORT:       23456
      MASTER_ADDR:       pytorch-master-0
      WORLD_SIZE:        2
      RANK:              1
      PYTHONUNBUFFERED:  0
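
The two pod descriptions above show the same RANK on both replicas. As a minimal sketch of how that could be caught, the following Python check assumes the usual torch.distributed convention that ranks are unique and cover 0 to WORLD_SIZE - 1. The function name check_ranks and the dictionary layout are hypothetical and only serve the illustration.

```python
def check_ranks(pod_envs):
    """Return a list of problems found in the per-pod RANK/WORLD_SIZE settings."""
    problems = []
    # All pods of one job should agree on WORLD_SIZE.
    world_sizes = {int(env["WORLD_SIZE"]) for env in pod_envs.values()}
    if len(world_sizes) != 1:
        problems.append(f"inconsistent WORLD_SIZE values: {sorted(world_sizes)}")
        return problems
    world_size = world_sizes.pop()
    # Ranks are expected to be exactly 0, 1, ..., world_size - 1.
    ranks = sorted(int(env["RANK"]) for env in pod_envs.values())
    expected = list(range(world_size))
    if ranks != expected:
        problems.append(f"ranks {ranks} do not match expected {expected}")
    return problems


# Values copied from the pod descriptions in this issue.
observed = {
    "pytorch-master-0": {"WORLD_SIZE": "2", "RANK": "1"},
    "pytorch-worker-0": {"WORLD_SIZE": "2", "RANK": "1"},
}
print(check_ranks(observed))
# prints ['ranks [1, 1] do not match expected [0, 1]']
```

With no process at rank 0, nothing plays the coordinating role during rendezvous, which is consistent with the job getting stuck rather than failing outright.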

Jeffwan · 2021-08-14T21:12:02Z

Figured out the issue.

Originally, the code compared the real rType with pyv1.PyTorchReplicaTypeMaster:

https://github.com/kubeflow/pytorch-operator/blob/a502590d8d340186604e695c55b4cc6cea5cee0d/pkg/controller.v1/pytorch/pod.go#L246

With kubeflow/common, it passes rt, which is the lower-case form of rType:

https://github.com/kubeflow/common/blob/f162091f3ea6b2275635d48116dd67c1b344ef61/pkg/controller.v1/common/pod.go#L367

https://github.com/kubeflow/common/blob/f162091f3ea6b2275635d48116dd67c1b344ef61/pkg/controller.v1/common/pod.go#L339-L340

PyTorch had not been migrated to kubeflow/common before, so this is the first time we see the issue.

/cc @zw0610

Jeffwan · 2021-08-14T21:15:45Z

It is misleading that we mix rt and rType in the code base. https://github.com/kubeflow/common/pull/135/files resolves this, so I think I can bump the dependency to 0.3.5 to fix the issue.

Another way is to manually convert pyv1.PyTorchReplicaTypeMaster to lower case for the string comparison.
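
The mismatch and the lower-casing workaround can be sketched roughly as follows. This is illustrative Python, not the operator's Go code. It assumes the constant's value is "Master" and that non-master replicas get rank index + 1; both assumptions are inferred from the RANK values observed in this issue rather than copied from the linked source, and the function names are hypothetical.

```python
# Assumed value of pyv1.PyTorchReplicaTypeMaster (hypothetical stand-in).
PYTORCH_REPLICA_TYPE_MASTER = "Master"


def rank_buggy(rt, index):
    # rt arrives lower-cased ("master"), so this case-sensitive
    # comparison never matches and the master is treated as a worker.
    if rt == PYTORCH_REPLICA_TYPE_MASTER:
        return index
    return index + 1


def rank_fixed(rt, index):
    # Lower-case the constant before comparing, as suggested above.
    if rt == PYTORCH_REPLICA_TYPE_MASTER.lower():
        return index
    return index + 1


print(rank_buggy("master", 0), rank_buggy("worker", 0))  # prints 1 1
print(rank_fixed("master", 0), rank_fixed("worker", 0))  # prints 0 1
```

Under these assumptions, the buggy comparison reproduces RANK=1 on both pods, while the lower-cased comparison gives the master rank 0 and the worker rank 1.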

andreyvelich · 2021-08-14T21:34:58Z

Sounds good, thank you for the investigation @Jeffwan!
I think we should bump the dependency.

Jeffwan · 2021-08-14T21:40:11Z

I have decided to go with strings.ToLower() for now. I noticed that the change in https://github.com/kubeflow/common/pull/135/files (v0.3.5) may bring some issues. We use the same approach for the BytePS case here:

https://github.com/kubeflow/tf-operator/blob/52cddeceba1e31e54a2e34551f486675fafabda2/pkg/controller.v1/mxnet/mxnet.go#L226-L233

@andreyvelich Please use this image for testing: kubeflow/training-operator:4d4cf6485eb40d4e6e4badb03d341b8b78c2ec92

Jeffwan · 2021-08-14T21:47:06Z

/priority p0

andreyvelich · 2021-08-14T22:20:42Z

@Jeffwan Yes, this image works.
I also checked that the Katib example with PyTorch works: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/pytorchjob-example.yaml.
Thank you!

Jeffwan · 2021-08-16T03:26:29Z

The fix has been merged and the manifest is up to date. We can close the issue.

google-oss-robot added the kind/bug label Aug 14, 2021

Jeffwan mentioned this issue Aug 14, 2021

fix incorrect torch env population #1361

Merged

google-oss-robot added the priority/p0 label Aug 14, 2021

Jeffwan self-assigned this Aug 14, 2021

Jeffwan closed this as completed Aug 16, 2021
