Distributive Gloo PyTorchJob example doesn't work #1358

Closed
andreyvelich opened this issue Aug 14, 2021 · 9 comments

@andreyvelich
Member

/kind bug

I tried to run this PyTorch distributed Gloo backend example with the new Training Controller.
It gets stuck at this step: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/pytorch-mnist/mnist.py#L132

This is the example YAML:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"

This is the source code for this example: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/pytorch-mnist/mnist.py.

This is the output of kubectl describe pod pytorch-master-0 -n kubeflow:

Name:         pytorch-master-0
Namespace:    kubeflow
Priority:     0
Node:         gke-andrey-k8s-cluster-default-pool-72729d2c-5ao2/10.154.0.2
Start Time:   Sat, 14 Aug 2021 19:59:47 +0100
Labels:       group-name=kubeflow.org
              job-name=pytorch
              job-role=master
              replica-index=0
              replica-type=master
Annotations:  <none>
Status:       Running
IP:           10.40.3.36
IPs:
  IP:           10.40.3.36
Controlled By:  PyTorchJob/pytorch
Containers:
  pytorch:
    Container ID:  docker://8b6cf6717832e4713048c15aedfa6dc46deadf1a159fdf3fbea895257f3bac17
    Image:         docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
    Image ID:      docker-pullable://kubeflowkatib/pytorch-mnist@sha256:5164399299fc6ceebcdfa0df5b303a2d63c05776188f55a336c5d3514a4e3227
    Port:          23456/TCP
    Host Port:     0/TCP
    Command:
      python3
      /opt/pytorch-mnist/mnist.py
      --epochs=1
    State:          Running
      Started:      Sat, 14 Aug 2021 19:59:50 +0100
    Ready:          True
    Restart Count:  0
    Environment:
      MASTER_PORT:       23456
      MASTER_ADDR:       pytorch-master-0
      WORLD_SIZE:        2
      RANK:              1
      PYTHONUNBUFFERED:  0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-sjg9v (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  default-token-sjg9v:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-sjg9v
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  2m8s  default-scheduler  Successfully assigned kubeflow/pytorch-master-0 to gke-andrey-k8s-cluster-default-pool-72729d2c-5ao2
  Normal  Pulling    2m7s  kubelet            Pulling image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727"
  Normal  Pulled     2m5s  kubelet            Successfully pulled image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727" in 1.640364168s
  Normal  Created    2m5s  kubelet            Created container pytorch
  Normal  Started    2m5s  kubelet            Started container pytorch

This is the output of kubectl describe pod pytorch-worker-0 -n kubeflow:

Name:         pytorch-worker-0
Namespace:    kubeflow
Priority:     0
Node:         gke-andrey-k8s-cluster-default-pool-72729d2c-5ao2/10.154.0.2
Start Time:   Sat, 14 Aug 2021 19:59:47 +0100
Labels:       group-name=kubeflow.org
              job-name=pytorch
              replica-index=0
              replica-type=worker
Annotations:  <none>
Status:       Running
IP:           10.40.3.37
IPs:
  IP:           10.40.3.37
Controlled By:  PyTorchJob/pytorch
Containers:
  pytorch:
    Container ID:  docker://aa70715410a75e2f13a3f0b6b29686fffc64b2b4d3a0ed99edbe6c1ce85ce5f8
    Image:         docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
    Image ID:      docker-pullable://kubeflowkatib/pytorch-mnist@sha256:5164399299fc6ceebcdfa0df5b303a2d63c05776188f55a336c5d3514a4e3227
    Port:          <none>
    Host Port:     <none>
    Command:
      python3
      /opt/pytorch-mnist/mnist.py
      --epochs=1
    State:          Running
      Started:      Sat, 14 Aug 2021 19:59:52 +0100
    Ready:          True
    Restart Count:  0
    Environment:
      MASTER_PORT:       23456
      MASTER_ADDR:       pytorch-master-0
      WORLD_SIZE:        2
      RANK:              1
      PYTHONUNBUFFERED:  0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-sjg9v (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  default-token-sjg9v:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-sjg9v
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  2m58s  default-scheduler  Successfully assigned kubeflow/pytorch-worker-0 to gke-andrey-k8s-cluster-default-pool-72729d2c-5ao2
  Normal  Pulling    2m56s  kubelet            Pulling image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727"
  Normal  Pulled     2m53s  kubelet            Successfully pulled image "docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727" in 3.092993949s
  Normal  Created    2m53s  kubelet            Created container pytorch
  Normal  Started    2m53s  kubelet            Started container pytorch
@andreyvelich
Member Author

cc @kubeflow/wg-training-leads

@Jeffwan
Member

Jeffwan commented Aug 14, 2021

I can reproduce the issue. I SSHed into the pods, and the env injection looks correct at first glance.

HOSTNAME=pytorch-master-0
MASTER_PORT=23456
WORLD_SIZE=2
MASTER_ADDR=pytorch-master-0
RANK=1

Let me double-check the port and service settings.


Update: the rank setting is incorrect. Both the master and the worker pods get RANK=1:

k describe pod pytorch-master-0
Name:         pytorch-master-0
Namespace:    kubeflow
Priority:     0
Node:         docker-desktop/192.168.65.4
Start Time:   Sat, 14 Aug 2021 13:37:24 -0700
Labels:       group-name=kubeflow.org
              job-name=pytorch
              job-role=master
              replica-index=0
              replica-type=master

    Environment:
      MASTER_PORT:       23456
      MASTER_ADDR:       pytorch-master-0
      WORLD_SIZE:        2
      RANK:              1
      PYTHONUNBUFFERED:  0


 k describe pod pytorch-worker-0
Name:         pytorch-worker-0
Namespace:    kubeflow
Priority:     0
Node:         docker-desktop/192.168.65.4
Start Time:   Sat, 14 Aug 2021 13:37:24 -0700
Labels:       group-name=kubeflow.org
              job-name=pytorch
              replica-index=0
              replica-type=worker

    Environment:
      MASTER_PORT:       23456
      MASTER_ADDR:       pytorch-master-0
      WORLD_SIZE:        2
      RANK:              1
      PYTHONUNBUFFERED:  0
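
Why identical ranks are a problem: with environment-variable initialization, torch.distributed expects every process to have a unique RANK in the range 0 to WORLD_SIZE-1, and the rank 0 process hosts the rendezvous at MASTER_ADDR:MASTER_PORT. With two pods both at RANK=1, nobody takes rank 0, which is consistent with the hang reported above. The check below is only an illustrative sketch; validateRanks is a hypothetical helper and not part of the operator.

```go
package main

import "fmt"

// validateRanks is a hypothetical helper. It checks that the given ranks
// are exactly the values 0..worldSize-1 with no duplicates.
func validateRanks(ranks []int, worldSize int) error {
	if len(ranks) != worldSize {
		return fmt.Errorf("got %d ranks, want %d", len(ranks), worldSize)
	}
	seen := make(map[int]bool, worldSize)
	for _, r := range ranks {
		if r < 0 || r >= worldSize {
			return fmt.Errorf("rank %d is outside [0, %d)", r, worldSize)
		}
		if seen[r] {
			return fmt.Errorf("rank %d is assigned more than once", r)
		}
		seen[r] = true
	}
	return nil
}

func main() {
	// Ranks as observed in the two pods above (master=1, worker=1).
	fmt.Println(validateRanks([]int{1, 1}, 2)) // rank 1 is assigned more than once
	// Ranks one would expect for one master and one worker.
	fmt.Println(validateRanks([]int{0, 1}, 2)) // <nil>
}
```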

@Jeffwan
Member

Jeffwan commented Aug 14, 2021

Figured out the issue.

Originally, the code compared the real rType with pyv1.PyTorchReplicaTypeMaster:

https://github.com/kubeflow/pytorch-operator/blob/a502590d8d340186604e695c55b4cc6cea5cee0d/pkg/controller.v1/pytorch/pod.go#L246

With kubeflow/common, it passes rt, which is the lowercase form of rType:

https://github.com/kubeflow/common/blob/f162091f3ea6b2275635d48116dd67c1b344ef61/pkg/controller.v1/common/pod.go#L367

https://github.com/kubeflow/common/blob/f162091f3ea6b2275635d48116dd67c1b344ef61/pkg/controller.v1/common/pod.go#L339-L340

PyTorch had not been migrated to kubeflow/common before, so this is the first time we have seen the issue.
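
The mismatch can be reproduced in a few lines. This is a simplified sketch, not the operator's actual code: ReplicaType, ReplicaTypeMaster, and rankFor are hypothetical stand-ins, and the rank rule (the master keeps its index, every other replica gets index + 1) is an assumption based on the linked pytorch-operator code.

```go
package main

import "fmt"

// ReplicaType is a hypothetical stand-in for the operator's replica type.
type ReplicaType string

// ReplicaTypeMaster uses the capitalized spelling, as in the job spec.
const ReplicaTypeMaster ReplicaType = "Master"

// rankFor applies the assumed rule with a case-sensitive comparison:
// the master keeps its index, any other replica gets index + 1.
func rankFor(rtype ReplicaType, index int) int {
	if rtype == ReplicaTypeMaster {
		return index
	}
	return index + 1
}

func main() {
	fmt.Println(rankFor("Master", 0)) // 0: capitalized type matches the constant
	fmt.Println(rankFor("master", 0)) // 1: lowercased type falls through to the worker branch
	fmt.Println(rankFor("worker", 0)) // 1: same rank as the mis-detected master
}
```

Once the caller passes the lowercased name, the master branch is never taken, so the master and the first worker end up with the same rank.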

/cc @zw0610

@Jeffwan
Member

Jeffwan commented Aug 14, 2021

This is misleading because we mix rt and rType in the code base. https://github.com/kubeflow/common/pull/135/files resolves this, so I think I can bump the dependency to 0.3.5 to fix the issue.

Another way is to manually convert pyv1.PyTorchReplicaTypeMaster to lowercase for the string comparison.

@andreyvelich
Member Author

Sounds good, thank you for the investigation @Jeffwan!
I think we should bump the dependency.

@Jeffwan
Member

Jeffwan commented Aug 14, 2021

I have decided to go with strings.ToLower() for now. I noticed that the change in https://github.com/kubeflow/common/pull/135/files (v0.3.5) may introduce some issues. We already use the same approach for the BytePS case here:

https://github.com/kubeflow/tf-operator/blob/52cddeceba1e31e54a2e34551f486675fafabda2/pkg/controller.v1/mxnet/mxnet.go#L226-L233
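
A minimal sketch of the strings.ToLower() approach, assuming the replica type reaches the comparison already lowercased (the rt value mentioned earlier). ReplicaType, ReplicaTypeMaster, and isMaster are hypothetical names for illustration and do not reproduce the merged fix.

```go
package main

import (
	"fmt"
	"strings"
)

// ReplicaType is a hypothetical stand-in for the operator's replica type.
type ReplicaType string

// ReplicaTypeMaster uses the capitalized spelling, as in the job spec.
const ReplicaTypeMaster ReplicaType = "Master"

// isMaster lowercases the constant before comparing, so it matches the
// lowercased replica type (rt) that the caller passes in.
func isMaster(rt string) bool {
	return rt == strings.ToLower(string(ReplicaTypeMaster))
}

func main() {
	fmt.Println(isMaster("master")) // true
	fmt.Println(isMaster("worker")) // false
}
```

Lowercasing only the constant keeps the comparison exact and matches what the caller actually sends.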

@andreyvelich Please use this image for testing. kubeflow/training-operator:4d4cf6485eb40d4e6e4badb03d341b8b78c2ec92

@Jeffwan
Member

Jeffwan commented Aug 14, 2021

/priority p0

@andreyvelich
Member Author

@Jeffwan Yes, this image works.
I also checked that the Katib example with PyTorch works: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/pytorchjob-example.yaml.
Thank you!

@Jeffwan
Member

Jeffwan commented Aug 16, 2021

The fix has been merged and the manifest is up to date. We can close the issue.

Jeffwan closed this as completed Aug 16, 2021