
PS still running after tfjob is complete #774

Closed

jlewi opened this issue Aug 9, 2018 · 26 comments
@jlewi
Contributor

jlewi commented Aug 9, 2018

Copied from kubeflow/kubeflow#1334:

Hi,
I have an issue where the PS pod keeps running after the TFJob is complete, even after several days.
`kubectl get pods` returns:

```
trainer-180804-004108-master-0   0/1   Completed   0   4d
trainer-180804-004108-ps-0       1/1   Running     0   4d
trainer-180804-004108-worker-0   0/1   Completed   0   4d
```

And `kubectl get tfjob` returns:

```
tfReplicaStatuses:
  MASTER:
    succeeded: 1
  PS:
    active: 1
  Worker:
    succeeded: 1
```

@jlewi
Contributor Author

jlewi commented Aug 9, 2018

@gaocegege
Member

@gaoning777

What is your clean pod policy set to?
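
For reference, here is a minimal Go sketch of the three cleanPodPolicy values and how a controller might act on them once a job finishes. The shouldDelete helper and the string phases are hypothetical illustrations, not the actual tf-operator code.

```go
package main

import "fmt"

// Illustrative sketch only (not the actual tf-operator code): the three
// cleanPodPolicy values and what each one implies once a TFJob finishes.
type CleanPodPolicy string

const (
	CleanPodPolicyAll     CleanPodPolicy = "All"     // delete every pod, including completed ones
	CleanPodPolicyRunning CleanPodPolicy = "Running" // delete only pods that are still running (e.g. a lingering PS)
	CleanPodPolicyNone    CleanPodPolicy = "None"    // keep all pods around
)

// shouldDelete is a hypothetical helper deciding whether a pod should be
// cleaned up for a given policy and pod phase.
func shouldDelete(policy CleanPodPolicy, podPhase string) bool {
	switch policy {
	case CleanPodPolicyAll:
		return true
	case CleanPodPolicyRunning:
		return podPhase == "Running"
	default: // CleanPodPolicyNone
		return false
	}
}

func main() {
	fmt.Println(shouldDelete(CleanPodPolicyRunning, "Running")) // true: a lingering PS pod is deleted
	fmt.Println(shouldDelete(CleanPodPolicyNone, "Running"))    // false: the PS pod keeps running
}
```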

@ankushagarwal

@gaoning777

What version of TFJob are you using? cleanPodPolicy might not work in earlier versions.

I used cleanPodPolicy in my TFJob spec and it cleaned up all pods as expected after the job completed. Here is my complete TFJob spec:

apiVersion: "kubeflow.org/v1alpha2"
kind: "TFJob"
metadata:
  name: "linear09"
spec:
  cleanPodPolicy: All
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/agwlkubeflow/linear-regression-estimator:latest
              command: ["python", "/train.py", "--model_dir=/mnt/kubeflow-gcfs/linear09"]
              volumeMounts:
              - mountPath: /mnt/kubeflow-gcfs
                name: kubeflow-gcfs
          volumes:
          - name: kubeflow-gcfs
            persistentVolumeClaim:
              claimName: kubeflow-gcfs
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/agwlkubeflow/linear-regression-estimator:latest
              command: ["python", "/train.py", "--model_dir=/mnt/kubeflow-gcfs/linear09"]
              volumeMounts:
              - mountPath: /mnt/kubeflow-gcfs
                name: kubeflow-gcfs
          volumes:
          - name: kubeflow-gcfs
            persistentVolumeClaim:
              claimName: kubeflow-gcfs
    Chief:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/agwlkubeflow/linear-regression-estimator:latest
              command: ["python", "/train.py", "--model_dir=/mnt/kubeflow-gcfs/linear09"]
              volumeMounts:
              - mountPath: /mnt/kubeflow-gcfs
                name: kubeflow-gcfs
          volumes:
          - name: kubeflow-gcfs
            persistentVolumeClaim:
              claimName: kubeflow-gcfs

@jlewi
Contributor Author

jlewi commented Aug 9, 2018

@ankushagarwal What's the default CleanPodPolicy? I think the default should be to delete running pods, as that is the most sensible behavior.

I thought that's what we were using as the default.
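
As an aside, here is a hypothetical Go sketch of the defaulting question, assuming an unset cleanPodPolicy falls back to Running as suggested above; this is an illustration, not the actual tf-operator defaulting code.

```go
package main

import "fmt"

// Hypothetical defaulting sketch (not the actual tf-operator code), assuming
// an unset cleanPodPolicy should fall back to deleting running pods.
type CleanPodPolicy string

const CleanPodPolicyRunning CleanPodPolicy = "Running"

type TFJobSpec struct {
	// nil means the user did not set cleanPodPolicy in the YAML.
	CleanPodPolicy *CleanPodPolicy
}

// setDefaultCleanPodPolicy fills in Running when the field is left empty, so
// pods that are still running (such as the PS) get deleted once the job completes.
func setDefaultCleanPodPolicy(spec *TFJobSpec) {
	if spec.CleanPodPolicy == nil {
		running := CleanPodPolicyRunning
		spec.CleanPodPolicy = &running
	}
}

func main() {
	spec := &TFJobSpec{}
	setDefaultCleanPodPolicy(spec)
	fmt.Println(*spec.CleanPodPolicy) // "Running"
}
```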

@gaoning777

gaoning777 commented Aug 9, 2018

I was using the default clean pod policy.
I tried `cleanPodPolicy: All` just now, but to no avail: the PS is still running after 1 hour.
I'm using Kubeflow v1alpha2.

@gaoning777

gaoning777 commented Aug 9, 2018

My YAML looks like this:

apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  generateName: trainer-
spec:
  cleanPodPolicy: All
  tfReplicaSpecs:
    MASTER:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - command:
            - python
            - -m
            - trainer.task
            image: ******
            name: tensorflow
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - command:
            - python
            - -m
            - trainer.task
            image: *****
            name: tensorflow
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - command:
            - python
            - -m
            - trainer.task
            image: *****
            name: tensorflow

@jlewi
Contributor Author

jlewi commented Aug 10, 2018

@gaoning777 What are the events for the TFJob? I'd like to know whether we tried to delete the pod.

There are instructions for dumping the events here:
https://www.kubeflow.org/docs/guides/monitoring/#default-stackdriver

You can ping us internally in the Kubeflow chat room with the relevant Stackdriver information.
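
If it helps, here is a hypothetical client-go sketch for listing the events attached to the TFJob. The kubeconfig path, namespace, and job name are placeholders, and newer client-go releases require the context argument shown here (older ones omit it).

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; adjust to your environment.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List events whose involved object is the TFJob, to see whether the
	// operator ever attempted to delete the PS pod.
	events, err := clientset.CoreV1().Events("default").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "involvedObject.name=trainer-180804-004108",
	})
	if err != nil {
		panic(err)
	}
	for _, e := range events.Items {
		fmt.Printf("%s\t%s\t%s\n", e.LastTimestamp, e.Reason, e.Message)
	}
}
```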

@jlewi
Contributor Author

jlewi commented Aug 10, 2018

#750 tracks the E2E test for CleanPodPolicy.

@jlewi
Contributor Author

jlewi commented Aug 10, 2018

@ChanYiLin
Member

ChanYiLin commented Aug 12, 2018

@jlewi @ankushagarwal @gaocegege
I think I found the root cause.
The reason is that in the last refactoring of the code (#767), we moved functions like GetPodsForJob, GetServicesForJob, DeletePod, and DeleteService under JobController rather than TFJobController.
We cannot call these functions via tc.XXX(), which is why this issue happened.
See #776
Thanks!

@ScorpioCPH
Member

@ChanYiLin I think #767 is OK, since we use an embedded field in the TFJobController struct.
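
For context, a minimal Go sketch of the method promotion being referred to, with hypothetical types rather than the actual tf-operator code: methods defined on an embedded JobController are promoted onto TFJobController, so they can still be called as tc.DeletePod(...).

```go
package main

import "fmt"

// JobController stands in for the generic controller that owns helpers such
// as GetPodsForJob, DeletePod, and DeleteService after the #767 refactoring.
type JobController struct{}

func (jc *JobController) DeletePod(name string) error {
	fmt.Println("deleting pod", name)
	return nil
}

// TFJobController embeds JobController, so JobController's methods are
// promoted and remain callable through a *TFJobController value.
type TFJobController struct {
	JobController
}

func main() {
	tc := &TFJobController{}
	_ = tc.DeletePod("trainer-180804-004108-ps-0") // resolves to the embedded JobController method
}
```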

@ChanYiLin
Member

Yes, it's my fault. I just tested it and found the problem is not there...
Sorry, guys.

@jlewi
Contributor Author

jlewi commented Aug 13, 2018

Can anyone reproduce the problem with the pods not being deleted?

@ChanYiLin
Member

I've tested it on GCP with KUBEFLOW_VERSION=0.2.2 and tf-operator v1alpha2.
Kubeflow killed all the pods after the job succeeded, as expected.

The YAML file I used is as follows:

apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  name: test
spec:
  cleanPodPolicy: All
  tfReplicaSpecs:
    MASTER:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - command:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=8
            - --model=resnet50
            - --data_format=NHWC
            - --device=cpu
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --num_batches=10
            - --sync_on_finish=false
            - --cross_replica_sync=false
            - --num_warmup_batches=0
            image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - command:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=8
            - --model=resnet50
            - --data_format=NHWC
            - --device=cpu
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --num_batches=10
            - --sync_on_finish=false
            - --cross_replica_sync=false
            - --num_warmup_batches=0
            image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - command:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=8
            - --model=resnet50
            - --data_format=NHWC
            - --device=cpu
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --num_batches=10
            - --sync_on_finish=false
            - --cross_replica_sync=false
            - --num_warmup_batches=0
            image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks

@gaoning777

Where can I find the new version? The most recent v0.2.2 release (https://github.com/kubeflow/kubeflow/releases) was published on July 12th.

@jlewi
Contributor Author

jlewi commented Aug 14, 2018

@gaoning777 What is the docker image for tf-operator that you are using?

@ashahba
Member

ashahba commented Aug 15, 2018

@jlewi I also put some comments here: tensorflow/tensorflow#20833, but it may well apply to tf-operator.
Strangely enough, with TensorFlow 1.8.0 the TFJob is marked as Succeeded, but with TensorFlow 1.9.0 it remains Running indefinitely. However, even when using 1.8.0 I see the following in the kubetail logs for all Succeeded jobs 🤔

[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-1-wgdla] INFO:root:Session from worker 1 closed cleanly 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] INFO:tensorflow:Coordinator stopped with threads still running: QueueRunnerThread-dummy_queue-sync_token_q_EnqueueMany 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] Exception in thread QueueRunnerThread-dummy_queue-sync_token_q_EnqueueMany: 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] Traceback (most recent call last): 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] self.run() 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] File "/usr/lib/python3.5/threading.py", line 862, in run 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] self._target(*self._args, **self._kwargs) 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 268, in _run 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] coord.request_stop(e) 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 213, in request_stop 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] six.reraise(*sys.exc_info()) 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] raise value 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] enqueue_callable() 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1244, in _single_operation_run 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] self._call_tf_sessionrun(None, {}, [], target_list, None) 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] run_metadata) 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] tensorflow.python.framework.errors_impl.CancelledError: Step was cancelled by an explicit call to `Session::Close()`. 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy]  
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] INFO:root:Finished on task 0 in 262.2973277568817 seconds 
[my-tf-dist-8a0d7766-d562-4e19-827f-00c62-worker-vm6i-0-emdpy] INFO:root:Session from worker 0 closed cleanly 

@richardsliu
Contributor

This can be consistently reproduced when the tf-operator is using the tf_operator:v0.2.0 image. The latest tf-operator image (I was using tf_operator:v20180809-d2509aa) does not have this issue.

@richardsliu
Contributor

@gaoning777

This should be fixed in 0.2.3. If you have a dependency on Kubeflow 0.2.2, you can fix this by doing something like:

  1. export KUBEFLOW_DEPLOY=false
  2. Run deploy.sh
  3. ks param set tf-job-operator tfJobImage gcr.io/kubeflow-images-public/tf_operator:v20180809-d2509aa
  4. export KUBEFLOW_DEPLOY=true
  5. Run deploy.sh again

@gaoning777

Thanks, Richard, for looking into this.
Ideally, the deploy script would let the user specify the tf-operator version, so that the two projects are clearly decoupled. That way, users would not have to wait for a new Kubeflow release to upgrade their tf-operator version, given the assumption that we want one-click deployment.

@jlewi
Contributor Author

jlewi commented Aug 20, 2018

@gaoning777 You can customize your ksonnet app if you want to override the image. We don't want to plumb through more options to deploy.sh. Instead the pattern is to create the ksonnet application, let the user customize it, and then deploy.

@jlewi
Contributor Author

jlewi commented Aug 20, 2018

@richardsliu Can we:

  1. Update the TFJob operator image on master
  2. Add an E2E test to verify that pods are being deleted correctly

@richardsliu
Contributor

The TFJob operator image on master is already pointing to the latest image. I believe @gaoning777 is depending on the 0.2.2 release.

@richardsliu
Contributor

@richardsliu
Contributor

Closing this since the issue is fixed. Will send out a separate PR for the E2E test.
