Jobs failing when a node is preempted #999
On Google Kubernetes Engine, I am finding that TFJobs fail when a node running a worker is preempted.
I have set restartPolicy: OnFailure for the workers, evaluator and chief. The tf-operator deployment is in a node pool with nodes that cannot be preempted.
It looks like some of the pods got restarted around the time of the preemption, but finally the job was stopped with the following status:
Is there something that needs to be done to make TFJobs handle preempted nodes?
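For reference, a minimal sketch of the kind of spec being discussed, submitted with the kubernetes Python client. It assumes the kubeflow.org/v1 TFJob API; the job name, image, and namespace are placeholders, not the reporter's actual values, and the Evaluator/PS blocks would follow the same pattern as Worker.

```python
# Sketch of a TFJob spec with restartPolicy: OnFailure, submitted via
# the kubernetes Python client. Assumes the kubeflow.org/v1 TFJob API;
# the image, name, and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()


def replica(count):
    return {
        "replicas": count,
        "restartPolicy": "OnFailure",  # the setting used in this issue
        "template": {
            "spec": {
                "containers": [{
                    "name": "tensorflow",  # TFJob requires this container name
                    "image": "my-training-image:latest",  # placeholder
                }]
            }
        },
    }


tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "example-tfjob", "namespace": "default"},
    "spec": {"tfReplicaSpecs": {"Chief": replica(1), "Worker": replica(2)}},
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="default",
    plural="tfjobs", body=tfjob,
)
```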
I'd appreciate any help with this. I saw one error on a worker that was preempted, saying it did not have enough memory. Note that when there is a failure in the code itself, e.g. an exception in the training script, the worker restarts as expected.
Can you provide logs of the worker and controller? Is this similar to #366?
Any tips on how to recreate this, so I can monitor everything as it happens? If I delete the VM instance, then everything works correctly: the workers on that node stop, a new node is scheduled, and they start back up again. Could something different happen in a real preemption? (Are the GPUs sometimes detached from preemptible instances while the instances themselves keep running?)
Aha, it just happened. On the dashboard I see the job has failed.
The worker's logs do not have any error message; they are just cut off abruptly. The chief has the error message that appears when one of the other workers or parameter servers goes down.
Here are the relevant logs from tf-job-operator at the time of the failure:
Thanks for any help!
I wonder if this was the individual GPU running out of memory and failing (I was trying to push its limits). Previously I had seen an error message about running out of memory, which I didn't see here.
@matthen If the resources are not sufficient, the pod cannot be scheduled and it may fail. But we do not show the real reason in the TFJob, since we think it can be seen on the pods. I am not sure whether we need to surface that info in the TFJob; there are so many problems that can fail a pod.
The pod is killed and disappears, so I think it would be useful to keep its last status somewhere visible.
I am not sure how to implement it. Should we aggregate the statuses of all PS/workers into the TFJob status?
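As a rough illustration of that aggregation idea (not the operator's actual logic, which is written in Go): poll the replica pods' phases and fold them into one job-level summary. The tf-job-name label selector is an assumption about how the operator labels pods, and this can only see pods that still exist, which is exactly the gap discussed above.

```python
# Rough sketch: summarize the phases of a TFJob's replica pods.
# The label selector is an assumption; pods that have already been
# garbage-collected are invisible here.
from collections import Counter

from kubernetes import client, config


def summarize_tfjob_pods(namespace: str, job_name: str) -> dict:
    config.load_kube_config()
    pods = client.CoreV1Api().list_namespaced_pod(
        namespace, label_selector=f"tf-job-name={job_name}"
    )
    phases = Counter(pod.status.phase for pod in pods.items)
    return {
        "active": phases.get("Pending", 0) + phases.get("Running", 0),
        "succeeded": phases.get("Succeeded", 0),
        "failed": phases.get("Failed", 0),
    }


print(summarize_tfjob_pods("default", "example-tfjob"))
```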
@matthen When a GKE GPU node is preempted, it is recreated. However, if the node is recreated shortly after being preempted (before its pods have been evicted, which takes ~5 min), then upon node startup the preexisting pods will still be running on the node, potentially before all the system pods have finished setting up.
Has anyone found a solution or workaround? The issue persists in GKE version 1.16.0, which is supposed to include the commits from the PRs mentioned above by @chardch.
/area engprod
Is there any progress on this issue? Without a fix, distributed training using Kubeflow and preemptible GPU nodes is impossible. Doesn't Kubeflow claim to support both?
+1. We have had to switch to non-preemptible nodes for TFJobs to work.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Are there any fixes for this, or can we not use Kubeflow with preemptible GPUs? @jtfogarty @jbottum @jlewi
I think I have fixed this issue while refactoring the whole project. @gaocegege @Jeffwan, is there a way to let people use the latest tf-operator image?
(I ended up switching to regular k8s Services + Jobs, and adding logic in the workers and parameter servers themselves to make sure they exit successfully at the end of training; a sketch of that idea follows.)
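In case it helps others, here is a minimal sketch of that workaround, assuming TF1-style distributed training configured via TF_CONFIG. Parameter servers normally block forever in server.join(); here each PS instead polls for a "done" marker that the chief writes to shared storage when training finishes, then exits 0 so its k8s Job counts as succeeded. The marker path, role names, and run_training() are placeholders, not the actual code from this thread.

```python
# Sketch: make PS replicas exit cleanly when training completes,
# instead of blocking forever in server.join(). Marker path and
# run_training() are hypothetical.
import json
import os
import sys
import time

import tensorflow as tf

DONE_MARKER = "/mnt/shared/train_done"  # hypothetical shared volume


def run_training():
    pass  # stand-in for the real training loop


tf_config = json.loads(os.environ["TF_CONFIG"])
cluster = tf.train.ClusterSpec(tf_config["cluster"])
task = tf_config["task"]

if task["type"] == "ps":
    tf.distribute.Server(cluster, job_name="ps", task_index=task["index"])
    # Instead of server.join(), wait until the chief reports completion.
    while not os.path.exists(DONE_MARKER):
        time.sleep(30)
    sys.exit(0)  # exit 0 so the k8s Job is counted as succeeded
else:
    run_training()
    if task["type"] == "chief":  # "master" in older TFJob versions
        open(DONE_MARKER, "w").close()  # signal the parameter servers
```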
@ChanYiLin The release infra and test-infra are currently blocked. We will have to wait for a while or do a manual release.
Is there any progress on this issue?