Passing SIGTERM to entrypoint to be able to handle SPOT failures gracefully in user-code #173

Open
croth1 opened this issue Jan 31, 2023 · 0 comments

croth1 commented Jan 31, 2023

Describe the feature you'd like
I would love to be able to make use of the SIGTERM handling provided by modern ML frameworks such as pytorch_lightning. If I understand correctly, when a spot failure is announced, the container receives a SIGTERM and has 120 seconds before it is forcefully terminated. I would like the signal to be passed down to the entry point so that the SIGTERM handling callbacks provided by those frameworks can be used.

How would this feature be used? Please describe.
The 120 seconds can be used for writing out a checkpoint and gracefully terminating the experiment when using an experiment tracker.
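
To illustrate the intended use, here is a minimal sketch of what user code could do once SIGTERM actually reaches the entry point. It uses only the Python standard library rather than pytorch_lightning, and the names `GracefulStop`, `train`, and `step_fn` are hypothetical placeholders, not part of the toolkit or of any framework.

```python
import json
import signal


class GracefulStop:
    """Remembers that SIGTERM arrived so the training loop can stop cleanly."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Must be called from the main thread of the training process.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; the heavy work happens in the training loop.
        self.requested = True


def train(num_steps, checkpoint_path, stop, step_fn):
    """Run step_fn until done or until a stop is requested, then checkpoint.

    step_fn stands in for one real training step (hypothetical placeholder).
    """
    last_step = 0
    for step in range(1, num_steps + 1):
        step_fn(step)
        last_step = step
        if stop.requested:
            break
    # Write the checkpoint within the grace period, before the forced kill.
    with open(checkpoint_path, "w") as f:
        json.dump({"step": last_step}, f)
    return last_step
```

With a framework that ships its own SIGTERM callbacks, the flag and the checkpoint write would be handled by the framework. The only requirement on the toolkit side is that the signal arrives in the training process.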

Describe alternatives you've considered
One can simply not use the last 120 seconds, restart from the last checkpoint written out by the model, and accept that spot instance failures are marked as "failed" experiments in the MLflow experiment tracker.

Additional context
While getting to the bottom of this problem, I created a small proof-of-concept of the changes that would be necessary to make it work in my specific case, i.e. being able to handle SIGTERMs in a shell script entry point (which could then be passed down to a pytorch_lightning training script). See here for an example: https://github.com/croth1/sagemaker-toolkit-sigterm-handling

Only a few changes are necessary to make this work in my specific case, see: https://github.com/aws/sagemaker-training-toolkit/compare/master...croth1:sigterm_forwarding?expand=1. However, this is just a proof-of-concept: there are many paths in the code base that eventually lead to entrypoint execution, and this fixes only the one I used.
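
For readers who do not want to open the diff, the general idea of forwarding can be sketched as below. This is not the code from the linked branch and not the toolkit's actual process runner. It is a generic illustration using the standard `signal` and `subprocess` modules, and `run_with_sigterm_forwarding` is a hypothetical name.

```python
import signal
import subprocess


def run_with_sigterm_forwarding(cmd):
    """Run cmd as a child process, relay SIGTERM to it, return its exit code."""
    child = None

    def forward(signum, frame):
        # Relay the signal only if the child exists and is still running.
        if child is not None and child.poll() is None:
            child.send_signal(signum)

    # Install the handler before spawning so an early SIGTERM does not
    # kill the parent with the default action.
    previous = signal.signal(signal.SIGTERM, forward)
    try:
        child = subprocess.Popen(cmd)
        # wait() resumes after the handler has run and returns once the
        # child has finished its own cleanup and exited.
        return child.wait()
    finally:
        # Restore whatever handler was active before.
        signal.signal(signal.SIGTERM, previous)
```

The parent deliberately keeps waiting after forwarding, so the entry point gets the full grace period to write its checkpoint and its exit code is still reported. If the entry point is a shell script, that script in turn has to pass the signal on (for example by using `exec` for the training command), otherwise the signal stops at the shell.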

I hope this is of interest :)
