Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for gracefully exiting when preemptive instance are being retracted. #9567

Closed
tchaton opened this issue Sep 16, 2021 · 0 comments
Assignees
Labels
feature Is an improvement or enhancement
Milestone

Comments

@tchaton
Copy link
Contributor

tchaton commented Sep 16, 2021

🚀 Feature

Right now, when the script is being killed, a fault tolerant checkpoint would be saved.
However, Lightning can't always ensure it is always fully happening in reproducible part of the codebase.

When running in the cloud, there is 2-3 min to kill the script.

The proposal is to add a mechanism to detect a killing signal as been sent and Lightning will terminate on the next reproducible part in this time windown.

Motivation

Pitch

Alternatives

Additional context


If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning

  • Bolts: Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch

  • Lightning Transformers: Flexible interface for high performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Is an improvement or enhancement
Projects
None yet
Development

No branches or pull requests

2 participants