Spot instance worker recovery #1539

Closed
lapaniku opened this issue Nov 6, 2020 · 2 comments
Labels
question Further information is requested

Comments

@lapaniku
Contributor

lapaniku commented Nov 6, 2020

If I have a spot instance (or regular on-demand) worker and it fails during batch processing, how can I ensure that the batch is processed again?
Or how can I get information about the batches that failed, so that I can start processing them again?

@lapaniku lapaniku added the question Further information is requested label Nov 6, 2020
@vishalbollu
Contributor

The Batch API currently only keeps track of the status of the overall job and does not monitor the status of each individual batch in the job. Failed batches are currently discarded, making it difficult to perform retries at the batch level.

I have created these two tickets #1540 and #1541 to address these issues.

Support for Batch is a recent addition to Cortex, so there is a lot of room for improvement. I would be happy to jump on a call to discuss workarounds for these issues and other potential improvements that can be made to Cortex. You can reach me at vishal@cortexlabs.com if you are interested.
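
Until batch-level tracking exists, one possible client-side workaround is to keep your own record of which batches finished and resubmit the rest. The sketch below is only an illustration of that idea and is not part of the Cortex API: `split_into_batches`, `find_unprocessed`, `run_with_retries`, and the `submit` and `load_completed_ids` callables are all hypothetical names. It assumes that each worker writes a completion marker (for example, a batch ID in a database or object store) after it finishes a batch, and that `submit` blocks until the job has ended.

```python
from typing import Callable, Dict, Iterable, Sequence


def split_into_batches(items: Sequence, batch_size: int) -> Dict[int, list]:
    """Give every batch a stable integer ID so it can be resubmitted later."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return {
        index: list(items[start:start + batch_size])
        for index, start in enumerate(range(0, len(items), batch_size))
    }


def find_unprocessed(
    batches: Dict[int, list], completed_ids: Iterable[int]
) -> Dict[int, list]:
    """Return the batches whose IDs are missing from the completion record."""
    done = set(completed_ids)
    return {bid: batch for bid, batch in batches.items() if bid not in done}


def run_with_retries(
    batches: Dict[int, list],
    submit: Callable[[Dict[int, list]], None],
    load_completed_ids: Callable[[], Iterable[int]],
    max_rounds: int = 3,
) -> Dict[int, list]:
    """Submit all batches, then resubmit only the missing ones.

    `submit` (hypothetical) sends the given batches as one job and returns
    once that job has ended. `load_completed_ids` (hypothetical) reads the
    IDs that workers recorded after finishing a batch.
    Returns the batches still unprocessed after `max_rounds` attempts.
    """
    pending = dict(batches)
    for _ in range(max_rounds):
        if not pending:
            break
        submit(pending)
        pending = find_unprocessed(pending, load_completed_ids())
    return pending
```

Because a worker that is terminated mid-batch may already have handled part of that batch, this approach is only safe if processing a batch twice is harmless, i.e. the per-batch work is idempotent.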

@deliahu
Member

deliahu commented Nov 26, 2020

I'll go ahead and close this, since #1540 and #1541 have been created. Feel free to reach out if you have any other questions!

@deliahu deliahu closed this as completed Nov 26, 2020