Expose failed batches to users #1541

Closed
vishalbollu opened this issue Nov 6, 2020 · 0 comments
Labels
enhancement New feature or request
Comments

@vishalbollu
Contributor

Description

When a batch fails to be processed, the failure counter for the Job is incremented but the batch itself is discarded, so users cannot tell which batches failed.

Rather than discarding failed batches, persist them so that users can identify which batches failed and handle them, for example by retrying.

Suggestions

  • Cortex enqueues batches onto a queue (SQS) to distribute work across workers. Cortex should also create a dead letter queue to store failed batches. When a batch fails, the worker enqueues it onto the dead letter queue. If a job completes with failures, users can consume the dead letter queue to find out which batches failed and retry them afterwards.
  • As batches are placed onto the queue, Cortex persists each batch and its metadata to storage such as S3. When a batch succeeds or fails, the metadata for that batch is updated accordingly. After the job has completed, users can browse the batch metadata to find the failed batches and resubmit them.
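
To make the two suggestions concrete, here is a minimal sketch of how they could behave. It is not Cortex code: the `Job` class, the function names, and the `reject_negatives` handler are all hypothetical, and plain in-memory structures stand in for the SQS work queue, the dead letter queue, and the batch metadata that would live in S3.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List


@dataclass
class Job:
    """In-memory stand-in for a batch job; not Cortex's actual data model."""
    queue: Deque[dict] = field(default_factory=deque)         # stands in for the SQS work queue
    dead_letters: Deque[dict] = field(default_factory=deque)  # stands in for a dead letter queue
    status: Dict[int, str] = field(default_factory=dict)      # stands in for batch metadata in S3
    failed_count: int = 0                                     # the existing failure counter


def submit(job: Job, batches: List[list]) -> None:
    """Enqueue each batch and record its metadata as pending (second suggestion)."""
    for batch_id, items in enumerate(batches):
        job.queue.append({"id": batch_id, "items": items})
        job.status[batch_id] = "pending"


def run_worker(job: Job, handler: Callable[[list], None]) -> None:
    """Drain the work queue, keeping failed batches instead of discarding them."""
    while job.queue:
        batch = job.queue.popleft()
        try:
            handler(batch["items"])
            job.status[batch["id"]] = "succeeded"
        except Exception as err:
            job.failed_count += 1
            job.status[batch["id"]] = "failed"
            # First suggestion: park the failed batch on the dead letter queue.
            job.dead_letters.append({**batch, "error": str(err)})


def failed_batch_ids(job: Job) -> List[int]:
    """Second suggestion: find failed batches by browsing the metadata."""
    return sorted(bid for bid, state in job.status.items() if state == "failed")


def retry_failed(job: Job) -> None:
    """First suggestion: move dead-lettered batches back onto the work queue."""
    while job.dead_letters:
        batch = job.dead_letters.popleft()
        job.status[batch["id"]] = "pending"
        job.queue.append({"id": batch["id"], "items": batch["items"]})


def reject_negatives(items: list) -> None:
    """Hypothetical handler that fails on any negative number."""
    if any(x < 0 for x in items):
        raise ValueError("negative value in batch")


job = Job()
submit(job, [[1, 2], [3, -4], [5, 6]])
run_worker(job, reject_negatives)
print(failed_batch_ids(job))   # ids of batches recorded as failed
print(len(job.dead_letters))   # number of batches available for retry
```

In this sketch the dead letter queue keeps the full batch payload so it can be retried directly, while the metadata map only answers the question of which batches failed. A real implementation would need to decide whether the payload lives in the queue message, in storage, or both.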