Expose failed batches to users #1541

@vishalbollu

Description

When a batch fails to be processed, the failure counter for the Job is incremented but the batch is discarded.

Rather than discarding failed batches, persist them so that users can identify which batches failed and handle them, for example by retrying.

Suggestions

  • Cortex enqueues batches onto a queue (SQS) to distribute work across workers. Cortex should also create a dead letter queue to store failed batches. When a batch fails, the worker enqueues it onto the dead letter queue. If a job completes with failures, users can consume the dead letter queue to find out which batches failed and retry them afterwards.
  • As batches are placed onto the queue, Cortex persists each batch and its metadata to storage such as S3. When a batch succeeds or fails, the metadata for that batch is updated accordingly. After the job has completed, users can browse the batch metadata to find the failed batches and resubmit them.
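
A rough sketch of how the two suggestions could fit together, not Cortex's actual implementation. It uses in-memory stand-ins so it runs anywhere: `queue.Queue` plays the role of the SQS work queue and dead letter queue, and a plain dict plays the role of per-batch metadata in S3. The names `run_job`, `drain`, and `reject_negatives` are hypothetical and exist only for illustration.

```python
import json
import queue


def run_job(batches, handler):
    """Process batches; keep failed ones instead of discarding them."""
    work_queue = queue.Queue()          # stand-in for the SQS work queue
    dead_letter_queue = queue.Queue()   # stand-in for the dead letter queue
    metadata = {}                       # stand-in for batch metadata in S3

    # Enqueue every batch and record its initial metadata.
    for batch_id, items in enumerate(batches):
        work_queue.put(json.dumps({"batch_id": batch_id, "items": items}))
        metadata[batch_id] = {"status": "enqueued"}

    # A single worker loop; real workers would run concurrently.
    while not work_queue.empty():
        message = work_queue.get()
        payload = json.loads(message)
        batch_id = payload["batch_id"]
        try:
            handler(payload["items"])
            metadata[batch_id] = {"status": "succeeded"}
        except Exception as exc:
            # Suggestion 1: move the failed batch to the dead letter queue.
            dead_letter_queue.put(message)
            # Suggestion 2: mark the batch as failed in its metadata.
            metadata[batch_id] = {"status": "failed", "error": str(exc)}

    return dead_letter_queue, metadata


def drain(q):
    """Consume every message currently on a queue and decode it."""
    messages = []
    while True:
        try:
            messages.append(json.loads(q.get_nowait()))
        except queue.Empty:
            return messages


def reject_negatives(items):
    """Example handler that fails on any negative input."""
    if any(item < 0 for item in items):
        raise ValueError("negative input")


if __name__ == "__main__":
    dlq, meta = run_job([[1, 2], [-1], [3]], reject_negatives)
    failed = drain(dlq)
    print("failed batches:", failed)
    print("metadata:", meta)
```

After the job finishes, a user could either drain the dead letter queue or filter the metadata for entries whose status is `failed`, then resubmit only those batches as a new job.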

Labels: enhancement (New feature or request)