Add support for retries in StateManager #181

@NiveditJain

Summary

Add built-in retry handling to the State Manager so that ERRORED states are requeued with backoff until a maximum number of attempts is reached, in line with the documented lifecycle and fault-tolerance goals. ([Exosphere Docs]1)

Why

The architecture mentions retry mechanisms and error handling, but concrete behavior and knobs are not yet specified in the State Manager service. Implementing this at the service that manages state lifecycles keeps it consistent across runtimes and APIs. ([Exosphere Docs]1)

Scope

  • On transition to ERRORED, if attempts < max_retries, schedule a retry and move the state back to QUEUED after backoff. Keep the existing lifecycle names. ([Exosphere Docs]1)
  • Persist per-state counters and next attempt time so retries survive restarts.
  • Defaults configurable at service level; optional per-graph override can follow later.
  • Idempotency guard in runtimes by state id to avoid duplicate execution.
  • Logs and basic metrics for attempts, successes after retry, and exhausted retries.
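
As a rough illustration of the first two scope items, the ERRORED transition could be sketched as below. This is not the State Manager's actual code: `StateRecord`, `handle_error`, `is_due`, and the field names `attempts` and `next_attempt_at` are hypothetical, and the backoff delay is passed in rather than computed here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

QUEUED = "QUEUED"
ERRORED = "ERRORED"


@dataclass
class StateRecord:
    """Hypothetical persisted fields; the real State Manager schema may differ."""
    state_id: str
    status: str
    attempts: int = 0                          # retries scheduled so far
    next_attempt_at: Optional[datetime] = None  # when the next retry may run


def handle_error(state: StateRecord, max_retries: int,
                 delay: timedelta, now: datetime) -> StateRecord:
    """Apply the ERRORED transition: requeue with a delay, or stay ERRORED."""
    if state.attempts < max_retries:
        # Retry budget left: count the retry and move back to QUEUED.
        state.attempts += 1
        state.status = QUEUED
        state.next_attempt_at = now + delay
    else:
        # Retries exhausted: ERRORED is the final outcome.
        state.status = ERRORED
        state.next_attempt_at = None
    return state


def is_due(state: StateRecord, now: datetime) -> bool:
    """A queued state is runnable once its backoff window has passed."""
    return state.status == QUEUED and (
        state.next_attempt_at is None or state.next_attempt_at <= now
    )
```

Because the counter and the next attempt time live on the record itself, writing the record to the store after `handle_error` would be enough for a restarted service to resume the schedule by polling with `is_due`.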

Config

  • Users should be able to configure retries in the graph template: the retry limit and the retry method, each with a default value.
  • Keep existing required envs unchanged. ([Exosphere Docs]2)
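
One possible shape for this config, sketched under assumptions: the `retry_policy` key, the field names, the strategy names, and the default values below are all placeholders, not an agreed schema. The idea is that service-level defaults apply unless the graph template overrides them.

```python
from dataclasses import dataclass, replace
from typing import Any, Mapping

# Placeholder strategy names; the issue only asks for "a method of retry".
ALLOWED_STRATEGIES = {"fixed", "linear", "exponential"}
ALLOWED_KEYS = {"max_retries", "strategy", "base_delay_seconds"}


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical retry settings; defaults here are illustrative only."""
    max_retries: int = 3
    strategy: str = "exponential"
    base_delay_seconds: float = 1.0


def resolve_retry_policy(service_default: RetryPolicy,
                         graph_template: Mapping[str, Any]) -> RetryPolicy:
    """Overlay the template's optional `retry_policy` on the service default."""
    overrides = dict(graph_template.get("retry_policy") or {})
    unknown = set(overrides) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown retry_policy keys: {sorted(unknown)}")
    policy = replace(service_default, **overrides)
    if policy.max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    if policy.strategy not in ALLOWED_STRATEGIES:
        raise ValueError(f"unsupported strategy: {policy.strategy!r}")
    if policy.base_delay_seconds < 0:
        raise ValueError("base_delay_seconds must be >= 0")
    return policy
```

A template without a `retry_policy` block would simply inherit the service default, which keeps existing templates and required envs untouched.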

Acceptance criteria

  • Failing node is retried up to MAX_RETRIES with exponential backoff.
  • Final outcome remains ERRORED when retries are exhausted; otherwise proceeds normally.
  • Counters and next attempt timestamps are persisted and visible via API or logs.
  • Unit and integration tests demonstrate requeue from ERRORED to QUEUED and success after retry. ([Exosphere Docs]1)
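
The criteria above could be exercised with a small in-memory simulation like the following. It is only a sketch: `backoff_seconds`, `run_with_retries`, and `make_flaky` are hypothetical helpers, the `"SUCCESS"` label is a placeholder for whatever the normal lifecycle status is, and delays are recorded rather than slept so the check runs instantly.

```python
from typing import Callable, List, Tuple


def backoff_seconds(retry_number: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2**(retry_number - 1), capped at `cap`."""
    return min(cap, base * 2 ** (retry_number - 1))


def run_with_retries(node: Callable[[], None],
                     max_retries: int) -> Tuple[str, List[float]]:
    """Run `node` until it succeeds or the retry budget is exhausted.

    Returns the final status and the backoff delays that would be waited.
    """
    delays: List[float] = []
    retries = 0
    while True:
        try:
            node()
            return "SUCCESS", delays        # placeholder status name
        except Exception:
            if retries >= max_retries:
                return "ERRORED", delays    # retries exhausted
            retries += 1
            # ERRORED -> QUEUED: record the delay before the next attempt.
            delays.append(backoff_seconds(retries))


def make_flaky(failures: int) -> Callable[[], None]:
    """Build a node that raises `failures` times, then succeeds."""
    remaining = [failures]

    def node() -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("transient failure")

    return node


# Succeeds on the third execution, after two retries.
print(run_with_retries(make_flaky(2), max_retries=3))
# Never succeeds: stays ERRORED after three retries.
print(run_with_retries(make_flaky(10), max_retries=3))
```

An integration test against the real service would additionally need to check that the persisted counters and next attempt timestamps match the recorded delays.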
