Summary
Add built-in retry handling to the State Manager so that ERRORED states are requeued with backoff until a maximum number of attempts is reached, in line with the documented lifecycle and fault-tolerance goals. ([Exosphere Docs]1)
Why
The architecture mentions retry mechanisms and error handling, but concrete behavior and knobs are not yet specified in the State Manager service. Implementing this at the service that manages state lifecycles keeps it consistent across runtimes and APIs. ([Exosphere Docs]1)
Scope
- On transition to ERRORED, if attempts < max_retries, schedule a retry and move the state back to QUEUED after backoff. Keep the existing lifecycle names. ([Exosphere Docs]1)
- Persist per-state counters and next attempt time so retries survive restarts.
- Defaults configurable at service level; optional per-graph override can follow later.
- Idempotency guard in runtimes by state id to avoid duplicate execution.
- Logs and basic metrics for attempts, successes after retry, and exhausted retries.
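A minimal sketch of the transition described above, to make the intended counting concrete. `StateRecord`, `backoff_delay`, `handle_error`, and the 2-second base and 300-second cap are hypothetical names and values chosen for illustration, not existing State Manager code; a real implementation would persist the record to the database rather than mutate it in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class StateRecord:
    """Hypothetical persisted fields for one state."""
    state_id: str
    status: str = "QUEUED"
    attempts: int = 0  # number of retries scheduled so far
    next_attempt_at: Optional[datetime] = None


def backoff_delay(attempt: int, base_seconds: float = 2.0,
                  cap_seconds: float = 300.0) -> timedelta:
    """Exponential backoff: base * 2^(attempt - 1), capped at cap_seconds."""
    return timedelta(seconds=min(cap_seconds, base_seconds * 2 ** (attempt - 1)))


def handle_error(record: StateRecord, max_retries: int,
                 now: datetime) -> StateRecord:
    """Requeue the state if retries remain; otherwise leave it ERRORED."""
    if record.attempts < max_retries:
        record.attempts += 1
        record.next_attempt_at = now + backoff_delay(record.attempts)
        record.status = "QUEUED"
    else:
        # Retries exhausted: the final outcome stays ERRORED.
        record.status = "ERRORED"
        record.next_attempt_at = None
    return record


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    record = StateRecord(state_id="example-state", status="ERRORED")
    handle_error(record, max_retries=3, now=now)
    print(record.status, record.attempts, record.next_attempt_at)
```

In this sketch `attempts` counts scheduled retries, so a state with `max_retries=3` runs at most four times in total (the initial run plus three retries).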
Config
- Users should be able to configure retries in the graph template, including the retry method; a default value applies when nothing is specified.
- Keep existing required envs unchanged. ([Exosphere Docs]2)
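One possible shape for this configuration, sketched under assumptions: the `retry_policy` key and the field names `max_retries`, `strategy`, and `base_delay_seconds` are hypothetical and not part of the current graph template schema. The sketch only shows how template values could override service-level defaults.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Mapping


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical retry settings; the defaults here are placeholders."""
    max_retries: int = 3
    strategy: str = "exponential"
    base_delay_seconds: float = 2.0


def resolve_retry_policy(graph_template: Mapping[str, Any],
                         service_default: RetryPolicy = RetryPolicy()) -> RetryPolicy:
    """Overlay the template's optional retry settings on the service default."""
    overrides = dict(graph_template.get("retry_policy") or {})
    allowed = {f.name for f in fields(RetryPolicy)}
    unknown = set(overrides) - allowed
    if unknown:
        raise ValueError(f"unknown retry_policy keys: {sorted(unknown)}")
    policy = replace(service_default, **overrides)
    if policy.max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    return policy


if __name__ == "__main__":
    template = {"retry_policy": {"max_retries": 5}}
    print(resolve_retry_policy(template))
```

A template without a `retry_policy` entry resolves to the service default unchanged, which keeps existing templates valid.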
Acceptance criteria
- Failing node is retried up to MAX_RETRIES with exponential backoff.
- Final outcome remains ERRORED when retries are exhausted; otherwise proceeds normally.
- Counters and next attempt timestamps are persisted and visible via API or logs.
- Unit and integration tests demonstrate requeue from ERRORED to QUEUED and success after retry. ([Exosphere Docs]1)
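The criteria above could be exercised with a small in-memory simulation like the following. It is a sketch, not the service's test suite: `run_with_retries` and `make_flaky` are hypothetical helpers, `SUCCESS` is an assumed name for the state reached after a successful run, and backoff delays are omitted so the test runs instantly.

```python
from typing import Callable, List


def run_with_retries(node: Callable[[], None], max_retries: int) -> List[str]:
    """Run a node, requeueing on failure, and return the status history."""
    history = ["QUEUED"]
    retries_used = 0
    while True:
        try:
            node()
        except Exception:
            history.append("ERRORED")
            if retries_used >= max_retries:
                return history  # retries exhausted, final status is ERRORED
            retries_used += 1
            history.append("QUEUED")  # requeued for another attempt
        else:
            history.append("SUCCESS")  # assumed name for the success state
            return history


def make_flaky(failures: int) -> Callable[[], None]:
    """Build a node that fails `failures` times and then succeeds."""
    remaining = [failures]

    def node() -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("simulated node failure")

    return node


if __name__ == "__main__":
    print(run_with_retries(make_flaky(2), max_retries=3))
    print(run_with_retries(make_flaky(10), max_retries=2))
```

The first call covers success after retry; the second covers the exhausted case, where the last recorded status remains ERRORED.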
Status
Done