Automatically find failed connection manager workflows and restart them #14043

lmossman · 2022-06-23T05:09:07Z

What

With the plan to automatically set connection manager workflows to failed when they encounter a NonDeterministicException (as described in the parent epic and in this issue), we need something that will automatically repair these failed workflows.

The goal of this issue is to implement a background process that runs as part of the platform, which finds failed connection manager workflows and terminates + restarts them automatically.

How

This can be implemented as a cron job in a standalone pod, or as a background process/thread in an existing pod. Micronaut's scheduled tasks may be a good candidate here, since we would not need to manage the scheduling ourselves.

This implementation should utilize the TemporalClient method introduced in #14970 to perform the find+restart logic.

This process should emit a metric whenever it restarts a workflow, similar to those established in #13773.
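
To make the shape of this concrete, here is a rough sketch of the periodic find+restart task. It only illustrates the approach and is not the actual implementation. `WorkflowRestarter` is a hypothetical stand-in for the TemporalClient method from #14970 (its real name and signature may differ), and `MetricSink` and the metric name are made up for illustration. It uses a plain `ScheduledExecutorService` so that it runs without extra dependencies; with Micronaut, the `run()` body would presumably live in a method annotated with `@Scheduled` instead.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the TemporalClient method introduced in #14970.
interface WorkflowRestarter {
  // Finds failed connection manager workflows, terminates and restarts them,
  // and returns how many were restarted.
  int restartFailedConnectionManagerWorkflows();
}

// Hypothetical metric sink; the real metric client from #13773 may look different.
interface MetricSink {
  void count(String metricName, long value);
}

final class FailedWorkflowRepairTask implements Runnable {
  // Assumed metric name, for illustration only.
  static final String RESTART_METRIC = "connection_manager_workflows_restarted";

  private final WorkflowRestarter restarter;
  private final MetricSink metrics;

  FailedWorkflowRepairTask(WorkflowRestarter restarter, MetricSink metrics) {
    this.restarter = restarter;
    this.metrics = metrics;
  }

  @Override
  public void run() {
    try {
      int restarted = restarter.restartFailedConnectionManagerWorkflows();
      if (restarted > 0) {
        // Emit a metric only when at least one workflow was restarted.
        metrics.count(RESTART_METRIC, restarted);
      }
    } catch (RuntimeException e) {
      // An uncaught exception would cancel all future runs of a scheduled task,
      // so log it and wait for the next run instead.
      System.err.println("Repair run failed: " + e.getMessage());
    }
  }

  // Runs the task immediately, then again periodSeconds after each run completes.
  static ScheduledExecutorService schedule(Runnable task, long periodSeconds) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleWithFixedDelay(task, 0, periodSeconds, TimeUnit.SECONDS);
    return executor;
  }
}
```

A fixed delay (rather than a fixed rate) is used here so that a slow repair run cannot overlap with the next one; whether that is the right trade-off depends on how long the find+restart call takes in practice.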

lmossman · 2022-07-05T18:30:57Z

If going with the Micronaut approach, we probably want to create a new service (i.e. a new kube pod) so that this ticket does not also include the work to convert an existing service to Micronaut.

lmossman · 2022-07-22T22:00:01Z

Moving this one back to the backlog, since I have pulled some of the work here out into the other tickets linked in the description, e.g. #14970, which should be completed before this ticket is taken on.
FYI @benmoriceau

evantahler · 2022-08-03T20:50:18Z

This story is pending the Micronaut scheduler for the "automatic" part.

lmossman added the team/platform-move label Jun 23, 2022

lmossman mentioned this issue Jun 23, 2022

Automatically handle non-deterministic temporal errors #13973

Closed

cgardens mentioned this issue Jun 30, 2022

Automatically repair terminated connection manager workflows #12746

Closed

benmoriceau assigned lmossman Jul 6, 2022

evantahler assigned benmoriceau Jul 20, 2022

lmossman changed the title ~~Automatically find terminated/failed connection manager workflows and restart them~~ Automatically find failed connection manager workflows and restart them Jul 22, 2022

lmossman removed their assignment Jul 22, 2022

lmossman mentioned this issue Aug 8, 2022

Fail connection manager workflow on non-deterministic exception #14758

Merged

evantahler mentioned this issue Aug 11, 2022

Create a self-healing scheduler services (x2) #15218

Closed

evantahler closed this as completed Aug 18, 2022

gosusnp mentioned this issue Sep 13, 2022

Update the worker/job documentation. Add documentation for container orchestrator. #16575

Merged

37 tasks
