Automatically find failed connection manager workflows and restart them #14043

Closed
lmossman opened this issue Jun 23, 2022 · 3 comments

@lmossman
Contributor

lmossman commented Jun 23, 2022

What

With the plan to automatically set connection manager workflows to failed when they encounter a NonDeterministicException (as described in the parent epic and in this issue), we need something that will automatically repair these failed workflows.

The goal of this issue is to implement a background process that runs as part of the platform, which finds failed connection manager workflows and terminates + restarts them automatically.

How

This can be implemented as a cron in a standalone pod, or a background process/thread in an existing pod. Micronaut's scheduled tasks may be a good candidate here, so that we do not need to manage the scheduling ourselves.

This implementation should utilize the TemporalClient method introduced in #14970 to perform the find+restart logic.

This process should emit a metric whenever it restarts a workflow, similar to those established in #13773.
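
A minimal sketch of the shape this could take, assuming the find+restart method from #14970 returns the IDs of the connections whose workflows it restarted. `WorkflowRestarter`, `MetricSink`, and the metric name are hypothetical stand-ins, not the actual `TemporalClient` or metrics APIs. A plain `ScheduledExecutorService` is used here only to keep the example self-contained; with Micronaut, `runOnce` would presumably be annotated as a scheduled task instead.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class FailedWorkflowRepairSketch {

    /** Hypothetical stand-in for the find+restart method introduced in #14970. */
    interface WorkflowRestarter {
        /** Terminates and restarts failed workflows; returns the affected connection IDs. */
        Set<UUID> restartFailedConnectionManagerWorkflows();
    }

    /** Hypothetical stand-in for the platform's metric client. */
    interface MetricSink {
        void count(String metricName, long value);
    }

    // Assumed metric name, for illustration only.
    static final String RESTART_METRIC = "connection_manager_workflows_restarted";

    private final WorkflowRestarter restarter;
    private final MetricSink metrics;

    FailedWorkflowRepairSketch(WorkflowRestarter restarter, MetricSink metrics) {
        this.restarter = restarter;
        this.metrics = metrics;
    }

    /** Runs one repair pass and returns how many workflows were restarted. */
    int runOnce() {
        try {
            Set<UUID> restarted = restarter.restartFailedConnectionManagerWorkflows();
            if (!restarted.isEmpty()) {
                // One metric increment per restarted workflow.
                metrics.count(RESTART_METRIC, restarted.size());
            }
            return restarted.size();
        } catch (RuntimeException e) {
            // Swallow the error so one failed pass does not cancel later scheduled runs.
            System.err.println("Repair pass failed: " + e.getMessage());
            return 0;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Fake collaborators so the sketch can run without Temporal.
        WorkflowRestarter fake = () -> Set.of(UUID.randomUUID());
        MetricSink printer = (name, value) -> System.out.println(name + " += " + value);
        FailedWorkflowRepairSketch repair = new FailedWorkflowRepairSketch(fake, printer);

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(repair::runOnce, 0, 100, TimeUnit.MILLISECONDS);
        Thread.sleep(350);
        scheduler.shutdownNow();
    }
}
```

Catching the exception inside `runOnce` matters with `scheduleWithFixedDelay`, since an uncaught exception suppresses all subsequent executions of the task.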

@lmossman
Contributor Author

lmossman commented Jul 5, 2022

If we go with the Micronaut approach, we probably want to create a new service (i.e., a new kube pod) so that this ticket does not also include the work of converting an existing service to Micronaut.

@lmossman lmossman changed the title Automatically find terminated/failed connection manager workflows and restart them Automatically find failed connection manager workflows and restart them Jul 22, 2022
@lmossman
Contributor Author

lmossman commented Jul 22, 2022

Moving this one back to the backlog, since I have pulled some of the work out into the other tickets linked in the description. For example, #14970 should be completed before this ticket is taken on.
FYI @benmoriceau

@lmossman lmossman removed their assignment Jul 22, 2022
@evantahler
Contributor

This story is pending the Micronaut scheduler for the "automatic" part.

4 participants
@evantahler @benmoriceau @lmossman and others