
[MOVE EPIC] Improve Temporal Reliability and Monitoring #14277

Closed
benmoriceau opened this issue Jun 29, 2022 · 3 comments
benmoriceau commented Jun 29, 2022

Context

In Q2, we made our Temporal workflow implementation more robust in a few ways: we removed the quarantining logic for failed attempts, which had proved more painful than helpful in on-call (OC) situations, and we added logic so that manual operations performed on workflows in an unreachable state "heal" those workflows by restarting them and cleaning up the job state.

However, there are a few more changes we want to make to our Temporal implementation this quarter to continue improving reliability and monitoring.

Goals

  • Emit metrics around connection manager workflow restarts to help us understand when workflows are in a bad state (Emit a metric every time we restart a temporal workflow (submitted, run, failed) #13773)
  • Gracefully recover from non-deterministic exceptions when they occur, so that we can make changes to our temporal logic without fear of breaking existing workflows
  • Add tools to allow engineers to manually restart workflows with a given status to aid in OC situations

Metrics

In order to improve our understanding of when connection manager workflows are in a bad state, we need to emit metrics around failures and restarts.

Here is the ticket for temporal metric emission, which includes the details of what needs to be implemented: #13773
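As a rough illustration of the kind of counter emission #13773 calls for, here is a minimal plain-Java sketch. All names here (`MetricEmitter`, `RESTART_COUNTER`, `WorkflowRestarter`) are hypothetical stand-ins, not Airbyte's actual metric client or restart code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a real metrics client (e.g. a StatsD/DataDog wrapper).
final class MetricEmitter {
  private final Map<String, Long> counters = new ConcurrentHashMap<>();

  // Increment a named counter, tagged with why the workflow was restarted.
  void count(final String metricName, final String reasonTag) {
    counters.merge(metricName + ":" + reasonTag, 1L, Long::sum);
  }

  long get(final String metricName, final String reasonTag) {
    return counters.getOrDefault(metricName + ":" + reasonTag, 0L);
  }
}

final class WorkflowRestarter {
  static final String RESTART_COUNTER = "temporal_workflow_restart";
  private final MetricEmitter metrics;

  WorkflowRestarter(final MetricEmitter metrics) {
    this.metrics = metrics;
  }

  // Wherever a connection manager workflow is restarted, emit a counter
  // tagged with the state that triggered the restart (submitted, run, failed).
  void restartWorkflow(final String workflowId, final String priorState) {
    // ... the actual Temporal restart logic would go here ...
    metrics.count(RESTART_COUNTER, priorState);
  }
}
```

With counters tagged this way, a dashboard can break restarts down by prior state and alert when, for example, failed-state restarts spike.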

Non-deterministic exceptions

Non-deterministic exceptions are thrown by Temporal when a change is made to workflow logic that is incompatible with already-running workflows. For example, suppose a new activity is added in the middle of the connection manager workflow logic, and some workflow had already made it past that point in the logic. When Temporal tries to restore that workflow into memory and replays its activity history, it won't know how to handle the new activity, which does not exist in the history, and therefore throws a NonDeterministicException.

More info about this can be found in the temporal docs here: https://docs.temporal.io/typescript/determinism/
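To make the replay mechanics concrete, here is a toy, SDK-free illustration of why an inserted step breaks replay: the worker replays the recorded history step by step, and the first step in the new code that doesn't match the history is where a real SDK would raise a non-determinism error. All names are hypothetical; this is not the Temporal SDK:

```java
import java.util.List;

// Toy model: a workflow's recorded history is just an ordered list of activity names.
final class ReplayChecker {
  // Replays the new workflow definition (its ordered activity names) against the
  // recorded history. Returns the first activity that doesn't match the history,
  // i.e. the point at which a real SDK would detect non-determinism; returns
  // null if the definition is compatible with the history.
  static String firstMismatch(final List<String> recordedHistory, final List<String> newDefinition) {
    final int replayed = Math.min(recordedHistory.size(), newDefinition.size());
    for (int i = 0; i < replayed; i++) {
      if (!recordedHistory.get(i).equals(newDefinition.get(i))) {
        return newDefinition.get(i);
      }
    }
    // Appending new activities after the end of history is fine: the workflow
    // simply continues executing new code from where the history left off.
    return null;
  }
}
```

Note the asymmetry this toy captures: adding an activity after the point a workflow has reached is safe, while inserting one before that point is not.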

The issue is that, with the default Temporal configuration, Temporal will keep retrying the replay of the workflow's activity history every time it encounters one of these NonDeterministicExceptions. The only ways to resolve this are to manually terminate and restart the workflow, or to roll back the deployment to a version that is consistent with the workflow's activity history.

We would like to change our Temporal implementation to automatically and gracefully recover from non-deterministic exceptions in these cases, so that no manual intervention is required and changes to workflow logic can be made without fear of breaking existing workflows.

This will be done by implementing the following issues in this order:

  1. Implement TemporalClient method to restart all connection manager workflows with a given status #14970
  2. Automatically find failed connection manager workflows and restart them #14043
  3. Fail ConnectionManagerWorkflows when encountering non-deterministic exceptions #14969

See details of each inside the issues. Once the above has been completed in that order, NonDeterministicExceptions should result in connection manager workflows automatically being failed and restarted.
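In the Temporal Java SDK, one plausible way to implement step 3 is to register the exception type as a workflow-failing exception when registering the implementation, so a non-deterministic replay fails the run instead of being retried forever. This is a sketch under that assumption (`worker` and `ConnectionManagerWorkflowImpl` are assumed to exist already), not a confirmed description of the fix in #14969:

```java
import io.temporal.worker.NonDeterministicException;
import io.temporal.worker.WorkflowImplementationOptions;

// Sketch: configure the worker so that a non-deterministic replay fails the
// workflow run instead of being retried indefinitely. Once the workflow is in
// a failed state, the auto-restart logic from #14043 can pick it up and start
// a fresh run with a clean history.
final WorkflowImplementationOptions options = WorkflowImplementationOptions.newBuilder()
    .setFailWorkflowExceptionTypes(NonDeterministicException.class)
    .build();

worker.registerWorkflowImplementationTypes(options, ConnectionManagerWorkflowImpl.class);
```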

Manually restart workflow OC tools

With the completion of #14970, we will have a TemporalClient method to restart workflows with a given status. However, this method needs to be exposed via the internal airbyte-worker app for OC engineers to be able to use it.

The details of this are captured in this ticket:

This will be the first foray into manual temporal OC tools, and will likely inform future tooling choices.
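A plain-Java sketch of the selection logic such a tool might sit on top of. `WorkflowSummary` and the status strings are hypothetical stand-ins for whatever the real TemporalClient method from #14970 exposes:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical summary of a workflow execution, as a visibility query might return it.
record WorkflowSummary(String workflowId, String status) {}

final class RestartTool {
  // Select the IDs of all workflows in the given status; the caller would then
  // terminate and restart each one via the TemporalClient.
  static List<String> idsToRestart(final List<WorkflowSummary> executions, final String targetStatus) {
    return executions.stream()
        .filter(w -> w.status().equals(targetStatus))
        .map(WorkflowSummary::workflowId)
        .collect(Collectors.toList());
  }
}
```

Keeping selection separate from the terminate-and-restart side effect would make the OC tool easy to dry-run: an engineer can list the matching workflow IDs before acting on them.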

@lmossman lmossman changed the title [Q3] Remove the need of any manual intervention on temporal [Q3] Improve Temporal Reliability and Monitoring Jul 22, 2022
evantahler commented Jul 22, 2022

A question about non-deterministic exceptions:
Today, we version our workflows to make sure that old workflows aren't surprised by new/changed activities:

```java
final int attemptCreationVersion =
    Workflow.getVersion(CHECK_BEFORE_SYNC_TAG, Workflow.DEFAULT_VERSION, CHECK_BEFORE_SYNC_CURRENT_VERSION);
if (attemptCreationVersion < CHECK_BEFORE_SYNC_CURRENT_VERSION) {
  // return early if this instance of the workflow was created beforehand
  return checkFailure;
}
```

for example.

Once we complete this work, and non-deterministic workflows are restarted reasonably quickly... can we stop worrying about workflow versions in our code by /expecting/ workflows to start over the next time they sync? Can part of this epic be to remove all of our version checks?

Answering this question involves learning more about what happens when a deployment occurs while a job is running. If a deployment happens during the sync phase, then when the sync activity is complete and normalization starts, will the whole sync restart, or will the workflow continue on to normalization?

@pmossman
@xiaohansong has been looking into exporting the built-in Temporal SDK metrics to DataDog for improving Cloud monitors, here's the open PR for reference: https://github.com/airbytehq/airbyte/pull/14842/files

The metrics you're referring to here might be custom metrics, but I wanted to call this out in case there's any underlying work that should be re-used here.

@evantahler evantahler changed the title [Q3] Improve Temporal Reliability and Monitoring [EPIC] Improve Temporal Reliability and Monitoring Aug 1, 2022
@davinchia davinchia changed the title [EPIC] Improve Temporal Reliability and Monitoring [MOVE EPIC] Improve Temporal Reliability and Monitoring Apr 7, 2023
@davinchia
Initial scope has mostly been done.
