
[MOVE EPIC] Improve Temporal Reliability and Monitoring #14277

Closed
benmoriceau opened this issue Jun 29, 2022 · 3 comments
benmoriceau commented Jun 29, 2022

Context

In Q2, we made our Temporal workflow implementation more robust in a few ways: we removed the quarantining logic for failed attempts, which had proved more painful than helpful in on-call (OC) situations, and we added logic so that manual operations performed on workflows in an unreachable state "heal" those workflows by restarting them and cleaning up the job state.

However, there are a few more changes we want to make to our Temporal implementation this quarter to continue improving reliability and monitoring.

Goals

  • Emit metrics around connection manager workflow restarts to help us understand when workflows are in a bad state (Emit a metric every time we restart a temporal workflow (submitted, run, failed) #13773)
  • Gracefully recover from non-deterministic exceptions when they occur, so that we can make changes to our temporal logic without fear of breaking existing workflows
  • Add tools to allow engineers to manually restart workflows with a given status to aid in OC situations

Metrics

In order to improve our understanding of when connection manager workflows are in a bad state, we need to emit metrics around failures and restarts.

Here is the ticket for temporal metric emission, which includes the details of what needs to be implemented: #13773
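As a rough illustration of the kind of counter emission #13773 calls for, here is a minimal plain-Java sketch. All names here (`MetricEmitter`, `RESTART_COUNTER`, `WorkflowRestarter`) are hypothetical stand-ins, not Airbyte's actual metric client or restart code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a real metrics client (e.g. a StatsD/DataDog wrapper).
final class MetricEmitter {
  private final Map<String, Long> counters = new ConcurrentHashMap<>();

  // Increment a named counter, tagged with why the workflow was restarted.
  void count(final String metricName, final String reasonTag) {
    counters.merge(metricName + ":" + reasonTag, 1L, Long::sum);
  }

  long get(final String metricName, final String reasonTag) {
    return counters.getOrDefault(metricName + ":" + reasonTag, 0L);
  }
}

final class WorkflowRestarter {
  static final String RESTART_COUNTER = "temporal_workflow_restart";
  private final MetricEmitter metrics;

  WorkflowRestarter(final MetricEmitter metrics) {
    this.metrics = metrics;
  }

  // Wherever a connection manager workflow is restarted, emit a counter
  // tagged with the state that triggered the restart (submitted, run, failed).
  void restartWorkflow(final String workflowId, final String priorState) {
    // ... the actual Temporal restart logic would go here ...
    metrics.count(RESTART_COUNTER, priorState);
  }
}
```

With counters tagged this way, a dashboard can break restarts down by prior state and alert when, for example, failed-state restarts spike.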

Non-deterministic exceptions

Non-deterministic exceptions are thrown by Temporal when a change is made to workflow logic that is incompatible with already-running workflows. For example, suppose a new activity is added in the middle of the connection manager workflow logic, and some workflow had already made it past that point in the logic. When Temporal tries to restore that workflow into memory and replays its activity history, it won't know how to handle the new activity, which does not exist in the history, and therefore throws a NonDeterministicException.

More info about this can be found in the temporal docs here: https://docs.temporal.io/typescript/determinism/
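To make the replay mechanics concrete, here is a toy, SDK-free illustration of why an inserted step breaks replay: the worker replays the recorded history step by step, and the first step in the new code that doesn't match the history is where a real SDK would raise a non-determinism error. All names are hypothetical; this is not the Temporal SDK:

```java
import java.util.List;

// Toy model: a workflow's recorded history is just an ordered list of activity names.
final class ReplayChecker {
  // Replays the new workflow definition (its ordered activity names) against the
  // recorded history. Returns the first activity that doesn't match the history,
  // i.e. the point at which a real SDK would detect non-determinism; returns
  // null if the definition is compatible with the history.
  static String firstMismatch(final List<String> recordedHistory, final List<String> newDefinition) {
    final int replayed = Math.min(recordedHistory.size(), newDefinition.size());
    for (int i = 0; i < replayed; i++) {
      if (!recordedHistory.get(i).equals(newDefinition.get(i))) {
        return newDefinition.get(i);
      }
    }
    // Appending new activities after the end of history is fine: the workflow
    // simply continues executing new code from where the history left off.
    return null;
  }
}
```

Note the asymmetry this toy captures: adding an activity after the point a workflow has reached is safe, while inserting one before that point is not.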

The issue is that, with the default Temporal configuration, Temporal will keep retrying the replay of the workflow's activity history every time it encounters one of these NonDeterministicExceptions. The only ways to resolve this are to manually terminate and restart the workflow, or to roll back the deployment to a version that is consistent with the workflow's activity history.

We would like to change our Temporal implementation to automatically and gracefully recover from non-deterministic exceptions in these cases, so that no manual intervention is required and changes to workflow logic can be made without fear of breaking existing workflows.

This will be done by implementing the following issues in this order:

  1. Implement TemporalClient method to restart all connection manager workflows with a given status #14970
  2. Automatically find failed connection manager workflows and restart them #14043
  3. Fail ConnectionManagerWorkflows when encountering non-deterministic exceptions #14969

See details of each inside the issues. Once the above has been completed in that order, NonDeterministicExceptions should result in connection manager workflows automatically being failed and restarted.
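In the Temporal Java SDK, one plausible way to implement step 3 is to register the exception type as a workflow-failing exception when registering the implementation, so a non-deterministic replay fails the run instead of being retried forever. This is a sketch under that assumption (`worker` and `ConnectionManagerWorkflowImpl` are assumed to exist already), not a confirmed description of the fix in #14969:

```java
import io.temporal.worker.NonDeterministicException;
import io.temporal.worker.WorkflowImplementationOptions;

// Sketch: configure the worker so that a non-deterministic replay fails the
// workflow run instead of being retried indefinitely. Once the workflow is in
// a failed state, the auto-restart logic from #14043 can pick it up and start
// a fresh run with a clean history.
final WorkflowImplementationOptions options = WorkflowImplementationOptions.newBuilder()
    .setFailWorkflowExceptionTypes(NonDeterministicException.class)
    .build();

worker.registerWorkflowImplementationTypes(options, ConnectionManagerWorkflowImpl.class);
```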

Manually restart workflow OC tools

With the completion of #14970, we will have a TemporalClient method to restart workflows with a given status. However, this method needs to be exposed via the internal airbyte-worker app for OC engineers to be able to use it.

The details of this are captured in this ticket:

This will be the first foray into manual temporal OC tools, and will likely inform future tooling choices.
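A plain-Java sketch of the selection logic such a tool might sit on top of. `WorkflowSummary` and the status strings are hypothetical stand-ins for whatever the real TemporalClient method from #14970 exposes:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical summary of a workflow execution, as a visibility query might return it.
record WorkflowSummary(String workflowId, String status) {}

final class RestartTool {
  // Select the IDs of all workflows in the given status; the caller would then
  // terminate and restart each one via the TemporalClient.
  static List<String> idsToRestart(final List<WorkflowSummary> executions, final String targetStatus) {
    return executions.stream()
        .filter(w -> w.status().equals(targetStatus))
        .map(WorkflowSummary::workflowId)
        .collect(Collectors.toList());
  }
}
```

Keeping selection separate from the terminate-and-restart side effect would make the OC tool easy to dry-run: an engineer can list the matching workflow IDs before acting on them.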

@lmossman lmossman changed the title [Q3] Remove the need of any manual intervention on temporal [Q3] Improve Temporal Reliability and Monitoring Jul 22, 2022
evantahler commented Jul 22, 2022

A question about non-deterministic exceptions:
Today, we version our workflows to make sure that old workflows aren't surprised by new/changed activities:

```java
final int attemptCreationVersion =
    Workflow.getVersion(CHECK_BEFORE_SYNC_TAG, Workflow.DEFAULT_VERSION, CHECK_BEFORE_SYNC_CURRENT_VERSION);
if (attemptCreationVersion < CHECK_BEFORE_SYNC_CURRENT_VERSION) {
  // return early if this instance of the workflow was created beforehand
  return checkFailure;
}
```

for example.

Once we complete this work, and non-deterministic workflows are restarted reasonably quickly... can we stop worrying about workflow versions in our code by /expecting/ workflows to start over the next time they sync? Can part of this epic be to remove all of our version checks?

Answering this question involves learning more about what happens when a deployment occurs while a job is running. If a deployment happens during the sync phase, then when the sync activity is complete and normalization starts, will the whole sync restart, or will the workflow continue on to normalization?

@pmossman
@xiaohansong has been looking into exporting the built-in Temporal SDK metrics to DataDog for improving Cloud monitors, here's the open PR for reference: https://github.com/airbytehq/airbyte/pull/14842/files

The metrics you're referring to here might be custom metrics, but I wanted to call this out in case there's any underlying work that should be re-used here.

@evantahler evantahler changed the title [Q3] Improve Temporal Reliability and Monitoring [EPIC] Improve Temporal Reliability and Monitoring Aug 1, 2022
@davinchia davinchia changed the title [EPIC] Improve Temporal Reliability and Monitoring [MOVE EPIC] Improve Temporal Reliability and Monitoring Apr 7, 2023
@davinchia
Initial scope has mostly been done.
