-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MOVE EPIC] Improve Temporal Reliability and Monitoring #14277
Comments
A question about non-deterministic exceptions: Lines 353 to 359 in a064d4b
Once we complete this work, and non-deterministic workflows are restarted reasonably quickly... can we stop worrying about workflow versions in our code by /expecting/ workflows to start over the next time they sync? Can part of this epic be to remove all of our version checks? Answering this question involves learning more about what happens when a deployment happens while a job is running. If a deployment happens dying the sync phase, when the sync activity is complete and normalization starts, will the whole sync restart, or will the workflow continue on to normalize? |
@xiaohansong has been looking into exporting the built-in Temporal SDK metrics to DataDog for improving Cloud monitors, here's the open PR for reference: https://github.com/airbytehq/airbyte/pull/14842/files The metrics you're referring to here might be custom metrics but wanted to call this out in case there's any underlying work that should be re-used here |
Initial scope has mostly been done. |
Context
In Q2, our Temporal workflow implementation was improved in a few ways to make them more robust; mainly, we removed the quarantining logic for failed attempts, as it proved to be more painful than helpful in OC situations, and we added logic to make manual operations performed on workflows in an unreachable state "heal" the workflows by restarting them and cleaning the job state.
However, there are a few more changes we want to make to our Temporal implementation this quarter to continue improving reliability and monitoring.
Goals
Metrics
In order to improve our understanding of when connection manager workflows are in a bad state, we need to emit metrics around failures and restarts.
Here is the ticket for temporal metric emission, which includes the details of what needs to be implemented: #13773
Non-deterministic exceptions
Non-deterministic exceptions are thrown by temporal when a change is made to the workflow logic that is incompatible with already-running workflows. For example, if a new activity is added in the middle of the connection manager workflow logic, and there is some workflow that had already made it past that point in the logic, then when temporal tries to restore that workflow into memory and replays its activity history, it won't know how to handle this new activity that does not exist in the history, and therefore throws a NonDeterministicException.
More info about this can be found in the temporal docs here: https://docs.temporal.io/typescript/determinism/
The issue is that with the default temporal configuration, temporal will keep retrying to replay the workflow activity history over and over again when it encounters one of these NonDeterministicExceptions, meaning the only way to resolve this is to manually terminate and restart the workflow, or to roll back the deployment to a version that is consistent with the workflow's activity history.
We would like to change our implementation of temporal to automatically and gracefully recover from non-deterministic exceptions in these cases, so that no manual intervention is required and changes to workflow logic can be made without fear of breaking existing workflows.
This will be done by implementing the following issues in this order:
See details of each inside the issues. Once the above has been completed in that order, NonDeterministicExceptions should result in connection manager workflows automatically being failed and restarted.
Manually restart workflow OC tools
With the completion of #14970, we will now have a TemporalClient method to restart workflows with a given status. However, this method needs to be exposed of the internal airbyte-worker app for it to be used by OC engineers.
The details of this are captured in this ticket:
This will be the first foray into manual temporal OC tools, and will likely inform future tooling choices.
The text was updated successfully, but these errors were encountered: