Fail connection manager workflow on non-deterministic exception #14758

Merged
benmoriceau merged 10 commits into master from lmossman/fail-on-non-deterministic-exception
Oct 3, 2022

Conversation

Contributor

@lmossman lmossman commented Jul 15, 2022

What

Resolves #13973

How

Uses the Temporal API so that a non-deterministic exception causes the workflow to be marked as failed, rather than being retried indefinitely.

Note: This should probably only be merged once this other issue has been completed, so that the failed workflows are automatically restarted when this occurs: #14043
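
To make the behavior change concrete, here is a minimal, stdlib-only Python sketch. It is a toy model of the retry policy, not the Temporal SDK or the code in this PR: `NonDeterministicError`, `run_workflow_task`, and the `fail_on` and `max_attempts` parameters are all hypothetical names introduced for illustration, and `max_attempts` exists only so the simulation terminates.

```python
class NonDeterministicError(Exception):
    """Stand-in for a replay mismatch. Not the Temporal SDK's exception class."""


def run_workflow_task(task, fail_on=(), max_attempts=5):
    """Return (status, attempts) after running `task` under a toy retry policy.

    `max_attempts` only bounds this simulation; the behavior the PR replaces
    retries indefinitely.
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            task()
        except tuple(fail_on):
            # Exception type is on the fail list: stop retrying, mark the workflow failed.
            return "FAILED", attempts
        except Exception:
            # Any other exception: retry the task.
            continue
        return "COMPLETED", attempts
    # The simulation's attempt budget ran out and the workflow never progressed.
    return "RUNNING", attempts


def replay_with_changed_code():
    raise NonDeterministicError("recorded history does not match the current workflow code")


# Old behavior: the task is retried and keeps hitting the same exception.
print(run_workflow_task(replay_with_changed_code))  # ('RUNNING', 5)

# New behavior: the exception type is configured to fail the workflow.
print(run_workflow_task(replay_with_changed_code, fail_on=(NonDeterministicError,)))  # ('FAILED', 1)
```

In the model, a workflow that ends as `FAILED` stays that way until something restarts it, which is why the note above ties this change to #14043.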

@github-actions github-actions bot added area/platform issues related to the platform area/worker Related to worker labels Jul 15, 2022
@lmossman lmossman temporarily deployed to more-secrets July 15, 2022 18:37 Inactive
@benmoriceau benmoriceau marked this pull request as ready for review August 8, 2022 10:48
@benmoriceau benmoriceau temporarily deployed to more-secrets August 8, 2022 10:51 Inactive
Contributor

@jdpgrailsdev jdpgrailsdev left a comment

:shipit:

Contributor

@gosusnp gosusnp left a comment

I am wondering, how complex would it be to write a test for this?

@lmossman
Contributor Author

lmossman commented Aug 8, 2022

I'm not exactly sure what a test for this would look like, since it would require changing the order or names of activities for a running workflow, which I've only done by changing code. Might be possible though - probably worth having someone spend some time looking into that.
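
For readers unfamiliar with why reordering or renaming activities triggers the exception: on replay, the activities recorded in a running workflow's history are compared with what the current code schedules, and a mismatch is reported as non-determinism. The sketch below is a simplified, stdlib-only Python model of that comparison, not Temporal's actual replay logic. `check_replay` and the activity names are hypothetical.

```python
class NonDeterministicError(Exception):
    """Stand-in for a replay mismatch. Not the Temporal SDK's exception class."""


def check_replay(recorded, scheduled):
    """Compare activities recorded in history with those the current code schedules."""
    for position, (old, new) in enumerate(zip(recorded, scheduled)):
        if old != new:
            raise NonDeterministicError(
                f"event {position}: history recorded {old!r}, code scheduled {new!r}"
            )


# Hypothetical activity names, for illustration only.
history = ["get_sync_input", "run_sync"]
original_code = ["get_sync_input", "run_sync", "report_success"]
reordered_code = ["run_sync", "get_sync_input", "report_success"]

check_replay(history, original_code)  # passes: the code agrees with the history so far

try:
    check_replay(history, reordered_code)
except NonDeterministicError as error:
    print(error)  # event 0: history recorded 'get_sync_input', code scheduled 'run_sync'
```

A test along these lines would need a recorded history from the old code plus a second workflow implementation with a different activity order, which matches the difficulty described in the comment above.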

@lmossman lmossman changed the title Fail connection manager workflow on non-deterministic exception [DO NOT MERGE] Fail connection manager workflow on non-deterministic exception Aug 8, 2022
@lmossman lmossman changed the title [DO NOT MERGE] Fail connection manager workflow on non-deterministic exception Fail connection manager workflow on non-deterministic exception Aug 8, 2022
@lmossman
Contributor Author

lmossman commented Aug 8, 2022

@benmoriceau I added a note to the description of this ticket - it's up to you all, but we may want to wait to merge this until the issue to find and restart failed workflows has been completed (#14043).

If we merge this first, then non-deterministic exceptions will cause workflows to be set to Failed without anything automatically fixing them. Though, I'm not sure if this is actually worse or better than the current behavior (Temporal just retrying the workflows indefinitely, running into the non-deterministic exception over and over). It feels like the current behavior is probably preferable, as fixing that just requires rolling back to the previous deployment, which should result in workflows automatically recovering. Whereas if we mark them as failed without the automatic recovery process, then we will have to manually go and restart a bunch of workflows even if we roll back.

I'll leave that decision up to the team!

@benmoriceau benmoriceau temporarily deployed to more-secrets August 11, 2022 18:58 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets August 18, 2022 17:14 Inactive
….com:airbytehq/airbyte into lmossman/fail-on-non-deterministic-exception
@benmoriceau benmoriceau temporarily deployed to more-secrets August 19, 2022 15:24 Inactive
@evantahler
Contributor

#14043 has been merged. @jdpgrailsdev / @benmoriceau is it time to merge this?

@lmossman
Contributor Author

@evantahler if we merge this before we have the process in place that automatically restarts failed workflows (I think that is ticket #15218?), then non-deterministic exceptions will result in workflows being set to Failed with nothing automatically restarting them.

See my comment here - I think it may be slightly preferable to keep the current behavior until we have that, but it's up to you all.

@benmoriceau benmoriceau requested a review from a team as a code owner September 26, 2022 22:25
@github-actions github-actions bot added the area/frontend Related to the Airbyte webapp label Sep 26, 2022
@benmoriceau benmoriceau temporarily deployed to more-secrets September 26, 2022 22:27 Inactive
@benmoriceau benmoriceau temporarily deployed to more-secrets October 3, 2022 15:35 Inactive
@benmoriceau
Contributor

This is ready to be merged. I will do it once green.

@benmoriceau benmoriceau merged commit 2345971 into master Oct 3, 2022
@benmoriceau benmoriceau deleted the lmossman/fail-on-non-deterministic-exception branch October 3, 2022 17:01
jhammarstedt pushed a commit to jhammarstedt/airbyte that referenced this pull request Oct 31, 2022
…ytehq#14758)

* fail connection manager workflow on non-deterministic exception

* Update where the config is added

Co-authored-by: Benoit Moriceau <benoit@airbyte.io>
Labels
area/frontend Related to the Airbyte webapp area/platform issues related to the platform area/worker Related to worker
Development

Successfully merging this pull request may close these issues.

Automatically handle non-deterministic temporal errors
5 participants