Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DT.AzureStorage: Poison message handling for corrupt orchestration state #794

Merged
merged 2 commits into from
Sep 8, 2022

Conversation

cgillum
Copy link
Member

@cgillum cgillum commented Sep 7, 2022

There are two main changes in this PR, both motivated by IcM 331554589:

Add pop receipt data in message logging

There are certain MessageGone scenarios that are difficult to debug because there is no pop-receipt information in the message traces. This PR adds this extra telemetry so that we can more easily see when the system is confused about which version of a message it is working with.

This change would have been useful in detecting a negative feedback loop situation where slow orchestration processing resulted in an infinite loop of dequeuing duplicate messages (the fix for the negative feedback loop problem is out of scope for this PR).

Add defense against corrupt history events

History corruption will often result in poison-message scenarios. The root cause of the corruption for this particular CRI appears to be related to duplicate sub-orchestration execution. This PR doesn't attempt to fix the root cause, but rather to fix the poison message scenario it generates. Without this PR, we retry messages for the corrupt orchestration over and over again. Because these failures result in unhandled exceptions in DT.Core, DT.Core will also start slowing down orchestration processing significantly.

The fix works by identifying corrupt orchestration state (missing a start event) and deleting the messages associated with the invalid orchestration without attempting to save any other changes.

Copy link
Collaborator

@jviau jviau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I like the targeted fix to the issue - this is a safe solution.

@cgillum
Copy link
Member Author

cgillum commented Sep 8, 2022

I got confirmation from the customer that this resolves their issue. Merging (also, FYI @yell0wfl4sh in case this PR might be useful for the issues you were seeing).

@cgillum cgillum merged commit 290abd2 into main Sep 8, 2022
@cgillum cgillum deleted the cgillum/corruption2 branch September 8, 2022 20:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants