DT.AzureStorage: Poison message handling for corrupt orchestration state #794
There are two main changes in this PR, both motivated by IcM 331554589:
Add pop receipt data in message logging
There are certain `MessageGone` scenarios that are difficult to debug because there is no pop-receipt information in the message traces. This PR adds this extra telemetry so that we can more easily see when the system is confused about which version of a message it is working with. This change would have been useful in detecting a negative feedback loop situation where slow orchestration processing resulted in an infinite loop of dequeuing duplicate messages (the fix for the negative feedback loop problem is out of scope for this PR).
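As a rough illustration of why pop receipts help in the traces, here is a minimal Python sketch. It is not the DT.AzureStorage code: `trace_message`, `PopReceiptTracker`, and the log field names are hypothetical. It only shows the idea that once each trace carries the pop receipt, two traces for the same message ID with different receipts indicate the message was re-dequeued or updated in between.

```python
import logging
from typing import Dict

logger = logging.getLogger("queue-trace")


def trace_message(action: str, message_id: str, pop_receipt: str, dequeue_count: int) -> str:
    """Build and log one trace line that includes the pop receipt (hypothetical format)."""
    line = (
        f"{action}: MessageId={message_id} "
        f"PopReceipt={pop_receipt} DequeueCount={dequeue_count}"
    )
    logger.info(line)
    return line


class PopReceiptTracker:
    """Remembers the last pop receipt seen per message ID (illustrative helper)."""

    def __init__(self) -> None:
        self._last: Dict[str, str] = {}

    def observe(self, message_id: str, pop_receipt: str) -> bool:
        """Return True if this message ID was seen before with a different pop receipt."""
        previous = self._last.get(message_id)
        self._last[message_id] = pop_receipt
        return previous is not None and previous != pop_receipt
```

With traces like these, a failed delete can be matched against the receipt that was logged when the message was dequeued, which shows whether the worker was holding a stale version of the message.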
Add defense against corrupt history events
History corruption will often result in poison-message scenarios. The root cause of the corruption for this particular CRI appears to be related to duplicate sub-orchestration execution. This PR doesn't attempt to fix the root cause, but rather to fix the poison message scenario it generates. Without this PR, we retry messages for the corrupt orchestration over and over again. Because these failures result in unhandled exceptions in DT.Core, DT.Core will also start slowing down orchestration processing significantly.
The fix works by identifying corrupt orchestration state (missing a start event) and deleting the messages associated with the invalid orchestration without attempting to save any other changes.
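The control flow described above can be sketched roughly as follows. This is a simplified Python model, not the actual C# change: the types, the `"ExecutionStarted"` event name, and the `delete_message` / `save_state` callbacks are hypothetical stand-ins. It also assumes that only a non-empty history with no start event counts as corrupt, since an empty history may simply belong to an instance that has not started yet.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HistoryEvent:
    event_type: str  # e.g. "ExecutionStarted", "TaskCompleted" (illustrative names)


@dataclass
class QueueMessage:
    message_id: str
    pop_receipt: str


def is_corrupt(history: List[HistoryEvent]) -> bool:
    """Assumption: a non-empty history without a start event is corrupt."""
    return bool(history) and not any(
        e.event_type == "ExecutionStarted" for e in history
    )


def process_work_item(
    history: List[HistoryEvent],
    messages: List[QueueMessage],
    delete_message: Callable[[QueueMessage], None],
    save_state: Callable[[List[HistoryEvent]], None],
) -> str:
    """Discard messages for corrupt state; otherwise save state, then delete."""
    if is_corrupt(history):
        # Delete the poison messages and skip the save entirely, so the
        # corrupt orchestration is not retried over and over.
        for message in messages:
            delete_message(message)
        return "discarded"

    save_state(history)
    for message in messages:
        delete_message(message)
    return "processed"
```

The point of the early return is that nothing is written for the invalid orchestration; the only side effect is removing its messages from the queue.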