DT.AzureStorage: Poison message handling for corrupt orchestration state #794

cgillum · 2022-09-07T01:20:43Z

There are two main changes in this PR, both motivated by IcM 331554589:

Add pop receipt data in message logging

There are certain MessageGone scenarios that are difficult to debug because there is no pop-receipt information in the message traces. This PR adds this extra telemetry so that we can more easily see when the system is confused about which version of a message it is working with.

This change would have been useful in detecting a negative feedback loop situation where slow orchestration processing resulted in an infinite loop of dequeuing duplicate messages (the fix for the negative feedback loop problem is out of scope for this PR).
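To illustrate why pop-receipt data helps here, the sketch below scans a list of message traces and flags any delete that used a pop receipt other than the most recent one issued for that message. This is only an illustration: the trace tuple layout, the event names, and the `find_stale_deletes` helper are hypothetical and are not the actual DurableTask log schema or code from this PR.

```python
def find_stale_deletes(traces):
    """Return IDs of messages deleted with an out-of-date pop receipt.

    `traces` is a list of (event, message_id, pop_receipt) tuples.
    The layout and the event names "receive"/"delete" are assumptions
    made for this sketch, not the real DurableTask trace format.
    """
    latest_receipt = {}  # message_id -> pop receipt from the latest dequeue
    stale = []
    for event, message_id, pop_receipt in traces:
        if event == "receive":
            # Every dequeue of the same message yields a new pop receipt.
            latest_receipt[message_id] = pop_receipt
        elif event == "delete":
            # A delete with an older receipt means the worker is holding
            # a stale version of the message.
            if latest_receipt.get(message_id) != pop_receipt:
                stale.append(message_id)
    return stale


traces = [
    ("receive", "m1", "receipt-A"),
    ("receive", "m1", "receipt-B"),  # same message dequeued a second time
    ("delete", "m1", "receipt-A"),   # delete attempted with the old receipt
    ("receive", "m2", "receipt-C"),
    ("delete", "m2", "receipt-C"),   # delete with the current receipt
]
stale_ids = find_stale_deletes(traces)
```

Without the pop receipt in each trace, the two dequeues of `m1` would look identical and the mismatch would not be visible.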

Add defense against corrupt history events

History corruption will often result in poison-message scenarios. The root cause of the corruption for this particular CRI appears to be related to duplicate sub-orchestration execution. This PR doesn't attempt to fix the root cause, but rather to fix the poison message scenario it generates. Without this PR, we retry messages for the corrupt orchestration over and over again. Because these failures result in unhandled exceptions in DT.Core, DT.Core will also start slowing down orchestration processing significantly.

The fix works by identifying corrupt orchestration state (missing a start event) and deleting the messages associated with the invalid orchestration without attempting to save any other changes.
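The approach described above can be sketched roughly as follows. This is a simplified model rather than the code in this PR: `QueueMessage`, `FakeQueue`, `is_corrupt`, and `process_batch` are hypothetical names, history events are reduced to plain strings, and the sketch assumes "corrupt" means a non-empty history with no `ExecutionStarted` event.

```python
from dataclasses import dataclass, field


@dataclass
class QueueMessage:
    message_id: str
    pop_receipt: str


@dataclass
class FakeQueue:
    """In-memory stand-in for a storage queue (hypothetical)."""
    deleted: list = field(default_factory=list)

    def delete(self, message):
        self.deleted.append(message.message_id)


def is_corrupt(history):
    # Assumption: an empty history is a brand-new instance, so only a
    # non-empty history that lacks a start event is treated as corrupt.
    return len(history) > 0 and "ExecutionStarted" not in history


def process_batch(history, messages, queue, save_state):
    if is_corrupt(history):
        # Poison-message path: drop the messages so they are not retried,
        # and do not attempt to save any other changes.
        for message in messages:
            queue.delete(message)
        return "discarded"
    # Normal path: persist the new state, then remove the messages.
    save_state(history, messages)
    for message in messages:
        queue.delete(message)
    return "processed"


queue = FakeQueue()
saved = []
result = process_batch(
    ["TaskScheduled", "TaskCompleted"],  # history with no start event
    [QueueMessage("m1", "receipt-A")],
    queue,
    lambda history, messages: saved.append(len(messages)),
)
```

In this model the corrupt batch is deleted from the queue and `save_state` is never called, which is what stops the repeated retries.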

jviau

LGTM. I like the targeted fix to the issue - this is a safe solution.

cgillum · 2022-09-08T20:46:34Z

I got confirmation from the customer that this resolves their issue. Merging (also, FYI @yell0wfl4sh in case this PR might be useful for the issues you were seeing).

cgillum added 2 commits September 6, 2022 18:06

Add logging for message pop-receipts

3b0e2e1

Add defense against corrupt history state

bf59e01

cgillum requested review from jviau and amdeel September 7, 2022 01:20

cgillum mentioned this pull request Sep 7, 2022

DT.AzureStorage: Background renewal for pending orchestrator messages #792

Closed

jviau approved these changes Sep 7, 2022

cgillum merged commit 290abd2 into main Sep 8, 2022

cgillum deleted the cgillum/corruption2 branch September 8, 2022 20:47
