-
Notifications
You must be signed in to change notification settings - Fork 297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Azure Storage: Abandon prefetched orchestrator messages if the lease is lost #360
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some notes for additional context on this change:
&& DateTime.TryParse(eventRaisedEvent.Name.Substring(3), out var scheduledTime)) | ||
string eventName = eventRaisedEvent.Name; | ||
if (eventName != null && eventName.Length >= 3 && eventName[2] == '@' | ||
&& DateTime.TryParse(eventName.Substring(3), out DateTime scheduledTime)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a fix for a null-ref issue I observed when developing my test. I don't think we would hit it under normal circumstances with Durable Entities since user's don't control the Name
property of raised events, but it's theoretically possible for anyone using DurableTask.AzureStorage to hit this if they use our entity naming convention.
{ | ||
await this.AbandonAndReleaseSessionAsync(session); | ||
return null; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the relevant change. It turned out that we were already setting IsReleased
when we stop listening on a particular queue/lease. Here I'm just checking to see if a control queue was released before we try to process the messages.
@@ -221,6 +221,7 @@ public override Task DeleteMessageAsync(MessageData message, SessionBase session | |||
public void Release() | |||
{ | |||
this.releaseTokenSource.Cancel(); | |||
this.IsReleased = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IsReleased
would have been set to true
anyways because of the cancellation token handling logic, but there can be a bit of a delay. I added this additional IsReleased = true
here to speed up the process and make my new test succeed more quickly and reliably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would love to see a comment that explains this in the code, as it may confuse us in the future otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can add that.
/// with that partition are abandoned and not processed. | ||
/// </summary> | ||
[TestMethod] | ||
public async Task PartitionLost_AbandonPrefetchedSession() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like other tests in this class, we're testing methods on AzureStorageOrchestrationService
rather than going through TaskHubWorker
. We're still going through Azure Storage, but we have a bit more control using this test approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. At most would suggest a few comment clarifications.
@@ -221,6 +221,7 @@ public override Task DeleteMessageAsync(MessageData message, SessionBase session | |||
public void Release() | |||
{ | |||
this.releaseTokenSource.Cancel(); | |||
this.IsReleased = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would love to see a comment that explains this in the code, as it may confuse us in the future otherwise.
condition: () => !service.OwnedControlQueues.Any(), | ||
timeout: TimeSpan.FromSeconds(10)); | ||
|
||
// Small additional delay to account for tiny race condition. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the race condition that we may have gotten rid of the lease but not yet abandoned our work items yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No the work item is abandoned as part of the LockNextTaskOrchestrationWorkItemAsync
call, so there is no racing there. The race being referred to here is the fact that OwnedControlQueues
collection gets updated one instruction before the ControlQueue.Release()
is called, which could mean that TestHelpers.WaitFor
could theoretically complete and move onto the next step of the test too soon. It would be extremely rare, but not impossible. I added this delay to avoid the possibility of flakiness in the test execution.
Resolves Azure/azure-functions-durable-extension#382.
This is a small update to the orchestration work item polling logic in DurableTask.AzureStorage that proactively abandons messages if we've lost the lease for the source queue since fetching. The intent is to mitigate some of the split-brain problems see when partitions move around.
It won't fix 100% of split-brains caused by partition movement, but it should take care of the low-hanging fruit.