
Azure Storage: Abandon prefetched orchestrator messages if the lease is lost #360

Merged Jan 6, 2020 (2 commits)

Conversation

@cgillum (Member) commented Jan 6, 2020:

Resolves Azure/azure-functions-durable-extension#382.

This is a small update to the orchestration work item polling logic in DurableTask.AzureStorage that proactively abandons messages if we've lost the lease for the source queue since fetching them. The intent is to mitigate some of the split-brain problems seen when partitions move around.

It won't fix 100% of split-brains caused by partition movement, but it should take care of the low-hanging fruit.

@cgillum cgillum added the durable functions This issue is an ask from the Durable Functions team. label Jan 6, 2020
@cgillum cgillum requested a review from ConnorMcMahon January 6, 2020 18:20
@cgillum cgillum self-assigned this Jan 6, 2020
@cgillum (Member, Author) left a comment:
Added some notes for additional context on this change:

-     && DateTime.TryParse(eventRaisedEvent.Name.Substring(3), out var scheduledTime))
+ string eventName = eventRaisedEvent.Name;
+ if (eventName != null && eventName.Length >= 3 && eventName[2] == '@'
+     && DateTime.TryParse(eventName.Substring(3), out DateTime scheduledTime))
@cgillum (Member, Author):

This is a fix for a null-ref issue I observed when developing my test. I don't think we would hit it under normal circumstances with Durable Entities, since users don't control the Name property of raised events, but it's theoretically possible for anyone using DurableTask.AzureStorage to hit this if they use our entity naming convention.
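The guarded check can be sketched as a small standalone program. This is a hypothetical helper mirroring the shape of the diff, not the real inline code, and the "op@timestamp" event name is an assumed example of the entity naming convention:

```csharp
using System;

class Example
{
    // Hypothetical helper mirroring the guarded check in the diff above; the
    // real code lives inline in DurableTask.AzureStorage, not in a helper.
    public static bool TryGetScheduledTime(string eventName, out DateTime scheduledTime)
    {
        scheduledTime = default(DateTime);
        return eventName != null
            && eventName.Length >= 3
            && eventName[2] == '@'
            && DateTime.TryParse(eventName.Substring(3), out scheduledTime);
    }

    static void Main()
    {
        // Entity-style name: a two-character prefix, '@', then a timestamp.
        Console.WriteLine(TryGetScheduledTime("op@2020-01-06T18:20:00Z", out _)); // True

        // Before the fix, a null Name would throw a NullReferenceException on
        // the Substring call; now it simply fails the check.
        Console.WriteLine(TryGetScheduledTime(null, out _));                      // False
    }
}
```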

{
    await this.AbandonAndReleaseSessionAsync(session);
    return null;
}
@cgillum (Member, Author):

This is the relevant change. It turned out that we were already setting IsReleased when we stop listening on a particular queue/lease. Here I'm just checking to see if a control queue was released before we try to process the messages.
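The control flow can be simulated in a minimal standalone sketch. All type and member names below (ControlQueue, Session, LockNextWorkItemAsync) are stand-ins for illustration, not the real DurableTask.AzureStorage API:

```csharp
using System;
using System.Threading.Tasks;

// Stand-in types simulating the released-queue check described above.
class ControlQueue
{
    public bool IsReleased { get; set; }
}

class Session
{
    public ControlQueue ControlQueue { get; } = new ControlQueue();
}

class Worker
{
    // Mirrors the shape of the change: if the lease for the source queue was
    // lost after the messages were prefetched, abandon them so the queue's
    // new owner can process them, and report that there is no work item.
    public async Task<Session> LockNextWorkItemAsync(Session prefetched)
    {
        if (prefetched.ControlQueue.IsReleased)
        {
            await AbandonAndReleaseSessionAsync(prefetched);
            return null;
        }

        return prefetched;
    }

    Task AbandonAndReleaseSessionAsync(Session session) => Task.CompletedTask;

    static async Task Main()
    {
        var worker = new Worker();
        var session = new Session();
        session.ControlQueue.IsReleased = true;
        Console.WriteLine(await worker.LockNextWorkItemAsync(session) == null); // True
    }
}
```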

@@ -221,6 +221,7 @@ public override Task DeleteMessageAsync(MessageData message, SessionBase session
public void Release()
{
this.releaseTokenSource.Cancel();
this.IsReleased = true;
@cgillum (Member, Author):

IsReleased would have been set to true anyway because of the cancellation token handling logic, but there can be a bit of a delay. I added this additional IsReleased = true here to speed up the process and make my new test pass more quickly and reliably.

@ConnorMcMahon (Contributor):

Would love to see a comment that explains this in the code, as it may confuse us in the future otherwise.

@cgillum (Member, Author):

Sure, I can add that.

/// with that partition are abandoned and not processed.
/// </summary>
[TestMethod]
public async Task PartitionLost_AbandonPrefetchedSession()
@cgillum (Member, Author):
Like other tests in this class, we're testing methods on AzureStorageOrchestrationService rather than going through TaskHubWorker. We're still going through Azure Storage, but we have a bit more control using this test approach.

@ConnorMcMahon (Contributor) left a comment:

LGTM. At most, I would suggest a few comment clarifications.

condition: () => !service.OwnedControlQueues.Any(),
timeout: TimeSpan.FromSeconds(10));

// Small additional delay to account for tiny race condition.
@ConnorMcMahon (Contributor):

Is the race condition that we may have gotten rid of the lease but not yet abandoned our work items?

@cgillum (Member, Author):

No, the work item is abandoned as part of the LockNextTaskOrchestrationWorkItemAsync call, so there is no racing there. The race referred to here is that the OwnedControlQueues collection gets updated one instruction before ControlQueue.Release() is called, which means TestHelpers.WaitFor could theoretically complete and move on to the next step of the test too soon. It would be extremely rare, but not impossible. I added this delay to avoid the possibility of flakiness in the test execution.
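The wait-then-delay pattern the test uses can be sketched as follows. TestHelpersSketch.WaitFor is a hypothetical stand-in for the real TestHelpers.WaitFor, whose signature may differ:

```csharp
using System;
using System.Threading.Tasks;

class TestHelpersSketch
{
    // Poll a condition until it becomes true or a timeout elapses; a rough
    // stand-in for the TestHelpers.WaitFor used by the test.
    public static async Task WaitFor(Func<bool> condition, TimeSpan timeout)
    {
        DateTime deadline = DateTime.UtcNow + timeout;
        while (!condition())
        {
            if (DateTime.UtcNow > deadline)
                throw new TimeoutException("Condition was not met in time.");
            await Task.Delay(100);
        }
    }

    static async Task Main()
    {
        bool ownedQueuesEmpty = false;
        var background = Task.Run(async () =>
        {
            await Task.Delay(300);
            ownedQueuesEmpty = true;
        });

        // Wait until the observed condition flips...
        await WaitFor(() => ownedQueuesEmpty, TimeSpan.FromSeconds(10));

        // ...then pause briefly: the condition can become true one instruction
        // before the corresponding Release() call runs, so a small extra delay
        // avoids acting on a not-quite-complete release.
        await Task.Delay(TimeSpan.FromSeconds(1));
        Console.WriteLine("done");
    }
}
```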

Labels
durable functions This issue is an ask from the Durable Functions team.
Development

Successfully merging this pull request may close these issues.

Lease stealing can result in duplicate function execution
2 participants