-
Notifications
You must be signed in to change notification settings - Fork 297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DT.AzureStorage: add randomization to history blob names to avoid races #891
Conversation
…ces in split-brain situations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the prompt patch. Just some questions!
@@ -1241,7 +1241,10 @@ static string GetBlobName(DynamicTableEntity entity, string property) | |||
throw new InvalidOperationException($"Could not compute the blob name for property {property}"); | |||
} | |||
|
|||
string blobName = $"{sanitizedInstanceId}/history-{sequenceNumber}-{eventType}-{property}.json.gz"; | |||
// randomize the blob name to prevent accidental races in split-brain situations (#890) | |||
uint random = (uint)(new Random()).Next(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason for us not to use a GUID in this case? I would suspect that may be more "unique" than a random number, but perhaps that's incorrect.
Also - do we need to be concerned about creating orphaned blobs as a result of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also - do we need to be concerned about creating orphaned blobs as a result of this?
I don't have it in front of me, but it would be good to check our purge logic to see if it might end up missing orphaned blobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A GUID of 128 bit is indeed more random than the 32bit used here. But for this particular purpose I think 32bit is more than enough - split brain does not happen frequently enough to make collisions likely. I didn't want to waste so much space in the blob name.
The purge operation seems to wipe out the entire blob directory belonging to the instance id, so no problem there.
One thing I am not sure about is the case where a history is replaced (e.g. after ContinueAsNew). There should be some code somewhere that removes blobs for the old history (even without the proposed randomization) but I haven't quite spotted it yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at the current code some more. It simply does not remove any old blobs when updating histories - so those blobs may accumulate until purging the instance. However, importantly, in some select cases, it can overwrite old blobs with a new blob of the same name. Tor example, when calling continue as new and the new blob happens to have the same event type, and the same large field, and be in the same row of the history, then the blob is overwritten.
This latter case is the important one for entities, since they often have large inputs (that is where the entity state is stored) so that state can be overwritten every time the entity does continue-as-new, which is all the time.
This does mean this PR currently has a problem with entities (and eternal orchestrators) since the blobs will accumulate forever. It does not seem very easy to fix since the TrackingStore implementation does not know whether the old history contained any large blobs that need to be removed. We can either issue a query, or track that information somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the issues were addressed.
…n updating the tracking store after a continue-as-new
I have implemented what I think is an o.k. solution, after some discussions. We still need add some tests though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left more preliminary questions
src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs
Outdated
Show resolved
Hide resolved
src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs
Outdated
Show resolved
Hide resolved
src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs
Outdated
Show resolved
Hide resolved
src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm quite uncomfortable with this change as it's currently implemented, mainly for two reasons:
- We're adding new public surface area that doesn't add any value to the abstraction
- The code is becoming more complicated since we need to cast interfaces and add new conditionals
As far as I can tell, ITrackingStore
is a public interface but we don't expose any way for 3rd parties to introduce their own tracking stores. We own all the implementations. For that reason, I'd much prefer we make a breaking change to ITrackingStore
rather than creating a new ITrackingStoreXXX
interface.
I am all for it - would be my preference also. Definitely cleaner and simpler. Will update the PR accordingly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more questions
src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs
Outdated
Show resolved
Hide resolved
src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs
Outdated
Show resolved
Hide resolved
src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more questions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more questions
Co-authored-by: David Justo <david.justo.1996@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but I suspect we may want to wait on @cgillum's approval too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but I suspect we may want to wait on @cgillum's approval too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but I suspect we may want to wait on @cgillum's approval too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments, though I don't think any of them are necessarily blocking.
@@ -216,6 +216,12 @@ public Task<string> DownloadAndDecompressAsBytesAsync(Uri blobUri) | |||
return DownloadAndDecompressAsBytesAsync(blob); | |||
} | |||
|
|||
public Task<bool> DeleteOrphanedBlobAsync(string blobName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Can we call this method DeleteBlobAsync
? I don't see any logic here related to orphaned data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea.
@@ -69,7 +69,8 @@ interface ITrackingStore | |||
/// <param name="instanceId">InstanceId for the Orchestration Update</param> | |||
/// <param name="executionId">ExecutionId for the Orchestration Update</param> | |||
/// <param name="eTag">The ETag value to use for safe updates</param> | |||
Task<string> UpdateStateAsync(OrchestrationRuntimeState newRuntimeState, OrchestrationRuntimeState oldRuntimeState, string instanceId, string executionId, string eTag); | |||
/// <param name="trackingStoreData">The additional data that is maintained for this execution.</param> | |||
Task<string> UpdateStateAsync(OrchestrationRuntimeState newRuntimeState, OrchestrationRuntimeState oldRuntimeState, string instanceId, string executionId, string eTag, object trackingStoreData); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: trackingStoreContext
might be a better word since "data" is heavily overloaded in this context (it would almost read as if this "data" is part of what we're updating).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it.
@@ -32,7 +33,7 @@ namespace DurableTask.AzureStorage.Tracking | |||
/// <summary> | |||
/// Tracking store for use with <see cref="AzureStorageOrchestrationService"/>. Uses Azure Tables and Azure Blobs to store runtime state. | |||
/// </summary> | |||
class AzureTableTrackingStore : TrackingStoreBase | |||
class AzureTableTrackingStore : TrackingStoreBase, ITrackingStore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TrackingStoreBase
already implements ITrackingStore
so you can remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely.
@@ -35,11 +35,20 @@ public OrchestrationHistory(IList<HistoryEvent> historyEvents, DateTime lastChec | |||
this.LastCheckpointTime = lastCheckpointTime; | |||
this.ETag = eTag; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's update this overload to call your new overload.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
// We had to wait until the new history has committed to make sure the blobs are no longer necessary. | ||
if (blobsToDelete != null) | ||
{ | ||
var tasks = new List<Task>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var tasks = new List<Task>(); | |
var tasks = new List<Task>(blobsToDelete.Count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
uint random = (uint)(new Random()).Next(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have to allocate a new Random()
each time or can we use a static Random
instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little tricky because the Random
class is unfortunately not thread-safe, so I would have to do something complicated or ugly. I think it is better to keep it simple to read and maintain, and the performance is probably fine considering that this is only called when we need a large blob, which is by itself going to use significant CPU and memory (so this little allocation is a tiny fraction of the cost).
We have installed a preview version of this lib (
immediately after this, we see this error:
At this point, the orchestration that calls our entity halts in the "running" state, and never continues as new as we would expect. To unstick we have had to purge orchestration history and start new. This has happened on 3 occasions over the past week. |
… 2.9.4 to 2.9.5 in /packages/web-workers (#2436) Bumps [Microsoft.Azure.WebJobs.Extensions.DurableTask](https://github.com/Azure/azure-functions-durable-extension) from 2.9.4 to 2.9.5. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/Azure/azure-functions-durable-extension/releases">Microsoft.Azure.WebJobs.Extensions.DurableTask's releases</a>.</em></p> <blockquote> <h2>v2.9.5</h2> <p><a href="https://www.nuget.org/packages/Microsoft.Azure.WebJobs.Extensions.DurableTask/2.9.5">https://www.nuget.org/packages/Microsoft.Azure.WebJobs.Extensions.DurableTask/2.9.5</a></p> <h1>Bug Fixes</h1> <h2>Microsoft.Azure.WebJobs.Extensions.DurableTask</h2> <ul> <li>Fix handling of in-flight orchestrations and activities during host shutdown. (<a href="https://redirect.github.com/Azure/azure-functions-durable-extension/issues/2456">#2456</a>) <ul> <li>Previously these were considered "failed", now they will be retried.</li> <li>This only affected dotnet-isolated and java workers.</li> </ul> </li> </ul> <h2>Microsoft.Azure.DurableTask.AzureStorage</h2> <ul> <li>Add randomization to history blob names to avoid races (Azure/durabletask/pull/891)</li> </ul> <h1>Dependency Updates</h1> <ul> <li><a href="https://github.com/Azure/durabletask/releases/tag/durabletask.azurestorage-v1.13.7">Microsoft.Azure.DurableTask.AzureStorage to v1.13.7</a></li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/Azure/azure-functions-durable-extension/commit/3a8795d39db0c74dabe65f8992ffbe6a01e080fb"><code>3a8795d</code></a> Ensure function abort on shutdown (<a href="https://redirect.github.com/Azure/azure-functions-durable-extension/issues/2456">#2456</a>) (<a href="https://redirect.github.com/Azure/azure-functions-durable-extension/issues/2477">#2477</a>)</li> <li>See full diff in <a href="https://github.com/Azure/azure-functions-durable-extension/compare/v2.9.4...v2.9.5">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=Microsoft.Azure.WebJobs.Extensions.DurableTask&package-manager=nuget&previous-version=2.9.4&new-version=2.9.5)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
As observed in #890 and Azure/azure-functions-durable-extension#2437, the current e-tag-based arbitration mechanism does not account for possible races on the blobs. The losing host may overwrite the blob of the winning host, causing inconsistent states.
This PR proposes a fix that adds randomization to blob names used for storing large history fields.
This randomization means that competing hosts will use different blob names to store the property, preventing the race.