-
Notifications
You must be signed in to change notification settings - Fork 272
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Apply new consumption defaults in AzureStorageDurabilityProvider #1706
Conversation
src/WebJobs.Extensions.DurableTask/AzureStorageDurabilityProviderFactory.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/AzureStorageDurabilityProviderFactory.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/DefaultPlatformInformationProvider.cs
Show resolved
Hide resolved
I'm asking for a review now to make sure we feel comfortable with the current approach, I do realize there are still a few leftovers, which I list above :) . Thanks everyone |
src/WebJobs.Extensions.DurableTask/AzureStorageDurabilityProviderFactory.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, good initial stab at this.
I propose a more pattern matchy approach to platform information that may or may not be a good idea, along with some specific nits about the new defaults.
src/WebJobs.Extensions.DurableTask/AzureStorageDurabilityProviderFactory.cs
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/AzureStorageDurabilityProviderFactory.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/AzureStorageDurabilityProviderFactory.cs
Outdated
Show resolved
Hide resolved
Performance data for Baseline (Durable-Extension v.2.4.1) Note: The Duration column is calculated as the time difference between the orchestrator's Time to completion
InstanceIDs to review raw data
|
Performance data for commit Note: The Duration column is calculated as the time difference between the orchestrator's Time to completion
InstanceIDs to review raw data
Update: these results have been updated. Previously, I had the duration for benchmark B as 5 minutes but apparently it was 15 and I had just made a typo. Whoops! |
I also just updated the numbers for benchmark D for PRv2 and PRv3 above ^. Before this, they were TBD. |
Performance data for PRv4 this.azureStorageOptions.ControlQueueBufferThreshold = 96; // was 32 in PRv1, PRv2, and PRv3 Note: The Duration column is calculated as the time difference between the orchestrator's Time to completion
InstanceIDs to review raw data
Comments: This seems comparable to the baseline in benchmarks A and B and C, but its 4 times faster than the baseline for benchmark D: a drop from 12 hours to just 3!. It is also faster in benchmark D for PRv1, PRv2, and PRv3, which average about 8 hours. Finally, this version is also faster than PRv1, PRv2, and PRv3 on benchmark B by a factor of 3. I think this is our best configuration so far. |
We may also need to update our documentation regarding defaults. That may be tricky if we soon end up with multiple variable playing into these defaults. Maybe we don't document the default values, and say we make a best effort based on platform/language to select intelligent defaults? @cgillum, thoughts about how we document this? |
I worry about this change for languages like Python. If all Python threads are blocked waiting for long-running executions to complete, we'll be sitting on these buffered messages and the dequeue counts will increase because of the expirations. It's also not possible for these messages to be load balanced to other instances. It would be good to understand why this change had such an impact to see whether it was a coincidence or whether it really is an impactful change.
I would agree if the platform were able to adjust per-instance concurrency dynamically, but that's just not the case today. I worry that if we don't document these values then we'll be in trouble because of how critically important this information is to blocking Python workloads. |
I'm already working on a documentation update that will show the alternative defaults in this page: https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-bindings#durable-functions-1x As for @cgillum's concern over PRv4's larger control-queue buffer size, I have two comments. |
@cgillum, regarding your first concern, I beleive we have this follow up issue that would address those concerns: #1700. @davidmrdavid, is that in scope for v2.4.2. I thought I remember discussing it being so, but I realize its not being tracked currently that way. As for documentation, I am imagining that for each of these performance settings, we will now have a matrix of factors contributing to defaults (programming language by app-service-plan). We could definitely document that, but we would likely need to restructure how we document defaults today. |
@ConnorMcMahon, As for whether OOProc performance tuning is in the scope of 2.4.2: I actually remember discussing having OOProc defaults not be a part of 2.4.2. Otherwise I would have already opened a PR for that already. If you feel strongly about having this in the next release, I would need to have until next Wednesday to reasonably test PL-aware defaults 😄 . That being said, I would prefer `to give OOProc performance tuning enough time to do a proper exploration of it. |
Probably a miscommunication on my part. In that case, let's push it to 2.5.0, but go with the safer defaults for out-of-proc (i.e. 32 instead of 96). |
Ah, good point! I forgot that we were defaulting to 256 (for some reason I mistakenly thought the default was already 32)! As long as we're improving the status quo, I have no objections. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small suggestions and then I think we are good to go here.
src/WebJobs.Extensions.DurableTask/Options/AzureStorageOptions.cs
Outdated
Show resolved
Hide resolved
src/WebJobs.Extensions.DurableTask/AzureStorageDurabilityProviderFactory.cs
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
) These changes incorporate the changes from #1706 and register IPlatformInformationService in AddDurableClientFactory(). Without registering IPlatformInformationService , creating a DurableClient with Azure Storage fails because it can't find an implementation of IPlatformInformationService in AzureStorageDurabilityProviderFactory. This PR also includes a test in DurableClientBaseTests to help us catch if any new services need to be registered in AddDurableClientFactory() in the future.
Addresses: #1646
This PR applies the new concurrency defaults outlined in the issue above. This all done in the
AzureStorageDurabilityProviderFactory
, where defaults are explicitly overridden after inspecting if the application is running in the consumption plan.This PR also introduces a new interface:
IPlatformInformationService
. This interface serves to abstract over the minutia of inspecting environment vars to determine the underlying app service plan and instead exposes self-descriptive methods likeInLinuxConsumption
andInWindowsConsumption
, etc. In the future, I hope this interface will also provide info about the user-facing PL, and other information that can help us guide optimizations and defaults.Currently, this interface contains only 1 implementation:
DefaultPlatformInformationProvider
. It uses the sameINameResolver
injected by DI to look up environment variables and config setting values. ThisDefaultPlatformInformationProvider
is also injected via DI to theDurableTaskExtension
.The rest of changes are all in the tests. Since the
DurableTaskExtension
constructor now takes an extra argument, a bunch of tests had to be refactored to account for the extra parameter. The guiding strategy here was to creater a newTestHelper
which provides aMoq.Mock
of theIPlatformInformationService
interface. With it, we can easily construct a dummy instance of the interface for platform-agnostic tests. Additionally, this mock is parametrizable and would allow for platform-specific tests moving forward! 🚀Some Open Questions
IPlatformInformationService
really the best descriptor? I want this interface to expose info about the OS, underlying SKU, and, in the future, also the user's PL. Alternatives could be "ApplicationContext", "UserContext", etc.Remaining ToDos
Issue describing the changes in this PR
resolves #1646
Pull request checklist
pending_docs.md
release_notes.md