
Apply new consumption defaults in AzureStorageDurabilityProvider #1706

Merged: 25 commits merged into dev on Mar 12, 2021

Conversation

@davidmrdavid (Contributor) commented Mar 3, 2021

Addresses: #1646

This PR applies the new concurrency defaults outlined in the issue above. This is all done in the AzureStorageDurabilityProviderFactory, where defaults are explicitly overridden after checking whether the application is running on the Consumption plan.
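
To make the approach concrete, here's a minimal sketch of where that override logic could live; the method name, gating logic, and the numeric values are illustrative placeholders, not the exact code or defaults this PR ships:

```csharp
// Illustrative sketch only (not the PR's exact code): the factory asks
// IPlatformInformationService which plan the app runs on and, if it is
// Consumption, applies tighter concurrency defaults. Numbers are placeholders.
internal static void ApplyConsumptionDefaults(
    DurableTaskOptions options,
    AzureStorageOptions azureStorageOptions,
    IPlatformInformationService platformInfo)
{
    if (platformInfo.InLinuxConsumption() || platformInfo.InWindowsConsumption())
    {
        // Placeholder values; the real code presumably also avoids clobbering
        // settings the user configured explicitly.
        options.MaxConcurrentActivityFunctions = 10;
        options.MaxConcurrentOrchestratorFunctions = 5;
        azureStorageOptions.ControlQueueBufferThreshold = 32;
    }
}
```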

This PR also introduces a new interface: IPlatformInformationService. This interface abstracts over the minutiae of inspecting environment variables to determine the underlying App Service plan, and instead exposes self-descriptive methods such as InLinuxConsumption, InWindowsConsumption, etc. In the future, I hope this interface will also provide info about the user-facing PL and other information that can help us guide optimizations and defaults.
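
For reference, a rough sketch of the interface shape described above; only the two methods named in this description are grounded in the PR, anything else is a hypothetical extension:

```csharp
// Rough shape of IPlatformInformationService as described in this PR.
public interface IPlatformInformationService
{
    bool InLinuxConsumption();
    bool InWindowsConsumption();

    // Hypothetical convenience helper: true on either Consumption variant.
    bool InConsumption();
}
```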

Currently, this interface has only one implementation: DefaultPlatformInformationProvider. It uses the same INameResolver injected by DI to look up environment variables and config setting values. This DefaultPlatformInformationProvider is also injected via DI into the DurableTaskExtension.
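
A minimal sketch of what that implementation could look like, assuming the usual App Service environment variables (WEBSITE_SKU set to "Dynamic" on Windows Consumption, CONTAINER_NAME present on Linux Consumption); the real detection logic in the PR may differ:

```csharp
using System;
using Microsoft.Azure.WebJobs;

// Illustrative only: the environment-variable checks below are assumptions
// about how the hosting plan is detected, not necessarily the PR's exact logic.
public class DefaultPlatformInformationProvider : IPlatformInformationService
{
    private readonly INameResolver nameResolver;

    public DefaultPlatformInformationProvider(INameResolver nameResolver)
    {
        this.nameResolver = nameResolver;
    }

    public bool InWindowsConsumption()
    {
        // Assumed signal: the Windows Consumption SKU reports "Dynamic".
        string sku = this.nameResolver.Resolve("WEBSITE_SKU");
        return string.Equals(sku, "Dynamic", StringComparison.OrdinalIgnoreCase);
    }

    public bool InLinuxConsumption()
    {
        // Assumed signal: Linux Consumption workers run in containers that expose CONTAINER_NAME.
        return !string.IsNullOrEmpty(this.nameResolver.Resolve("CONTAINER_NAME"));
    }

    public bool InConsumption() => this.InWindowsConsumption() || this.InLinuxConsumption();
}
```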

The rest of the changes are all in the tests. Since the DurableTaskExtension constructor now takes an extra argument, a bunch of tests had to be refactored to account for the extra parameter. The guiding strategy here was to create a new TestHelper which provides a Moq.Mock of the IPlatformInformationService interface. With it, we can easily construct a dummy instance of the interface for platform-agnostic tests. Additionally, this mock is parameterizable and would allow for platform-specific tests moving forward! 🚀
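
As an example of the kind of helper meant here, a hypothetical TestHelpers addition along these lines (the class name, method name, and parameters are made up for illustration):

```csharp
using Moq;

// Hypothetical test helper: builds a parameterizable mock so tests can simulate
// running on (or off) the Consumption plan without touching real environment variables.
public static class PlatformInformationTestHelpers
{
    public static IPlatformInformationService GetPlatformInformationService(
        bool inLinuxConsumption = false,
        bool inWindowsConsumption = false)
    {
        var mock = new Mock<IPlatformInformationService>();
        mock.Setup(p => p.InLinuxConsumption()).Returns(inLinuxConsumption);
        mock.Setup(p => p.InWindowsConsumption()).Returns(inWindowsConsumption);
        return mock.Object;
    }
}
```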

Some Open Questions

  • Is the "PlatformInformation" portion of IPlatformInformationService really the best descriptor? I want this interface to expose info about the OS, underlying SKU, and, in the future, also the user's PL. Alternatives could be "ApplicationContext", "UserContext", etc.

Remaining ToDos

  • Potentially modify the upcoming backends to account for this change
  • Add one or two new tests
  • Test this on a few OS+SKU combinations
  • Complete PR checklist

Issue describing the changes in this PR

resolves #1646

Pull request checklist

  • My changes do not require documentation changes
    • Otherwise: Documentation PR is ready to merge and referenced in pending_docs.md
  • My changes should not be added to the release notes for the next release
    • Otherwise: I've added my notes to release_notes.md
  • My changes do not need to be backported to a previous version
    • Otherwise: Backport tracked by issue/PR #issue_or_pr
  • I have added all required tests (Unit tests, E2E tests)

@davidmrdavid requested review from cgillum, amdeel, ConnorMcMahon and bachuv and removed request for cgillum and amdeel on March 3, 2021 01:27
@davidmrdavid marked this pull request as ready for review on March 4, 2021 17:53
@davidmrdavid (Contributor, Author)

I'm asking for a review now to make sure we feel comfortable with the current approach. I do realize there are still a few leftovers, which I list above :). Thanks, everyone!

@ConnorMcMahon (Contributor) left a comment

Overall, good initial stab at this.

I propose a more pattern-matchy approach to platform information that may or may not be a good idea, along with some specific nits about the new defaults.

@davidmrdavid (Contributor, Author) commented Mar 9, 2021

Performance data for Baseline (Durable-Extension v.2.4.1)

Note: The Duration column is calculated as the time difference between the orchestrator's FunctionStarting and FunctionCompleted events.

Time to completion

| Benchmark ID | Description | Duration |
|---|---|---|
| A | Fanning over 1k activities | 20s |
| B | Fanning over 10k activities | ~3min |
| C | Fanning over 100 sub-orchestrators w/ 100 sequential activities each | ~7min |
| D | Fanning over 100 sub-orchestrators fanning over 10k activities each | ~13h and 28min |

InstanceIDs to review raw data

| Benchmark ID | InstanceID |
|---|---|
| A | abea706d76c045f890398a81c6ed807a |
| B | 463638f2003e4243aa256736849ea4e5 |
| C | 69e8ce5f3acd406b84641ab67ac574e2 |
| D | d901db064ab4458ab8cf8c2a11bd7526 |

@davidmrdavid (Contributor, Author) commented Mar 9, 2021

Performance data for commit 7a22217 in this PR / PRv1

Note: The Duration column is calculated as the time difference between the orchestrator's FunctionStarting and FunctionCompleted events.

Time to completion

| Benchmark ID | Description | Duration |
|---|---|---|
| A | Fanning over 1k activities | ~37s |
| B | Fanning over 10k activities | ~15min |
| C | Fanning over 100 sub-orchestrators w/ 100 sequential activities each | ~2min 30s |
| D | Fanning over 100 sub-orchestrators fanning over 10k activities each | ~7h |

InstanceIDs to review raw data

| Benchmark ID | InstanceID |
|---|---|
| A | 71f1fec6ee0f425c865d936ce5e09782 |
| B | 7ca022b927c443f9823d0998fecf499e |
| C | e6de2507482747c7a79d6ac9f1d682b0 |
| D | 3a56f220f08841e0845b2e2a937779d2 |

Update: these results have been updated. Previously, I had listed the duration for benchmark B as 5 minutes, but it was actually 15; I had just made a typo. Whoops!

@davidmrdavid (Contributor, Author)

I also just updated the numbers for benchmark D for PRv2 and PRv3 above ^. Before this, they were TBD.

@davidmrdavid (Contributor, Author)

Performance data for PRv4
In this experiment, we keep the same config as in PRv3 but increase the control-queue buffer threshold:

 this.azureStorageOptions.ControlQueueBufferThreshold = 96; // was 32 in PRv1, PRv2, and PRv3

Note: The Duration column is calculated as the time difference between the orchestrator's FunctionStarting and FunctionCompleted events.

Time to completion

| Benchmark ID | Description | Duration |
|---|---|---|
| A | Fanning over 1k activities | 11s |
| B | Fanning over 10k activities | ~6min |
| C | Fanning over 100 sub-orchestrators w/ 100 sequential activities each | ~3min |
| D | Fanning over 100 sub-orchestrators fanning over 10k activities each | ~3h 12min |

InstanceIDs to review raw data

| Benchmark ID | InstanceID |
|---|---|
| A | 6bd8a138a1d647e8b31d077faeb8b696 |
| B | d0e223b60a584179b450fef8ba9c4662 |
| C | 6004a761ca4044a99102cae87c8f5d8c |
| D | e07a9f848db443d88269a59db40c0731 |

Comments: This seems comparable to the baseline in benchmarks A, B, and C, but it's 4 times faster than the baseline for benchmark D: a drop from roughly 13 hours to just over 3! It is also faster on benchmark D than PRv1, PRv2, and PRv3, which average about 8 hours. Finally, this version is also faster than PRv1, PRv2, and PRv3 on benchmark B by a factor of 3. I think this is our best configuration so far.

@ConnorMcMahon (Contributor)

We may also need to update our documentation regarding defaults.

That may be tricky if we soon end up with multiple variables playing into these defaults. Maybe we don't document the default values, and say we make a best effort based on platform/language to select intelligent defaults? @cgillum, thoughts about how we document this?

@cgillum (Member) commented Mar 10, 2021

In this experiment, we keep the same config as in PRv3 but increase the control-queue buffer threshold to 96

I worry about this change for languages like Python. If all Python threads are blocked waiting for long-running executions to complete, we'll be sitting on these buffered messages and the dequeue counts will increase because of the expirations. It's also not possible for these messages to be load balanced to other instances. It would be good to understand why this change had such an impact to see whether it was a coincidence or whether it really is an impactful change.

Maybe we don't document the default values, and say we make a best effort based on platform/language to select intelligent defaults?

I would agree if the platform were able to adjust per-instance concurrency dynamically, but that's just not the case today. I worry that if we don't document these values then we'll be in trouble because of how critically important this information is to blocking Python workloads.

@davidmrdavid (Contributor, Author) commented Mar 10, 2021

I'm already working on a documentation update that will show the alternative defaults on this page: https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-bindings#durable-functions-1x

As for @cgillum's concern over PRv4's larger control-queue buffer size, I have two comments.
First, the current global default is 256, so irrespective of us choosing 32 (PRv1, PRv2, PRv3) or 96 (PRv4), we'd be improving the status quo for Python one way or another by lowering the current default size.
Second, I'm happy to go with 32 just for safety's sake, and we can revisit this configuration once we perform OOProc-specific experiments. I just wanted to post PRv4's results anyway since they were so dramatic 😄

@ConnorMcMahon (Contributor)

@cgillum, regarding your first concern, I believe we have this follow-up issue that would address those concerns: #1700. @davidmrdavid, is that in scope for v2.4.2? I thought I remembered discussing it being so, but I realize it's not currently being tracked that way.

As for documentation, I am imagining that for each of these performance settings, we will now have a matrix of factors contributing to defaults (programming language by app-service-plan). We could definitely document that, but we would likely need to restructure how we document defaults today.

@davidmrdavid (Contributor, Author)

@ConnorMcMahon,
with respect to docs: I also think we will need a restructuring of our docs once we have PL-aware defaults. However, for this PR alone, I don't think that's necessary.

As for whether OOProc performance tuning is in the scope of 2.4.2: I actually remember discussing having OOProc defaults not be a part of 2.4.2; otherwise, I would have opened a PR for that already. If you feel strongly about having this in the next release, I would need until next Wednesday to reasonably test PL-aware defaults 😄. That being said, I would prefer to give OOProc performance tuning enough time for a proper exploration.

@ConnorMcMahon (Contributor)

Probably a miscommunication on my part. In that case, let's push it to 2.5.0, but go with the safer defaults for out-of-proc (i.e. 32 instead of 96).

@cgillum (Member) commented Mar 10, 2021

the current global default is 256, so irrespective of us choosing 32 (PRv1, PRv2, PRv3) or 96 (PRv4), we'd be improving the status quo for Python one way or another by lowering the current default size.

Ah, good point! I forgot that we were defaulting to 256 (for some reason I mistakenly thought the default was already 32)! As long as we're improving the status quo, I have no objections. :)

@ConnorMcMahon (Contributor) left a comment

A few small suggestions and then I think we are good to go here.

@ConnorMcMahon (Contributor) left a comment

LGTM!

@davidmrdavid merged commit 85d0642 into dev on Mar 12, 2021
@davidmrdavid deleted the dajusto/new-consumption-defaults branch on March 12, 2021 00:44
bachuv added a commit that referenced this pull request Mar 13, 2021

These changes incorporate the changes from #1706 and register IPlatformInformationService in AddDurableClientFactory(). Without registering IPlatformInformationService, creating a DurableClient with Azure Storage fails because it can't find an implementation of IPlatformInformationService in AzureStorageDurabilityProviderFactory. This PR also includes a test in DurableClientBaseTests to help us catch whether any new services need to be registered in AddDurableClientFactory() in the future.
Development

Successfully merging this pull request may close these issues.

Reliability: Improve concurrency defaults for Consumption plan hosting
3 participants