
Apply new consumption defaults in AzureStorageDurabilityProvider #1706

Merged: 25 commits merged into dev on Mar 12, 2021

Conversation

@davidmrdavid (Contributor) commented Mar 3, 2021

Addresses: #1646

This PR applies the new concurrency defaults outlined in the issue above. This is all done in the AzureStorageDurabilityProviderFactory, where defaults are explicitly overridden after checking whether the application is running on the Consumption plan.
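
To make the approach concrete, here's a minimal sketch of where that override logic could live; the method name, gating logic, and the numeric values are illustrative placeholders, not the exact code or defaults this PR ships:

```csharp
// Illustrative sketch only (not the PR's exact code): the factory asks
// IPlatformInformationService which plan the app runs on and, if it is
// Consumption, applies tighter concurrency defaults. Numbers are placeholders.
internal static void ApplyConsumptionDefaults(
    DurableTaskOptions options,
    AzureStorageOptions azureStorageOptions,
    IPlatformInformationService platformInfo)
{
    if (platformInfo.InLinuxConsumption() || platformInfo.InWindowsConsumption())
    {
        // Placeholder values; the real code presumably also avoids clobbering
        // settings the user configured explicitly.
        options.MaxConcurrentActivityFunctions = 10;
        options.MaxConcurrentOrchestratorFunctions = 5;
        azureStorageOptions.ControlQueueBufferThreshold = 32;
    }
}
```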

This PR also introduces a new interface: IPlatformInformationService. This interface abstracts over the minutiae of inspecting environment variables to determine the underlying App Service plan, and instead exposes self-descriptive methods such as InLinuxConsumption, InWindowsConsumption, etc. In the future, I hope this interface will also provide info about the user-facing PL and other information that can help us guide optimizations and defaults.
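
For reference, a rough sketch of the interface shape described above; only the two methods named in this description are grounded in the PR, anything else is a hypothetical extension:

```csharp
// Rough shape of IPlatformInformationService as described in this PR.
public interface IPlatformInformationService
{
    bool InLinuxConsumption();
    bool InWindowsConsumption();

    // Hypothetical convenience helper: true on either Consumption variant.
    bool InConsumption();
}
```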

Currently, this interface has only one implementation: DefaultPlatformInformationProvider. It uses the same INameResolver injected by DI to look up environment variables and config setting values. This DefaultPlatformInformationProvider is also injected via DI into the DurableTaskExtension.
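
A minimal sketch of what that implementation could look like, assuming the usual App Service environment variables (WEBSITE_SKU set to "Dynamic" on Windows Consumption, CONTAINER_NAME present on Linux Consumption); the real detection logic in the PR may differ:

```csharp
using System;
using Microsoft.Azure.WebJobs;

// Illustrative only: the environment-variable checks below are assumptions
// about how the hosting plan is detected, not necessarily the PR's exact logic.
public class DefaultPlatformInformationProvider : IPlatformInformationService
{
    private readonly INameResolver nameResolver;

    public DefaultPlatformInformationProvider(INameResolver nameResolver)
    {
        this.nameResolver = nameResolver;
    }

    public bool InWindowsConsumption()
    {
        // Assumed signal: the Windows Consumption SKU reports "Dynamic".
        string sku = this.nameResolver.Resolve("WEBSITE_SKU");
        return string.Equals(sku, "Dynamic", StringComparison.OrdinalIgnoreCase);
    }

    public bool InLinuxConsumption()
    {
        // Assumed signal: Linux Consumption workers run in containers that expose CONTAINER_NAME.
        return !string.IsNullOrEmpty(this.nameResolver.Resolve("CONTAINER_NAME"));
    }

    public bool InConsumption() => this.InWindowsConsumption() || this.InLinuxConsumption();
}
```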

The rest of the changes are all in the tests. Since the DurableTaskExtension constructor now takes an extra argument, a bunch of tests had to be refactored to account for the extra parameter. The guiding strategy here was to create a new TestHelper which provides a Moq.Mock of the IPlatformInformationService interface. With it, we can easily construct a dummy instance of the interface for platform-agnostic tests. Additionally, this mock is parameterizable and would allow for platform-specific tests moving forward! 🚀
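
As an example of the kind of helper meant here, a hypothetical TestHelpers addition along these lines (the class name, method name, and parameters are made up for illustration):

```csharp
using Moq;

// Hypothetical test helper: builds a parameterizable mock so tests can simulate
// running on (or off) the Consumption plan without touching real environment variables.
public static class PlatformInformationTestHelpers
{
    public static IPlatformInformationService GetPlatformInformationService(
        bool inLinuxConsumption = false,
        bool inWindowsConsumption = false)
    {
        var mock = new Mock<IPlatformInformationService>();
        mock.Setup(p => p.InLinuxConsumption()).Returns(inLinuxConsumption);
        mock.Setup(p => p.InWindowsConsumption()).Returns(inWindowsConsumption);
        return mock.Object;
    }
}
```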

Some Open Questions

  • Is the "PlatformInformation" portion of IPlatformInformationService really the best descriptor? I want this interface to expose info about the OS, underlying SKU, and, in the future, also the user's PL. Alternatives could be "ApplicationContext", "UserContext", etc.

Remaining ToDos

  • Potentially modify the upcoming backends to account for this change
  • Add one or two new tests
  • Test this on a few OS+SKU combinations
  • Complete PR checklist

Issue describing the changes in this PR

resolves #1646

Pull request checklist

  • My changes do not require documentation changes
    • Otherwise: Documentation PR is ready to merge and referenced in pending_docs.md
  • My changes should not be added to the release notes for the next release
    • Otherwise: I've added my notes to release_notes.md
  • My changes do not need to be backported to a previous version
    • Otherwise: Backport tracked by issue/PR #issue_or_pr
  • I have added all required tests (Unit tests, E2E tests)

@davidmrdavid requested review from cgillum, amdeel, ConnorMcMahon and bachuv and removed request for cgillum and amdeel on March 3, 2021 01:27
@davidmrdavid marked this pull request as ready for review on March 4, 2021 17:53
@davidmrdavid (Contributor, Author)

I'm asking for a review now to make sure we feel comfortable with the current approach. I do realize there are still a few leftovers, which I list above :). Thanks, everyone!

@ConnorMcMahon (Contributor) left a comment

Overall, good initial stab at this.

I propose a more pattern-matchy approach to platform information that may or may not be a good idea, along with some specific nits about the new defaults.

@davidmrdavid (Contributor, Author) commented Mar 9, 2021

Performance data for Baseline (Durable-Extension v.2.4.1)

Note: The Duration column is calculated as the time difference between the orchestrator's FunctionStarting and FunctionCompleted events.

Time to completion

| Benchmark ID | Description | Duration |
|---|---|---|
| A | Fanning over 1k activities | 20s |
| B | Fanning over 10k activities | ~3min |
| C | Fanning over 100 sub-orchestrators w/ 100 sequential activities each | ~7min |
| D | Fanning over 100 sub-orchestrators fanning over 10k activities each | ~13h and 28min |

InstanceIDs to review raw data

| Benchmark ID | InstanceID |
|---|---|
| A | abea706d76c045f890398a81c6ed807a |
| B | 463638f2003e4243aa256736849ea4e5 |
| C | 69e8ce5f3acd406b84641ab67ac574e2 |
| D | d901db064ab4458ab8cf8c2a11bd7526 |

@davidmrdavid (Contributor, Author) commented Mar 9, 2021

Performance data for commit 7a22217 in this PR / PRv1

Note: The Duration column is calculated as the time difference between the orchestrator's FunctionStarting and FunctionCompleted events.

Time to completion

| Benchmark ID | Description | Duration |
|---|---|---|
| A | Fanning over 1k activities | ~37s |
| B | Fanning over 10k activities | ~15min |
| C | Fanning over 100 sub-orchestrators w/ 100 sequential activities each | ~2min 30s |
| D | Fanning over 100 sub-orchestrators fanning over 10k activities each | ~7h |

InstanceIDs to review raw data

| Benchmark ID | InstanceID |
|---|---|
| A | 71f1fec6ee0f425c865d936ce5e09782 |
| B | 7ca022b927c443f9823d0998fecf499e |
| C | e6de2507482747c7a79d6ac9f1d682b0 |
| D | 3a56f220f08841e0845b2e2a937779d2 |

Update: these results have been updated. Previously, I had listed the duration for benchmark B as 5 minutes, but it was actually 15; I had just made a typo. Whoops!

@davidmrdavid (Contributor, Author)

I also just updated the numbers for benchmark D for PRv2 and PRv3 above ^. Before this, they were TBD.

@davidmrdavid (Contributor, Author)

Performance data for PRv4
In this experiment, we keep the same config as in PRv3 but increase the control-queue buffer threshold:

 this.azureStorageOptions.ControlQueueBufferThreshold = 96; // was 32 in PRv1, PRv2, and PRv3

Note: The Duration column is calculated as the time difference between the orchestrator's FunctionStarting and FunctionCompleted events.

Time to completion

| Benchmark ID | Description | Duration |
|---|---|---|
| A | Fanning over 1k activities | 11s |
| B | Fanning over 10k activities | ~6min |
| C | Fanning over 100 sub-orchestrators w/ 100 sequential activities each | ~3min |
| D | Fanning over 100 sub-orchestrators fanning over 10k activities each | ~3h 12min |

InstanceIDs to review raw data

| Benchmark ID | InstanceID |
|---|---|
| A | 6bd8a138a1d647e8b31d077faeb8b696 |
| B | d0e223b60a584179b450fef8ba9c4662 |
| C | 6004a761ca4044a99102cae87c8f5d8c |
| D | e07a9f848db443d88269a59db40c0731 |

Comments: This seems comparable to the baseline in benchmarks A, B, and C, but it's 4 times faster than the baseline for benchmark D: a drop from roughly 13 hours to just over 3! It is also faster on benchmark D than PRv1, PRv2, and PRv3, which average about 8 hours. Finally, this version is also faster than PRv1, PRv2, and PRv3 on benchmark B by a factor of 3. I think this is our best configuration so far.

@ConnorMcMahon (Contributor)

We may also need to update our documentation regarding defaults.

That may be tricky if we soon end up with multiple variables playing into these defaults. Maybe we don't document the default values, and say we make a best effort based on platform/language to select intelligent defaults? @cgillum, thoughts about how we document this?

@cgillum (Member) commented Mar 10, 2021

In this experiment, we keep the same config as in PRv3 but increase the control-queue buffer threshold to 96

I worry about this change for languages like Python. If all Python threads are blocked waiting for long-running executions to complete, we'll be sitting on these buffered messages and the dequeue counts will increase because of the expirations. It's also not possible for these messages to be load balanced to other instances. It would be good to understand why this change had such an impact to see whether it was a coincidence or whether it really is an impactful change.

Maybe we don't document the default values, and say we make a best effort based on platform/language to select intelligent defaults?

I would agree if the platform were able to adjust per-instance concurrency dynamically, but that's just not the case today. I worry that if we don't document these values then we'll be in trouble because of how critically important this information is to blocking Python workloads.

@davidmrdavid (Contributor, Author) commented Mar 10, 2021

I'm already working on a documentation update that will show the alternative defaults on this page: https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-bindings#durable-functions-1x

As for @cgillum's concern over PRv4's larger control-queue buffer size, I have two comments.
First, the current global default is 256, so irrespective of us choosing 32 (PRv1, PRv2, PRv3) or 96 (PRv4), we'd be improving the status quo for Python one way or another by lowering the current default size.
Second, I'm happy to go with 32 just for safety's sake, and we can revisit this configuration once we perform OOProc-specific experiments. I just wanted to post PRv4's results anyway since they were so dramatic 😄

@ConnorMcMahon (Contributor)

@cgillum, regarding your first concern, I believe we have this follow-up issue that would address those concerns: #1700. @davidmrdavid, is that in scope for v2.4.2? I thought I remembered discussing it being so, but I realize it's not currently being tracked that way.

As for documentation, I am imagining that for each of these performance settings, we will now have a matrix of factors contributing to defaults (programming language by app-service-plan). We could definitely document that, but we would likely need to restructure how we document defaults today.

@davidmrdavid (Contributor, Author)

@ConnorMcMahon,
with respect to docs: I also think we will need a restructuring of our docs once we have PL-aware defaults. However, for this PR alone, I don't think that's necessary.

As for whether OOProc performance tuning is in the scope of 2.4.2: I actually remember discussing having OOProc defaults not be a part of 2.4.2; otherwise, I would have opened a PR for that already. If you feel strongly about having this in the next release, I would need until next Wednesday to reasonably test PL-aware defaults 😄. That being said, I would prefer to give OOProc performance tuning enough time for a proper exploration.

@ConnorMcMahon (Contributor)

Probably a miscommunication on my part. In that case, let's push it to 2.5.0, but go with the safer defaults for out-of-proc (i.e. 32 instead of 96).

@cgillum (Member) commented Mar 10, 2021

the current global default is 256, so irrespective of us choosing 32 (PRv1, PRv2, PRv3) or 96 (PRv4), we'd be improving the status quo for Python one way or another by lowering the current default size.

Ah, good point! I forgot that we were defaulting to 256 (for some reason I mistakenly thought the default was already 32)! As long as we're improving the status quo, I have no objections. :)

@ConnorMcMahon (Contributor) left a comment

A few small suggestions and then I think we are good to go here.

@ConnorMcMahon (Contributor) left a comment

LGTM!

@davidmrdavid merged commit 85d0642 into dev on Mar 12, 2021
@davidmrdavid deleted the dajusto/new-consumption-defaults branch on March 12, 2021 00:44
bachuv added a commit that referenced this pull request Mar 13, 2021

These changes incorporate the changes from #1706 and register IPlatformInformationService in AddDurableClientFactory(). Without registering IPlatformInformationService, creating a DurableClient with Azure Storage fails because it can't find an implementation of IPlatformInformationService in AzureStorageDurabilityProviderFactory. This PR also includes a test in DurableClientBaseTests to help us catch whether any new services need to be registered in AddDurableClientFactory() in the future.
Development

Successfully merging this pull request may close these issues.

Reliability: Improve concurrency defaults for Consumption plan hosting
3 participants