
Durable Functions (Fan Out) + Azure Storage = Host Threshold Exceeded (Connections) #389

Closed
oluatte opened this issue Jul 13, 2018 · 15 comments

oluatte commented Jul 13, 2018

Hey folks,

Thanks for all your hard work on durable functions. I've used them on multiple projects and I'm a big fan.

On a recent project, we ran into an issue that occurs with durable functions that interact with azure storage. For example:

  • Durable orchestrator receives a batch of event data to save.
  • Durable orchestrator fans out activity functions to save event data as small blobs in storage
  • At an interval, another Durable orchestrator fans out activity functions to combine the small blobs into one large blob for upload to another data store e.g. data lake.
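For illustration, the fan-out step above can be sketched in plain Python with asyncio. This is a simulation of the pattern, not the actual Durable Functions API; `save_blob` is a hypothetical activity standing in for the real storage write.

```python
import asyncio

# Hypothetical activity function: writes one event as a small blob.
async def save_blob(event: dict) -> str:
    await asyncio.sleep(0)  # stand-in for the actual storage I/O
    return f"blob-{event['id']}"

# Fan-out: schedule one activity per event, then await all of them.
async def orchestrator(batch: list) -> list:
    tasks = [save_blob(event) for event in batch]
    return await asyncio.gather(*tasks)

results = asyncio.run(orchestrator([{"id": i} for i in range(3)]))
print(results)  # ['blob-0', 'blob-1', 'blob-2']
```

In the real framework, each of those fanned-out activities can trigger its own storage traffic, which is what makes the connection count scale with batch size.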

This setup causes the "Host Threshold Exceeded (Connections)" error very quickly. I was able to monitor the outbound connections while running the function app locally and there are too many connections to storage (queues, tables, blobs). I even commented out a bunch of our code and there were 120+ outbound connections to storage within 60 seconds (I have the IPs available) and it only kept increasing.

I think there is an issue here. Either the outbound connections from the durable framework shouldn't count against the sandbox limits or the durable framework should minimize the number of outbound connections it makes. Taking up so many of the available connections really limits what is left over for the actual logic of the function.

Please note that we followed all the connection management recommendations (lazy initialization of static clients, etc.).
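For reference, the lazy static client pattern mentioned above looks roughly like this in Python; `StorageClient` is a hypothetical stand-in for an Azure Storage SDK client.

```python
import threading

class StorageClient:
    """Hypothetical stand-in for an Azure Storage SDK client."""
    def __init__(self, connection_string: str):
        self.connection_string = connection_string

_client = None
_lock = threading.Lock()

def get_client() -> StorageClient:
    # Lazily create one shared client; every invocation reuses it
    # (and therefore its underlying connections).
    global _client
    if _client is None:
        with _lock:
            if _client is None:  # double-checked under the lock
                _client = StorageClient("UseDevelopmentStorage=true")
    return _client

print(get_client() is get_client())  # True
```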

Some details about the function app:

  • Function Runtime: 2.0.11888.0
  • Durable Extension Version: 1.5.0
  • Region: East US 2
  • Invocation Id: 4c91d7e6-a206-4997-919d-b9555ce9b2c8
cgillum (Member) commented Jul 13, 2018

Very interesting. That's definitely not expected and I've not seen this kind of issue with Durable Functions running in our Consumption plan yet.

Just to make sure I'm understanding correctly: you're pretty confident that nothing in your code could be causing these connections to build up, right?

I'll need to see if I can reproduce this locally - but I agree that using so many connections seems excessive. Under the hood we're just using the Azure Storage client SDK and caching client objects, so I wouldn't expect there to be any connection leaks.

cgillum added the Performance and Needs: Investigation 🔍 labels on Jul 13, 2018
cgillum (Member) commented Jul 13, 2018

I ran a quick load test on my local box and sure enough, I see the number of established TCP connections spike up to almost 200. In my case, it wasn't a fan-out/fan-in, but rather a large number of sequential orchestrations running concurrently.

As a potential workaround, could you try limiting the concurrency using the concurrency throttles?
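(For anyone finding this later: these throttles are configured in host.json under the durableTask section; the values below are purely illustrative, not recommendations.)

```json
{
  "durableTask": {
    "maxConcurrentActivityFunctions": 10,
    "maxConcurrentOrchestratorFunctions": 10
  }
}
```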

oluatte (Author) commented Jul 13, 2018

Yeah, it's an interesting issue for sure. I've used functions (durable and non-durable) for high-throughput scenarios on several projects, and trying not to run up against sandbox limits while also going as fast as possible is a battle. We end up needing every connection we can get.

I don't know for sure how the storage client works behind the scenes, and I'm not an expert in this area at all. Here's my wild theory.

From what I've read (SO, GitHub issues, etc.), the storage client doesn't do any explicit connection pooling/management like some of the other clients (Service Bus, for example). My understanding is that it uses transient connections (open, use, try to close), but some of these "closed" connections actually end up in TIME_WAIT for up to 4 minutes. This behavior is all well and good in a WebJob or an App Service plan, but you can bump up against the limit quickly in a Consumption plan.
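As a general illustration of the difference (plain Python stdlib against a throwaway local server, not the Azure Storage SDK): a keep-alive connection reuses one TCP socket across requests, whereas opening a transient connection per request leaves one socket (and eventually a TIME_WAIT entry) behind for each one.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # enables keep-alive

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):  # silence per-request logging
        pass

# Throwaway local server on an ephemeral port.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# One persistent connection reused for several requests: a single TCP
# socket, instead of one transient socket per request.
conn = http.client.HTTPConnection("127.0.0.1", port)
responses = []
for _ in range(3):
    conn.request("GET", "/")
    responses.append(conn.getresponse().read())
conn.close()
server.shutdown()
print(responses)  # [b'ok', b'ok', b'ok']
```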

oluatte (Author) commented Jul 13, 2018

@cgillum I am glad you were able to reproduce the issue. Thanks for looking at it so quickly.

The concurrency throttles look like they might help (thanks for the link, I had not seen these). Just so I understand the effect specific throttles will have in this scenario...

  • Setting "maxConcurrentActivityFunctions" to a lower number will reduce the number of instances on a single VM. Will this make the scale controller create more instances on other VMs, which should keep overall throughput high?
  • The documentation mentions that setting "extendedSessionsEnabled" to true can reduce the number of interactions with storage (which seems good) but that has a detrimental impact on overall throughput.

Which of these throttles (or is it a combination) do you recommend setting for this scenario?

cgillum (Member) commented Jul 21, 2018

Sorry for the delay - somehow I missed this notification for your followup questions (and only saw this when discussing this topic with a colleague).

Without knowing the true source of the excessive connections, it's hard to make a recommendation. In fact, one thing that I will need to figure out is whether it is the Durable extension which is responsible for these, or if it could potentially be something in the functions host. I may need to look at a memory dump to figure this out (shivers).

Getting to your questions:

  • Yes, I expect lowering maxConcurrentActivityFunctions will result in more VMs because the scale controller will detect that it needs more VMs to keep the queue latencies small.
  • I definitely recommend extendedSessionsEnabled if you are doing fan-out/fan-in. It could significantly reduce the number of table storage queries.

oluatte (Author) commented Jul 23, 2018

@cgillum

No worries, thank you for the response. We will tweak the knobs and see how it goes.

One more thing I noticed during the week that may or may not help. Below is a screenshot of metrics from the storage account that backs the function app. These numbers are for a 24-hour period, and there were 7K function executions in that time. The ratio of storage accesses to function executions seems high (especially for queues).

[screenshot: storage account metrics (blob, queue, and table transactions) over the 24-hour period]

Thanks again for your help.

PS. I hope this doesn't go down to memory dumps (shivers on your behalf). Good luck :)

cgillum (Member) commented Jul 23, 2018

Thanks for this data. I'm thinking there are a few levels of throttles we need to consider to reduce the load on storage, especially when things are relatively idle (related issue: #391).

cgillum added the dtfx label and removed the Needs: Investigation 🔍 label on Aug 3, 2018
cgillum (Member) commented Aug 3, 2018

Just an update on this - I've created a fix in DTFx that does two things:

  1. Throttles the concurrency of internal storage operations. This appears to be the cause of the TCP connection spikes, based on some private testing I did. It will primarily help with fan-out, fan-in scenarios.
  2. Increases the max polling delay from 10 seconds to 30 seconds when the functions host is running (scale controller operations are unchanged). This may reduce some of the queue operations, but generally speaking I expect there will be several queue operations for every activity function call.
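Conceptually, the throttling in item 1 amounts to putting a semaphore around internal storage calls. A minimal sketch in Python (the actual fix lives in DTFx; the cap and the operation are illustrative):

```python
import asyncio

MAX_STORAGE_REQUESTS = 5  # illustrative cap, not the value DTFx uses
peak = 0
in_flight = 0

async def storage_operation(throttle: asyncio.Semaphore, i: int) -> int:
    # Stand-in for one internal queue/table/blob request.
    global peak, in_flight
    async with throttle:  # at most MAX_STORAGE_REQUESTS run concurrently
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
    return i

async def main() -> None:
    throttle = asyncio.Semaphore(MAX_STORAGE_REQUESTS)
    await asyncio.gather(*(storage_operation(throttle, i) for i in range(50)))

asyncio.run(main())
print(peak)  # never exceeds MAX_STORAGE_REQUESTS
```

Capping the number of in-flight storage requests bounds the number of simultaneous TCP connections those requests can demand, which is what tames the spikes.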

The improvements will be included in the next release.

oluatte (Author) commented Aug 3, 2018

@cgillum Thank you for the update! That should help for sure. I'll keep an eye out for the next release.

cgillum (Member) commented Aug 30, 2018

The v1.6.0 release is now available and includes the improvements I mentioned. In addition, we're making changes to the Azure Functions Consumption plan to allow a greater number of concurrent TCP connections (from 300 to 600), though the latter change won't be available for another month or two due to the slower platform release cycles.

IMPORTANT NOTE: The v1.6.0 release is compatible with Azure Functions v1 and with Azure Functions v2 builds v2.0.12050 or later. It is not compatible with earlier builds of Azure Functions v2.

cgillum closed this as completed on Aug 30, 2018
oluatte (Author) commented Aug 30, 2018

@cgillum That is excellent news. Thanks for all the hard work looking into this.

oluatte (Author) commented Oct 26, 2018

@cgillum Hey Chris,

Thanks again for all your hard work on this issue. We really appreciated it on that project. Unfortunately, we're facing this same issue on a different project with even greater volume than the previous one. I wanted to ask whether the increase in the TCP connection limit has been deployed/released, as I still see the host error at ~300 connections. If not, do you have more information about when it might go out?

Also, please let me know if this isn't the right place to ask, or if I should create a new issue and reference this one.

cgillum (Member) commented Oct 26, 2018

Re-opening since we're still seeing some occurrences of this problem.

Over the next two weeks the limit will be automatically increased worldwide from 300 to 600. Hopefully that will help for your current project.

At the same time, we're also implementing a temporary hack to force the Azure Storage SDK to limit the number of outbound TCP connections it opens when talking to the storage service. @brettsam is working on a PR for that here: #486

cgillum reopened this on Oct 26, 2018
oluatte (Author) commented Oct 26, 2018

@cgillum That was a super fast response, thank you.

Glad to hear of the timeline & upcoming changes. Will keep an eye out.

brandonh-msft (Member) commented Oct 30, 2018

@cgillum, @brettsam
I'm not sure this fixes the issue... I pulled the latest DF dev/ bits and ran the script we were given which elicits this behavior... just got this in the Live Stream:
[screenshot: Live Stream showing the Host Threshold Exceeded (Connections) error]

host.json for the run looks like:

      {
        "durableTask": {
          "extendedSessionsEnabled": true,
          "extendedSessionIdleTimeoutInSeconds": 30,
          "maxConcurrentActivityFunctions": 1000,
          "maxConcurrentOrchestratorFunctions": 500
        }
      }
