
Durable Functions (Fan Out) + Azure Storage = Host Threshold Exceeded (Connections) #389

Closed
oluatte opened this issue Jul 13, 2018 · 15 comments

oluatte commented Jul 13, 2018

Hey folks,

Thanks for all your hard work on durable functions. I've used them on multiple projects and I'm a big fan.

On a recent project, we ran into an issue that occurs with durable functions that interact with azure storage. For example:

  • Durable orchestrator receives a batch of event data to save.
  • Durable orchestrator fans out activity functions to save event data as small blobs in storage
  • At an interval, another Durable orchestrator fans out activity functions to combine the small blobs into one large blob for upload to another data store e.g. data lake.
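For illustration, the fan-out step above can be sketched in plain Python with asyncio. This is a simulation of the pattern, not the actual Durable Functions API; `save_blob` is a hypothetical activity standing in for the real storage write.

```python
import asyncio

# Hypothetical activity function: writes one event as a small blob.
async def save_blob(event: dict) -> str:
    await asyncio.sleep(0)  # stand-in for the actual storage I/O
    return f"blob-{event['id']}"

# Fan-out: schedule one activity per event, then await all of them.
async def orchestrator(batch: list) -> list:
    tasks = [save_blob(event) for event in batch]
    return await asyncio.gather(*tasks)

results = asyncio.run(orchestrator([{"id": i} for i in range(3)]))
print(results)  # ['blob-0', 'blob-1', 'blob-2']
```

In the real framework, each of those fanned-out activities can trigger its own storage traffic, which is what makes the connection count scale with batch size.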

This setup causes the "Host Threshold Exceeded (Connections)" error very quickly. I was able to monitor the outbound connections while running the function app locally and there are too many connections to storage (queues, tables, blobs). I even commented out a bunch of our code and there were 120+ outbound connections to storage within 60 seconds (I have the IPs available) and it only kept increasing.

I think there is an issue here. Either the outbound connections from the durable framework shouldn't count against the sandbox limits or the durable framework should minimize the number of outbound connections it makes. Taking up so many of the available connections really limits what is left over for the actual logic of the function.

Please note that we followed all the connection management recommendations (lazy initialization of static clients, etc.).
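For reference, the lazy static client pattern mentioned above looks roughly like this in Python; `StorageClient` is a hypothetical stand-in for an Azure Storage SDK client.

```python
import threading

class StorageClient:
    """Hypothetical stand-in for an Azure Storage SDK client."""
    def __init__(self, connection_string: str):
        self.connection_string = connection_string

_client = None
_lock = threading.Lock()

def get_client() -> StorageClient:
    # Lazily create one shared client; every invocation reuses it
    # (and therefore its underlying connections).
    global _client
    if _client is None:
        with _lock:
            if _client is None:  # double-checked under the lock
                _client = StorageClient("UseDevelopmentStorage=true")
    return _client

print(get_client() is get_client())  # True
```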

Some details about the function app:

  • Function Runtime: 2.0.11888.0
  • Durable Extension Version: 1.5.0
  • Region: East US 2
  • Invocation Id: 4c91d7e6-a206-4997-919d-b9555ce9b2c8
cgillum (Member) commented Jul 13, 2018

Very interesting. That's definitely not expected and I've not seen this kind of issue with Durable Functions running in our Consumption plan yet.

Just to make sure I'm understanding correctly: you're pretty confident that nothing in your code could be causing these connections to build up, right?

I'll need to see if I can reproduce this locally - but I agree that using so many connections seems excessive. Under the hood we're just using the Azure Storage client SDK and caching client objects, so I wouldn't expect there to be any connection leaks.

cgillum added the Performance and Needs: Investigation 🔍 labels on Jul 13, 2018
cgillum (Member) commented Jul 13, 2018

I ran a quick load test on my local box and sure enough, I see the number of established TCP connections spike up to almost 200. In my case, it wasn't a fan-out/fan-in, but rather a large number of sequential orchestrations running concurrently.

As a potential workaround, could you try limiting the concurrency using the concurrency throttles?
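(For anyone finding this later: these throttles are configured in host.json under the durableTask section; the values below are purely illustrative, not recommendations.)

```json
{
  "durableTask": {
    "maxConcurrentActivityFunctions": 10,
    "maxConcurrentOrchestratorFunctions": 10
  }
}
```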

oluatte (Author) commented Jul 13, 2018

Yeah, it's an interesting issue for sure. I've used functions (durable and non-durable) for high-throughput scenarios on several projects, and trying not to run up against sandbox limits while also going as fast as possible is a battle. We end up needing every connection we can get.

I don't know for sure how the storage client works behind the scenes, and I'm not an expert in this area at all. Here's my wild theory.

From what I've read (SO, GitHub issues, etc.), the storage client doesn't do any explicit connection pooling/management like some of the other clients (Service Bus, for example). My understanding is that it uses transient connections (open, use, try to close), but some of these "closed" connections actually end up in TIME_WAIT for up to 4 minutes. This behavior is all well and good in a WebJob or an App Service plan, but you can bump up against the limit quickly in a Consumption plan.
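As a general illustration of the difference (plain Python stdlib against a throwaway local server, not the Azure Storage SDK): a keep-alive connection reuses one TCP socket across requests, whereas opening a transient connection per request leaves one socket (and eventually a TIME_WAIT entry) behind for each one.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # enables keep-alive

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):  # silence per-request logging
        pass

# Throwaway local server on an ephemeral port.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# One persistent connection reused for several requests: a single TCP
# socket, instead of one transient socket per request.
conn = http.client.HTTPConnection("127.0.0.1", port)
responses = []
for _ in range(3):
    conn.request("GET", "/")
    responses.append(conn.getresponse().read())
conn.close()
server.shutdown()
print(responses)  # [b'ok', b'ok', b'ok']
```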

oluatte (Author) commented Jul 13, 2018

@cgillum I am glad you were able to reproduce the issue. Thanks for looking at it so quickly.

The concurrency throttles look like they might help (thanks for the link, I had not seen these). Just so I understand the effect specific throttles will have in this scenario...

  • Setting "maxConcurrentActivityFunctions" to a lower number will reduce the number of instances on a single VM. Will this make the scale controller create more instances on other VMs, which should keep overall throughput high?
  • The documentation mentions that setting "extendedSessionsEnabled" to true can reduce the number of interactions with storage (which seems good) but that has a detrimental impact on overall throughput.

Which of these throttles (or is it a combination) do you recommend setting for this scenario?

cgillum (Member) commented Jul 21, 2018

Sorry for the delay - somehow I missed this notification for your followup questions (and only saw this when discussing this topic with a colleague).

Without knowing the true source of the excessive connections, it's hard to make a recommendation. In fact, one thing that I will need to figure out is whether it is the Durable extension which is responsible for these, or if it could potentially be something in the functions host. I may need to look at a memory dump to figure this out (shivers).

Getting to your questions:

  • Yes, I expect lowering maxConcurrentActivityFunctions will result in more VMs because the scale controller will detect that it needs more VMs to keep the queue latencies small.
  • I definitely recommend extendedSessionsEnabled if you are doing fan-out/fan-in. It could significantly reduce the number of table storage queries.

oluatte (Author) commented Jul 23, 2018

@cgillum

No worries, thank you for the response. We will tweak the knobs and see how it goes.

One more thing I noticed during the week that may or may not help. Below is a screenshot of metrics from the storage account that backs the function app. These numbers are for a 24-hour period, and there were 7K function executions in that time. The ratio of storage accesses to function executions seems high (especially for queues).

[screenshot: storage account metrics (blob, queue, and table transactions) over the 24-hour period]

Thanks again for your help.

PS. I hope this doesn't go down to memory dumps (shivers on your behalf). Good luck :)

cgillum (Member) commented Jul 23, 2018

Thanks for this data. I'm thinking there are a few levels of throttles we need to consider to reduce the load on storage, especially when things are relatively idle (related issue: #391).

cgillum added the dtfx label and removed the Needs: Investigation 🔍 label on Aug 3, 2018
cgillum (Member) commented Aug 3, 2018

Just an update on this - I've created a fix in DTFx that does two things:

  1. Throttles the concurrency of internal storage operations. This appears to be the cause of the TCP connection spikes, based on some private testing I did. It will primarily help with fan-out, fan-in scenarios.
  2. Increases the max polling delay from 10 seconds to 30 seconds when the functions host is running (scale controller operations are unchanged). This may reduce some of the queue operations, but generally speaking I expect there will be several queue operations for every activity function call.
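Conceptually, the throttling in item 1 amounts to putting a semaphore around internal storage calls. A minimal sketch in Python (the actual fix lives in DTFx; the cap and the operation are illustrative):

```python
import asyncio

MAX_STORAGE_REQUESTS = 5  # illustrative cap, not the value DTFx uses
peak = 0
in_flight = 0

async def storage_operation(throttle: asyncio.Semaphore, i: int) -> int:
    # Stand-in for one internal queue/table/blob request.
    global peak, in_flight
    async with throttle:  # at most MAX_STORAGE_REQUESTS run concurrently
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
    return i

async def main() -> None:
    throttle = asyncio.Semaphore(MAX_STORAGE_REQUESTS)
    await asyncio.gather(*(storage_operation(throttle, i) for i in range(50)))

asyncio.run(main())
print(peak)  # never exceeds MAX_STORAGE_REQUESTS
```

Capping the number of in-flight storage requests bounds the number of simultaneous TCP connections those requests can demand, which is what tames the spikes.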

The improvements will be included in the next release.

oluatte (Author) commented Aug 3, 2018

@cgillum Thank you for the update! That should help for sure. I'll keep an eye out for the next release.

cgillum (Member) commented Aug 30, 2018

The v1.6.0 release is now available and includes the improvements I mentioned. In addition, we're making changes to the Azure Functions Consumption plan to allow a greater number of concurrent TCP connections (from 300 to 600), though the latter change won't be available for another month or two due to the slower platform release cycles.

IMPORTANT NOTE: The v1.6.0 release is compatible with Azure Functions v1 and with Azure Functions v2 builds v2.0.12050 or later. It is not compatible with earlier builds of Azure Functions v2.

cgillum closed this as completed on Aug 30, 2018
oluatte (Author) commented Aug 30, 2018

@cgillum That is excellent news. Thanks for all the hard work looking into this.

oluatte (Author) commented Oct 26, 2018

@cgillum Hey Chris,

Thanks again for all your hard work on this issue. We really appreciated it on that project. Unfortunately, we're facing this same issue on a different project with even greater volume than the previous one. I wanted to ask whether the increase in the TCP connection limit has been deployed/released, as I still see the host error at ~300 connections. If not, do you have more information about when it might go out?

Also, please let me know if this isn't the right place to ask, or if I should create a new issue and reference this one.

cgillum (Member) commented Oct 26, 2018

Re-opening since we're still seeing some occurrences of this problem.

Over the next two weeks the limit will be automatically increased worldwide from 300 to 600. Hopefully that will help for your current project.

At the same time, we're also implementing a temporary hack to force the Azure Storage SDK to limit the number of outbound TCP connections it opens when talking to the storage service. @brettsam is working on a PR for that here: #486

cgillum reopened this on Oct 26, 2018
oluatte (Author) commented Oct 26, 2018

@cgillum That was a super fast response, thank you.

Glad to hear of the timeline & upcoming changes. Will keep an eye out.

brandonh-msft (Member) commented Oct 30, 2018

@cgillum, @brettsam
I'm not sure this fixes the issue... I pulled the latest DF dev/ bits and ran the script we were given which elicits this behavior... just got this in the Live Stream:
[screenshot: Live Stream showing the Host Threshold Exceeded (Connections) error]

host.json for the run looks like:

      {
        "durableTask": {
          "extendedSessionsEnabled": true,
          "extendedSessionIdleTimeoutInSeconds": 30,
          "maxConcurrentActivityFunctions": 1000,
          "maxConcurrentOrchestratorFunctions": 500
        }
      }
