Durable Functions (Fan Out) + Azure Storage = Host Threshold Exceeded (Connections) #389
Very interesting. That's definitely not expected, and I've not seen this kind of issue with Durable Functions running in our Consumption plan yet. Just to make sure I'm understanding correctly: you're pretty confident that nothing in your code could be causing these connections to build up, right? I'll need to see if I can reproduce this locally, but I agree that using so many connections seems excessive. Under the hood we're just using the Azure Storage client SDK and caching client objects, so I wouldn't expect there to be any connection leaks.
I ran a quick load test on my local box and sure enough, I see the number of established TCP connections spike up to almost 200. In my case, it wasn't a fan-out/fan-in, but rather a large number of sequential orchestrations running concurrently. As a potential workaround, could you try limiting the concurrency using the concurrency throttles?
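(For reference, the concurrency throttles live under the durableTask section of host.json. A minimal sketch; the property names match the settings quoted later in this thread, but the values here are illustrative, not recommendations:)

```json
{
  "durableTask": {
    "maxConcurrentActivityFunctions": 10,
    "maxConcurrentOrchestratorFunctions": 10
  }
}
```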
Yeah, it's an interesting issue for sure. I've used functions (durable and non-durable) for high-throughput scenarios on several projects, and trying not to run up against sandbox limits while also going as fast as possible is a battle. We end up needing every connection we can get. I don't know for sure how the storage client works behind the scenes, and I'm not an expert in this area at all. Here's my wild theory. From what I've read (SO, GitHub issues, etc.), the storage client doesn't do any explicit connection pooling/management like some of the other clients (Service Bus, for example). My understanding is that it uses transient connections (open, use, try to close), but some of these "closed" connections actually end up in TIME_WAIT for up to 4 minutes. This behavior is all well and good in a WebJob or App Service plan, but you can bump up against the limit quickly in a Consumption plan.
@cgillum I am glad you were able to reproduce the issue. Thanks for looking at it so quickly. The concurrency throttles look like they might help (thanks for the link, I had not seen these). Just so I understand the effect specific throttles will have in this scenario...
Which of these throttles (or is it a combination) do you recommend setting for this scenario?
Sorry for the delay - somehow I missed the notification for your follow-up questions (and only saw this when discussing the topic with a colleague). Without knowing the true source of the excessive connections, it's hard to make a recommendation. In fact, one thing I will need to figure out is whether it's the Durable extension that is responsible for these connections, or whether it could potentially be something in the Functions host. I may need to look at a memory dump to figure this out (shivers). Getting to your questions:
No worries, thank you for the response. We will tweak the knobs and see how it goes. One more thing I noticed during the week that may or may not help: below is a screenshot of metrics from the storage account that backs the function app. These numbers are for a 24-hour period, during which there were 7K function executions. The ratio of storage accesses to function executions seems high (especially for queues). Thanks again for your help. PS. I hope this doesn't go down to memory dumps (shivers on your behalf). Good luck :)
Thanks for this data. I'm thinking there are a few levels of throttles we need to consider to reduce the load on storage, especially when things are relatively idle (related issue: #391).
Just an update on this - I've created a fix in DTFx that does two things:
The improvements will be included in the next release.
@cgillum Thank you for the update! That should help for sure. I'll keep an eye out for the next release.
The v1.6.0 release is now available and includes the improvements I mentioned. In addition, we're making changes to the Azure Functions Consumption plan to allow a greater number of concurrent TCP connections (from 300 to 600), though the latter change won't be available for another month or two due to the slower platform release cycles. IMPORTANT NOTE: The v1.6.0 release is compatible with Azure Functions v1 and with Azure Functions v2 builds v2.0.12050 or later. It is not compatible with earlier builds of Azure Functions v2.
@cgillum That is excellent news. Thanks for all the hard work looking into this.
@cgillum Hey Chris, thanks again for all your hard work on this issue. We really appreciated it on that project. Unfortunately, we're facing the same issue on a different project with even greater volume than the previous one. I wanted to ask whether the increase in TCP connections has been deployed/released, as I still see the host error at ~300 connections. If not, do you have more information about when it might go out? Also, please let me know if this isn't the right place to ask, or if I should create a new issue and reference this one.
Re-opening since we're still seeing some occurrences of this problem. Over the next two weeks the limit will be automatically increased worldwide from 300 to 600. Hopefully that will help for your current project. At the same time, we're also implementing a temporary hack to force the Azure Storage SDK to limit the number of outbound TCP connections it opens when talking to the storage service. @brettsam is working on a PR for that here: #486
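(For anyone curious what such a cap looks like in .NET: per-endpoint outbound connection limits are controlled through ServicePoint. A rough sketch of the general idea; the limit of 50 is arbitrary, and this is not the actual change in the PR:)

```csharp
using System;
using System.Net;
using Microsoft.WindowsAzure.Storage;

static class StorageConnectionLimits
{
    // Illustrative only: cap concurrent TCP connections per storage endpoint.
    // The value 50 is an arbitrary example, not what the PR uses.
    public static void Apply(string connectionString)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);

        foreach (Uri endpoint in new[] { account.BlobEndpoint, account.QueueEndpoint, account.TableEndpoint })
        {
            ServicePoint sp = ServicePointManager.FindServicePoint(endpoint);
            sp.ConnectionLimit = 50;
        }
    }
}
```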
@cgillum That was a super fast response, thank you. Glad to hear of the timeline & upcoming changes. Will keep an eye out.
@cgillum, @brettsam host.json for the run looks like:

    "extendedSessionsEnabled": true,
    "extendedSessionIdleTimeoutInSeconds": 30,
    "maxConcurrentActivityFunctions": 1000,
    "maxConcurrentOrchestratorFunctions": 500
Hey folks,
Thanks for all your hard work on Durable Functions. I've used them on multiple projects and I'm a big fan.
On a recent project, we ran into an issue that occurs when durable functions interact with Azure Storage. For example:
This setup causes the "Host Threshold Exceeded (Connections)" error very quickly. I was able to monitor the outbound connections while running the function app locally, and there were far too many connections to storage (queues, tables, blobs). I even commented out a bunch of our code and still saw 120+ outbound connections to storage within 60 seconds (I have the IPs available), and the count only kept increasing.
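(For anyone wanting to reproduce this kind of measurement from .NET, a minimal sketch using System.Net.NetworkInformation. Note the API reports machine-wide connections, not just the function host's, and filtering on port 443 assumes the storage traffic is HTTPS:)

```csharp
using System;
using System.Linq;
using System.Net.NetworkInformation;

class ConnectionMonitor
{
    static void Main()
    {
        // Snapshot all TCP connections on the machine (not just this process).
        TcpConnectionInformation[] all = IPGlobalProperties
            .GetIPGlobalProperties()
            .GetActiveTcpConnections();

        // Count established outbound HTTPS connections, grouped by remote IP.
        var established = all.Where(c =>
            c.State == TcpState.Established && c.RemoteEndPoint.Port == 443);

        foreach (var group in established.GroupBy(c => c.RemoteEndPoint.Address))
        {
            Console.WriteLine($"{group.Key}: {group.Count()} connection(s)");
        }
    }
}
```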
I think there is an issue here. Either the outbound connections from the durable framework shouldn't count against the sandbox limits or the durable framework should minimize the number of outbound connections it makes. Taking up so many of the available connections really limits what is left over for the actual logic of the function.
Please note that we followed all the connection management recommendations (lazy-initialized static clients, etc.).
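(For concreteness, by "lazy-initialized static clients" we mean roughly this pattern, sketched against the WindowsAzure.Storage SDK; the StorageClients class name and the use of the AzureWebJobsStorage setting are our own illustration:)

```csharp
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;
using Microsoft.WindowsAzure.Storage.Table;

// Hypothetical helper: one shared client instance per client type, created
// once and reused across all function invocations, instead of newing up a
// client per invocation.
public static class StorageClients
{
    private static readonly Lazy<CloudStorageAccount> Account =
        new Lazy<CloudStorageAccount>(() =>
            CloudStorageAccount.Parse(
                Environment.GetEnvironmentVariable("AzureWebJobsStorage")));

    public static readonly Lazy<CloudQueueClient> Queues =
        new Lazy<CloudQueueClient>(() => Account.Value.CreateCloudQueueClient());

    public static readonly Lazy<CloudTableClient> Tables =
        new Lazy<CloudTableClient>(() => Account.Value.CreateCloudTableClient());
}
```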
Some details about the function app: