Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TCP connections much higher than GatewayModeMaxConnectionLimit #1170

Closed
jesuissur opened this issue Jan 23, 2020 · 20 comments · Fixed by #1740
Closed

TCP connections much higher than GatewayModeMaxConnectionLimit #1170

jesuissur opened this issue Jan 23, 2020 · 20 comments · Fixed by #1740
Assignees
Labels
bug Something isn't working needs-investigation

Comments

@jesuissur
Copy link

Describe the bug
In Gateway mode with GatewayModeMaxConnectionLimit around 500, TCP connections used by our Azure App service towards our Cosmos DB goes far beyond the 500 limit.

To Reproduce

  1. We got a single CosmosClient built like that (new CosmosClientBuilder(endpoint, authKey).WithConnectionModeGateway(500).Build())
  2. We use a single Container, hence a single database and collection. Code looks like that client.GetContainer(config.OurDatabase, config.OurDataCollection)
  3. From there, we use the container in 2 ways. Either we read from with container.GetItemQueryStreamIterator or write to with await container.UpsertItemStreamAsync

Our app is living alone on an app service plan ("server")

Expected behavior
We expect to have at most 500 TCP connections towards our Cosmos DB but under some load (5K Cosmos request under 1 minute) we get to the app service's TCP connections limit (in our case, ~4K)

Actual behavior
We got many SocketExceptions resulting in faulted Cosmos Queries.

One of our TCP peak problems:
image

Environment summary
SDK Version: 3.5.0
OS Version: Windows App service P2v2

Thanks a lot
Phil

@j82w j82w added bug Something isn't working needs-investigation labels Jan 27, 2020
@j82w
Copy link
Contributor

j82w commented Jan 27, 2020

@jesuissur can you try setting the max concurrency on the query to see if it helps reduce the connections?

@kirankumarkolli
Copy link
Member

At conceptual level each collection data will get sharded/partitioned onto multiple VM's. In Direct mode the SDK will connection to the VM's directly where the data resides, where as in Gateway mode served by an intermediary proxy VM's. Hence Gateway mode can get better connection sharing.

For functions/app-services our recommendation is use plan/offer which doesn't have connection limits for better performance (i.e. Direct access) or use Gateway mode.

@jesuissur
Copy link
Author

@j82w Each one of our query goes to one and only partition, so we guess the max concurrency would not have any effect. And the documentation also states the default value would not do any parallelism

@jesuissur
Copy link
Author

@kirankumarkolli we already use the Gateway mode. The Azure's plan/offer without outbound connections limit seems overkill for our needs and quite pricey to have the same performance

@j82w
Copy link
Contributor

j82w commented Jan 29, 2020

@jesuissur stepping through the code it looks like it should be honoring the limit. Can you please answer the following questions.

  1. What version of .NET are you using? .NET Core 3.0?
  2. What OS are you using?
  3. Is the TCP count for concurrent request or is that total number of TCP connections over a time period?
  4. Can you use netstat to verify it's using all those port concurrently?

@jesuissur
Copy link
Author

jesuissur commented Jan 31, 2020

@j82w here are our facts

  1. What version of .NET are you using? .NET Core 3.0?

.Net core 2.2

  1. What OS are you using?

Windows on Azure App service Plan P2v2 and P3v2. I don't know the specific windows version behind those plans

  1. Is the TCP count for concurrent request or is that total number of TCP connections over a time period?

It is the highest concurrent TCP outbound connections at one moment in time. More info here: https://azure.github.io/AppService/2018/03/01/Deep-Dive-into-TCP-Connections-in-App-Service-Diagnostics.html

  1. Can you use netstat to verify it's using all those port concurrently?

We cannot get those stats without the Azure Diagnostic tools because we cannot detect those peaks moment in advance. But pretty sure the Diagnostics tools use something similar to get those in (almost) real time (15 minutes delay)

Thanks again
Phil

@jesuissur
Copy link
Author

Hi there

We still got massive problem with our connections count above the 500 limit we set as the Gateway's max connections limit.

Looking at the SDK code, the only (relevant) place the max connections limit seems to be used is at this place:
image

Does the new Cosmos SDK use the legacy ServicePoint mechanism to connect to Cosmos?
If so, would it be possible the problem is coming from the not implemented CurrentConnections property in .Net core?

@jesuissur
Copy link
Author

By looking further at the code, it does seems to use the HttpClient to connect to Cosmos and not the ServicePoint at all.

If that's really the case, don't you think we miss something like this when creating the HttpClientHandler?
image

@j82w
Copy link
Contributor

j82w commented Feb 14, 2020

Are you creating a single CosmosClient for the application or are you creating multiple instances?

@jesuissur
Copy link
Author

jesuissur commented Feb 14, 2020

@j82w as we stated in the initial post we create a single CosmosClient through the builder. We've check and recheck this fact.

We just built our own version of the SDK with the aforementioned modification in the CosmosClient.CreateHttpClientHandler and did some tests. It seems to fix the problem. We will do further testing but it looks like missing piece. Our testing is still on-going

jesuissur added a commit to DashThis/azure-cosmos-dotnet-v3 that referenced this issue Feb 17, 2020
@jesuissur
Copy link
Author

After a few tests from our side, the previous commit indeed fix our problem.

This fix is probably not the one you would like in the official release because we use a Gateway setting directly into the CosmosClient. Nonetheless, something like this is missing

Thanks
Phil

@j82w j82w self-assigned this Feb 19, 2020
@j82w
Copy link
Contributor

j82w commented Feb 19, 2020

@jesuissur thanks for investigating. I'll try to get a PR out with a fix before the next release.

@j82w
Copy link
Contributor

j82w commented Feb 20, 2020

There seems to be a bug for .NET Framework using ServicePointManager. I'm not sure if it's impact .net core.

@vsadams
Copy link

vsadams commented Jul 9, 2020

Do we have an idea on which release this is slated for? So glad you guys found this. We have been struggling with this for a while. We have tried so many things on the consumer end to fix and could never figure out a solution.

@j82w
Copy link
Contributor

j82w commented Jul 9, 2020

This PR #1548 needs to go in first. Then this fix can be added.

@vsadams
Copy link

vsadams commented Jul 14, 2020

I am having the same issue. Setting the HttpClientFactory on the CosmosClientOptions with a func handle that limits the MaxConnectionsPerServer seems to be a successful workaround for the time being.

For context, my environments run .net core 3.1 on a Windows Azure app services.

  
//DI setup for injecting an IHttpClientFactory
//GatewayModeMaxConnectionLimit is a user variable for overriding the default 50 specified in the sdk. 
//for the workaround 50 is just hard coded
services.AddHttpClient("NamedCosmosClient") //named client is just bc I have multiple Cosmos Service connections.
        .ConfigurePrimaryHttpMessageHandler((sp) =>
        {
            var handler = new SocketsHttpHandler
            {
                MaxConnectionsPerServer = GatewayModeMaxConnectionLimit ?? 50,
            };

            return handler;
        });
//cosmos service setup
var cosmosOptions = new CosmosClientOptions
{
    HttpClientFactory = ()
        => HttpClientFactory.CreateClient("NamedCosmosClient");
};

var client = new CosmosClient(uri, key, cosmosOptions);

Here is a before and after of a job that usually wreaks havoc
image
vs
image

Most of the connections in the above are too cosmos.

@vsadams
Copy link

vsadams commented Jul 16, 2020

Further update, while this did help in some scenarios, it did not help in all. There is still another client being used by the sdk which seems to be making more connections.

@j82w
Copy link
Contributor

j82w commented Jul 16, 2020

@vsadams the issue with the multiple clients is getting fixed in this PR: #1548

@vsadams
Copy link

vsadams commented Jul 16, 2020

Sweet! Thank you @j82w

@j82w
Copy link
Contributor

j82w commented Aug 7, 2020

@vsadams and @jesuissur 3.12.0 is released with the fixes. Please create a new issue if you are still seeing an issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs-investigation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants