Frequent request timeouts (408) #1610
Comments
One more addition or idea: How reasonable would it be to … ?
Thank you @j82w.
@ealsur any suggestions?
Could there be a problem with thread exhaustion?
I looked at the server-side logs based on the info in the exception and I don't see any timeouts or other errors besides some 429s. It's possible you are hitting SNAT port exhaustion.
Can this be verified somehow? Will it be shown in the diagnostics property?
Thank you @ealsur. Those articles refer to VMs, so I found this one which talks specifically about App Services, which is our case. All of the solutions mentioned there are about changing the code, e.g.
However, can I somehow enforce these when using the Cosmos SDK (maybe apart from setting different retry timeouts)? We are already using a single instance of the client per application.
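For what it's worth, a minimal sketch of the connection-reuse setup usually recommended against SNAT port exhaustion: one CosmosClient for the whole application, registered as a singleton, with Direct-mode options that keep TCP connections pooled. The option values below are illustrative assumptions, not settings taken from this thread.

```csharp
// Sketch only: a single CosmosClient per application so TCP connections are pooled
// and reused, which is the usual mitigation for SNAT port exhaustion.
// Option values are illustrative, not recommendations from this thread.
using System;
using Microsoft.Azure.Cosmos;
using Microsoft.Extensions.DependencyInjection;

public static class CosmosClientRegistration
{
    public static IServiceCollection AddCosmos(this IServiceCollection services, string connectionString)
    {
        // Singleton: one client (and one connection pool) per application lifetime.
        return services.AddSingleton(_ => new CosmosClient(connectionString, new CosmosClientOptions
        {
            ConnectionMode = ConnectionMode.Direct,
            // Keep idle TCP connections alive longer so they are reused instead of reopened.
            IdleTcpConnectionTimeout = TimeSpan.FromMinutes(20),
            // Private port pool reduces pressure on ephemeral/SNAT ports in Direct mode.
            PortReuseMode = PortReuseMode.PrivatePortPool
        }));
    }
}
```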
@j82w So even when I get a 408 response, it doesn't necessarily mean that it actually came from the server? It might be the client telling me "I could not even open the connection due to SNAT port exhaustion"?
@skurik For App Service this one is also good: https://azure.github.io/AppService/2018/03/01/Deep-Dive-into-TCP-Connections-in-App-Service-Diagnostics.html
Did you ever get a resolution to this, @skurik? I am getting the same :(
@dan-matthews please check out the request timeout troubleshooting guide.
Thanks for the feedback @j82w, I have been through that in detail already. I'm running on a Linux App Service with a good partition key (document id) and I've tested in both Direct and Gateway modes, and played with idle timeouts and port re-use. I'm async in my whole architecture, use a singleton to hold my Cosmos client, and I'm using .NET Core 3.1 and the latest version of the Cosmos DB SDK (3.12.0). I've also used the troubleshooter for my TCP connections in my App Service and everything is stable at 50 to 60 connections, nothing failing. The CPU on the App Service is running stable at about 5%, memory at about 40%, and the RUs of the Cosmos DB are peaking below 500 (it's autopilot up to 4,000). I've put logging on my Cosmos DB and it seems the requests don't even get to it, because there aren't any queries running there for more than a few milliseconds (or, if a query is running, it's returning quickly and the response is getting lost). Basically, the entire architecture is just ticking over, not breaking a sweat at all. Yet, no matter what I try, I still get 408s and socket timeouts on random requests, normally at a rate of about 1 in 100. It also doesn't matter whether the App Service has just started or has been running a few hours. The error always occurs on the MoveNext of a Cosmos method - whether it's a FeedIterator, a stream iterator or just trying CreateContainerIfNotExistsAsync. Here is an example of one - this hung for 1.1 minutes then crashed out with a CanceledException:
Or another, this time it hung for 1.1 mins and then crashed with a SocketException:
It basically seems like it makes the request and then loses the response, so it just hangs. If you have any other ideas I'd love to hear them because I'm kinda running out of options :) I did read somewhere to change the await to a Wait() on a Task, so I tried that with no luck. I'm desperate. I'll try anything ;)
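As an aside, a minimal sketch of the fully asynchronous FeedIterator pattern being described above (class and query names are placeholders); blocking on these calls with Wait() or .Result tends to cause thread-pool starvation rather than fix timeouts.

```csharp
// Sketch only (names are placeholders): drain a query with FeedIterator end-to-end
// asynchronously, passing a CancellationToken so a hung request can be abandoned
// instead of blocking a thread-pool thread with Wait() or .Result.
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public class CosmosQueryRunner
{
    private readonly Container _container; // resolved once from the singleton CosmosClient

    public CosmosQueryRunner(Container container) => _container = container;

    public async Task<List<T>> QueryAsync<T>(string sql, CancellationToken cancellationToken)
    {
        var results = new List<T>();
        FeedIterator<T> iterator = _container.GetItemQueryIterator<T>(new QueryDefinition(sql));

        while (iterator.HasMoreResults)
        {
            // Await (never block) so the thread is released while the request is in flight.
            FeedResponse<T> page = await iterator.ReadNextAsync(cancellationToken);
            results.AddRange(page);
        }
        return results;
    }
}
```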
Are you by any chance doing a … ? I would recommend contacting Azure Support.
No, I'm actually storing the container in a member variable in the singleton service, so I only ever resolve it once. The only requests going out are just simple queries. I guess I will have to contact Azure Support... appreciate the feedback though!
Any updates? We seem to be having a similar issue.
All http requests support username and pw. No username="-", no password="-", password="*".
FYI, we are running into a similar issue with Azure Functions. I will open a separate issue.
@j82w Often I get RequestTimeout errors. Below is a sample exception message:
Response status code does not indicate success: RequestTimeout (408); Substatus: 0; ActivityId: 676b0cc7-e472-4b9a-985f-190cd66cae94; Reason: (GatewayStoreClient Request Timeout. Start Time UTC:07/19/2021 10:16:09; Total Duration:65011.0795 Ms; Request Timeout 65000 Ms; Http Client Timeout:65000 Ms; Activity id: 676b0cc7-e472-4b9a-985f-190cd66cae94;)
I have followed the troubleshooting guide but couldn't fix the issue. Background: we are making a ReadContainerAsync call as a health check for our service. We hit this call 30 times per minute from each pod. The failure is seen at an average of 1 per minute overall. The frequency is not too high in comparison to the total requests made, but we would like to fix it if possible.
@SumiranAgg do not use ReadContainerAsync as a health check. The read container call is a metadata operation. Metadata operations in Cosmos DB are limited and will eventually get throttled; the SDK itself only calls it once on initialization. I would recommend doing a data plane operation like ReadItemStream on a non-existing document. This will make sure you can actually connect and get a response from the container. Regarding the RequestTimeout, make sure you are using the latest SDK, 3.20.1. If it's still an issue after these changes it would be best to open a support ticket.
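A minimal sketch of what that data-plane health check could look like (identifiers are placeholders, not from this thread): with the stream API a missing document comes back as a 404 instead of an exception, so the probe exercises the full request path cheaply.

```csharp
// Sketch only (names are placeholders): health check via a data-plane read of a
// document that intentionally does not exist. With the stream API the 404 is
// returned, not thrown, so NotFound is the expected "healthy" outcome.
using System.Net;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public class CosmosHealthCheck
{
    private readonly Container _container; // resolved once from the singleton CosmosClient

    public CosmosHealthCheck(Container container) => _container = container;

    public async Task<bool> IsHealthyAsync(CancellationToken cancellationToken)
    {
        using ResponseMessage response = await _container.ReadItemStreamAsync(
            id: "health-check-probe",                          // intentionally non-existent id
            partitionKey: new PartitionKey("health-check-probe"),
            cancellationToken: cancellationToken);

        // NotFound is the expected result for a missing document; anything else
        // that isn't a success indicates a connectivity or service problem.
        return response.StatusCode == HttpStatusCode.NotFound || response.IsSuccessStatusCode;
    }
}
```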
Did anyone ever figure this out? I'm seeing the same thing. I have a simple Azure API App that calls Cosmos DB to read a record. In my case, the API is pretty simple. I can start the service (running under a Linux container/.NET 6) and it doesn't do anything until the web request comes in. That first request creates the Cosmos client. At least one stack shows CreateDatabaseIfNotExistsAsync and MoveNext at the top. This code was working fine under the Azure API App just a few days prior. I am using direct mode. I feel like it must be vnet related but haven't had any luck identifying what that may be. The subnets all permit everything within the vnet and have the Cosmos service enabled. The same vnet is attached to the API App and the Cosmos instance. Any ideas are appreciated.
@j82w What is the correct way to do this if I want to do
In my case, it was something in the Azure cloud. I spent 3+ weeks with Azure support and they never figured it out. I rebuilt my Azure resources from scripts, deployed the exact same binary, and it worked. A week later, the broken/first attempt began working again.
We recently started using Azure Cosmos DB and it became obvious we don't fully understand how to deal with some of the issues it brings.
In particular, we are observing a large number of request timeouts.
The exceptions look like this:
We are migrating lots of data from SQL Server to Cosmos and the access pattern is as follows:
There may be other threads writing to Cosmos at the same time, but these will typically write just a few items at a time.
We are using a single instance of CosmosClient throughout the application.
On Azure Portal, I can see we are not being throttled:
So my question is basically - why are the requests timing out so often when we don't even hit the provisioned RU limit? (We currently have 11,000 RU/s in autoscale mode.)
Are we using it wrong? Is there a recommended pattern for inserting a batch/large amount of data at once? AllowBulkExecution is not really useful as it waits up to 1 second for a batch to fill, and there will be situations where the batch will just not fill up quickly enough (the above migrator runs only every 10 minutes).
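For completeness, a sketch of the concurrent-task pattern bulk mode is designed around (names and the connection string are placeholders): the batches only fill quickly when many operations are in flight at the same time, which is why sequential awaits see the flush delay.

```csharp
// Sketch only (names are placeholders): AllowBulkExecution groups operations that
// are in flight concurrently, so the usual pattern is to start many item writes
// as tasks and await them together rather than one at a time.
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class BulkInsertSketch
{
    // Client created once per application, with bulk mode enabled:
    // var client = new CosmosClient(connectionString,
    //     new CosmosClientOptions { AllowBulkExecution = true });

    public static async Task InsertManyAsync<T>(Container container, IReadOnlyList<(T Item, string PartitionKey)> items)
    {
        var tasks = new List<Task>(items.Count);
        foreach (var (item, partitionKey) in items)
        {
            // Do not await here; the bulk dispatcher batches the concurrent operations.
            tasks.Add(container.CreateItemAsync(item, new PartitionKey(partitionKey)));
        }
        await Task.WhenAll(tasks);
    }
}
```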
Can request timeouts also be caused by rate throttling? (Though that would not make much sense, as the Azure Portal shows we are not being rate-throttled.)
I read through the request timeout troubleshooting guide and the only relevant points seem to be these:
And these two points go back to my question - how do I evaluate precisely what the reason for the timeouts is? I can try raising the provisioned RUs until the timeouts stop, but that hardly seems like a reasonable approach.
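One way to get that precision (a sketch, with placeholder names): capture the SDK diagnostics attached to each failed request; the diagnostics string breaks down where the elapsed time went (for example connection acquisition vs. transit vs. service processing), which is what attributes a 408 to the client, the network, or the service.

```csharp
// Sketch only (names are placeholders): log CosmosException.Diagnostics for 408s;
// the diagnostics describe where the request spent its time.
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class TimeoutDiagnostics
{
    public static async Task<ItemResponse<T>> ReadWithDiagnosticsAsync<T>(
        Container container, string id, string partitionKey)
    {
        try
        {
            return await container.ReadItemAsync<T>(id, new PartitionKey(partitionKey));
        }
        catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.RequestTimeout)
        {
            // Replace Console.WriteLine with the application's logger.
            Console.WriteLine(ex.Diagnostics.ToString());
            throw;
        }
    }
}
```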
Thank you for any insight.