Performance decrease when upgrade from 3.31.0 to 3.31.1 #4906

LouisSikkes1 · 2024-11-26T10:34:08Z

Describe the bug

After upgrading package Microsoft.Azure.Cosmos from version 3.31.0 to 3.31.1, the average response time of our API drastically increases. This causes our web application to become very slow and unusable for our customers.

To Reproduce

Upgrade package Microsoft.Azure.Cosmos from version 3.31.0 to 3.31.1.

Expected behavior

We expect a patch version upgrade not to impact the performance negatively.

Actual behavior

The patch version upgrade drastically increases the average response time. I am unable to upload a picture for some reason, but our graph clearly shows the difference in average response time with and without the package upgrade. Before the upgrade, the average response time was stable below 5 ms. After we upgraded Microsoft.Azure.Cosmos to 3.31.1, it rose to around 20 ms. After we downgraded to 3.31.0 again, the average response time returned to below 5 ms.

These response times may seem relatively low because the graph is from a test environment with little activity, which makes the difference easy to see. In a production environment, the average response time after the package upgrade is much higher (around 1 second). Production has hundreds of users, so our application becomes unusably slow.

Workaround

The issue we encounter appears to be the same as the one reported in #3613, which has since been closed; the symptoms match what was reported there. That issue was closed with the solution of upgrading the app service plan from P1V2 to P1V3. We tried upgrading the app service plan on our test environment as well, and this also resolves the issue for our application. However, moving up a single tier doubles the cost, so we do not consider this a viable option. As in the reported issue, we do not understand why the change in plan matters, since neither the CPU nor the memory of the machine appears to be a bottleneck.

Environment summary

SDK: .NET8
App service plan: Premium v3 P0V3

Additional context

Microsoft.Azure.Cosmos 3.31.0 is over 2 years old and several new versions have come out in the meantime. We have tried newer versions as well in the hope that the issue might have been fixed in the meantime. Unfortunately, we can still reproduce the issue with version 3.45.0.
I have compared the changes done in the repository between version 3.31.0 and 3.31.1 and the only change that happened was the upgrade of Microsoft.Azure.Cosmos.Direct from 3.29.1 to 3.29.4. releases/3.31.0...releases/3.31.1
I have also looked at the changes done in the Microsoft.Azure.Cosmos.Direct package. Due to the large amount of changes, I cannot make sense of this. https://github.com/Azure/azure-cosmos-dotnet-v3/commits/msdata/direct/Microsoft.Azure.Cosmos/src/direct

sourabh1007 · 2024-11-26T16:27:28Z

Can you please share the diagnostic strings for the patch operation with both the slower and the faster SDK version? Comparing them side by side may show why you are getting high latency.

LouisSikkes1 · 2024-12-02T15:11:16Z

I have gathered some diagnostics by implementing the solution described in this post (1), which was linked in #3613. It collected data for more than 3 days, during which only 3 calls to Cosmos were reported as taking more than 500 ms; see the zip below. Nevertheless, the request durations in our logging clearly show that the problem is still occurring in the average duration of our requests.

(1) https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/troubleshoot-dotnet-sdk-slow-request?tabs=cpu-new#capture-diagnostics

cosmos_diagnostics.zip
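
For readers following along, a minimal sketch of one way to capture diagnostics for slow requests is shown here. It is not necessarily what was implemented for the attached zip: the class names `SlowRequestDiagnosticsHandler` and `CosmosClientFactory`, the use of a custom request handler, and the console logging are assumptions made for illustration. The 500 ms threshold is taken from the comment above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Hypothetical handler: writes the diagnostics of any request whose
// client-side elapsed time exceeds the threshold.
public sealed class SlowRequestDiagnosticsHandler : RequestHandler
{
    // Assumed threshold, matching the 500 ms figure mentioned above.
    private static readonly TimeSpan Threshold = TimeSpan.FromMilliseconds(500);

    public override async Task<ResponseMessage> SendAsync(
        RequestMessage request,
        CancellationToken cancellationToken)
    {
        ResponseMessage response = await base.SendAsync(request, cancellationToken);

        if (response.Diagnostics.GetClientElapsedTime() > Threshold)
        {
            // Replace with the application's own logger; Console is a placeholder.
            Console.WriteLine(response.Diagnostics.ToString());
        }

        return response;
    }
}

public static class CosmosClientFactory
{
    // Hypothetical factory: registers the handler on a single shared client.
    public static CosmosClient Create(string accountEndpoint, string accountKey)
    {
        var options = new CosmosClientOptions();
        options.CustomHandlers.Add(new SlowRequestDiagnosticsHandler());
        return new CosmosClient(accountEndpoint, accountKey, options);
    }
}
```

A handler like this only sees requests that the SDK sends, so latency added elsewhere in the process would not appear in its output.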

LouisSikkes1 · 2024-12-18T12:51:46Z

I have investigated further to pinpoint the source of the issue. I gradually removed the parts of our code base related to Cosmos DB to determine when the symptoms stopped showing. This made clear that merely having an open connection to Cosmos DB is enough to produce the latency problems, so the problem is not related to any queries or other operations that we perform. The code follows the best practices outlined in the documentation regarding creating only a single client throughout the app's lifetime. (1)

I am currently looking into any configuration options that can be passed when setting up the connection. I was wondering if this problem is known and if you have any recommendations for the type of machine that we are using. As mentioned in the initial description, our app is hosted on Azure using the 'Premium v3 P0V3' plan. The symptoms mentioned are still present when using the default options as follows:
var cosmosClient = new CosmosClient(_options.AccountEndpoint, _options.AccountKey);

Note: I also read in the best practices document (1) that it is recommended to use a machine that has at least 8 GB of memory and 4 cores. I realize that the 'Premium v3 P0V3' plan does not fulfill these requirements. However, as mentioned before, a machine that does is at least twice as expensive as the one we are currently using, so I am trying to avoid having to scale up.

(1) https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/best-practice-dotnet
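
As an illustration of the kind of configuration option referred to above, the sketch below creates the client in Gateway mode instead of the default Direct mode. This is an assumption offered as a possible comparison experiment, not something proposed or confirmed in this thread: since the only change between 3.31.0 and 3.31.1 was the Microsoft.Azure.Cosmos.Direct package, running without the Direct transport might help narrow down where the latency comes from. The helper name `CosmosClientSetup.CreateGatewayClient` is hypothetical.

```csharp
using Microsoft.Azure.Cosmos;

public static class CosmosClientSetup
{
    // Hypothetical helper; the parameters stand in for
    // _options.AccountEndpoint and _options.AccountKey.
    public static CosmosClient CreateGatewayClient(string accountEndpoint, string accountKey)
    {
        var options = new CosmosClientOptions
        {
            // The SDK default is ConnectionMode.Direct; Gateway sends
            // requests over HTTPS through the Cosmos DB gateway instead.
            ConnectionMode = ConnectionMode.Gateway,
        };

        return new CosmosClient(accountEndpoint, accountKey, options);
    }
}
```

Gateway mode usually adds latency per request, so this would serve only as a way to compare the background effect on API response times, not as a recommended production setting.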

FabianMeiswinkel · 2024-12-18T13:42:34Z

Looks like these are the commits consumed

https://msdata.visualstudio.com/CosmosDB/_git/CosmosDB/commit/3fad226ed6247450bde8b5fad151ed7349d560e8?refName=refs/heads/sdkReleases/direct/EN20220407_3.29.4

https://msdata.visualstudio.com/CosmosDB/_git/CosmosDB/commit/3152c9d43fdfa539755f25e1cd9bf7bfa8166b56?refName=refs/heads/sdkReleases/direct/EN20220407_3.29.4

https://msdata.visualstudio.com/CosmosDB/_git/CosmosDB/commit/fd78585262f644246f54a8f16201fd71871e296c?refName=refs/heads/sdkReleases/direct/EN20220407_3.29.4

The first one has the highest likelihood to change client-side resource consumption

LouisSikkes1 · 2024-12-18T15:11:35Z

I do not have access to these links

FabianMeiswinkel · 2024-12-18T15:12:57Z

> I do not have access to these links

Understood - added them for engineer looking into this

kirankumarkolli · 2024-12-18T16:19:12Z

@LouisSikkes1 the attached diagnostics ZIP file is showing as not valid. Can you please attach it again?

LouisSikkes1 · 2024-12-19T09:55:02Z

For me it's working fine if I download it. Nonetheless, here is the zip again.

cosmos_diagnostics.zip

LouisSikkes1 added the needs-investigation label Nov 26, 2024

microsoft-github-policy-service bot added the customer-reported Issue created by a customer label Nov 26, 2024
