
Upgrade Resiliency: Fixes Code to Clean-up Unhealthy Connection and LbChannelState Object. #4928

Conversation

@kundadebdatta kundadebdatta commented Dec 10, 2024

Pull Request Template

Description

This PR bumps the Cosmos.Direct package version to 3.37.3, which was released on top of the EN20241007 release. In a nutshell, the change is focused on cleaning up the unhealthy dangling connections that were created as part of the "Advanced Replica Selection" flow.

Problem: One of our recent conversations with the IC3 team helped us uncover a potential issue with the RNTBD connection creation and management flow in the LoadBalancingPartition. The team uses .NET SDK version 3.39.0-preview to connect to Cosmos DB. Recently they encountered a potential memory leak in some of their clusters, and upon investigation it appeared that one of the root causes is that the underlying CosmosClient keeps a high number of unhealthy and unused LbChannelStates.

In a nutshell, below are a few of the account configurations and facts:

account-1: 9 partitions with 1 unique tenant. There are approximately 4 to 8 clients for this tenant; the client count is 2 × the number of replica regions. Connection warm-up is enabled on this account.

account-2: 2592 partitions with 249 tenants/feds. Connections created in the happy-path scenario: 249 × Y (Y = number of active clients for that account). Connection warm-up is disabled on this account.

account-3: 27 partitions with 13 tenants/feds. Clients are created via CreateAndInitialize, so connection warm-up is enabled on this account.

To understand this in more detail, please take a look at the memory dump below.

![ic3_memory_dump_accounts_hidden](https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/b78585de-5318-436a-8511-0107d570e8d7)

[Fig-1: The above figure shows a snapshot of the memory dump taken for multiple accounts. This also unveils the potential memory leak caused by the unhealthy connections.]

Upon further analysis of the memory dump, it is clear that:

  • The number of stale unhealthy connections is higher in the accounts where replica validation is enabled along with connection warm-up.

  • Without connection warm-up, the number of stale unhealthy connections is comparatively lower, but still large enough to increase the memory footprint.

    ![ic3_new_sdk_high_memory_accounts_hidden](https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/ea3a4f82-daca-40f8-bcd6-d29da806a3c8)

    [Fig-2: The above figure shows how the memory footprint increased over time along with incoming requests. The service eventually had to be restarted to free up the memory.]

  • Even without the replica validation feature, the memory footprint showed a consistent increase over time.

    ![ic3_old_sdk_high_memory_accounts_hidden](https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/471a50c3-1786-4f12-b5e3-1ab192c6fef5)

    [Fig-3: The above figure shows the memory consumption of the IC3 partner-api service, which uses an older version (v3.25.0) of the .NET SDK; its memory consumption kept increasing over time.]

Analysis: Upon digging further into the memory dump and reproducing the scenario locally, it was noted that:

  • With Replica Validation Enabled: Each impacted LoadBalancingPartition was holding more than 1 unhealthy, stale LbChannelState (a wrapper around the Dispatcher and a Channel) when the connection to the backend replica was closed deterministically.

  • With Replica Validation Disabled: Each impacted LoadBalancingPartition was holding exactly 1 unhealthy, stale LbChannelState (a wrapper around the Dispatcher and a Channel) when the connection to the backend replica was closed deterministically.

Let's take a look at the below diagram to understand this in more detail:

![image](https://github.com/Azure/azure-cosmos-dotnet-v3/assets/87335885/f4e2d494-962b-41da-ae80-0d542d62accf)

[Fig-4: The above figure shows an instance of the LoadBalancingPartition holding more than one unhealthy LbChannelState entry.]

The above memory dump snapshot makes it clear that these stale LbChannelState entries are kept in the LoadBalancingPartition until they are removed from the openChannels list, which is responsible for maintaining the set of channels (healthy or unhealthy) for that particular endpoint. If they are not cleaned up proactively (which is exactly the case here), they end up claiming extra memory. As the number of partitions and connections grows over time, things get worse, with all of these unused LbChannelStates claiming more and more memory and effectively causing a memory leak. This is the potential root cause of the increased memory consumption.
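
To make the failure mode concrete, here is a heavily simplified, hypothetical C# sketch of the behavior described above. The names (LoadBalancingPartition, LbChannelState, openChannels) mirror the types seen in the dump, but the bodies are illustrative stand-ins and do not reflect the actual Cosmos.Direct internals.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified model of the leak described above; not the actual
// SDK types, just the shape of the problem.
internal sealed class LbChannelState
{
    // Stands in for the wrapper around the Dispatcher and the Channel,
    // including whatever buffers and native resources they reference.
    public bool IsHealthy { get; set; } = true;
    public byte[] Buffers { get; } = new byte[64 * 1024];
}

internal sealed class LoadBalancingPartition
{
    private readonly List<LbChannelState> openChannels = new();

    public void Open(LbChannelState channel) => this.openChannels.Add(channel);

    // Requests skip unhealthy channels, but nothing removes them, so the list
    // (and the memory each entry references) only ever grows.
    public LbChannelState PickHealthy() => this.openChannels.FirstOrDefault(c => c.IsHealthy);

    public int OpenChannelCount => this.openChannels.Count;
}

internal static class Program
{
    private static void Main()
    {
        var partition = new LoadBalancingPartition();
        for (int i = 0; i < 100; i++)
        {
            var channel = new LbChannelState();
            partition.Open(channel);
            channel.IsHealthy = false; // backend replica closed the connection
        }

        // All 100 stale channels are still rooted by the partition.
        Console.WriteLine($"openChannels entries: {partition.OpenChannelCount}");
    }
}
```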

Proposed Solution:

There are a few changes proposed to fix this scenario. These are discussed briefly below:

  • During the replica validation phase, in OpenConnectionAsync(), proactively remove all the Unhealthy connections from openChannels within the LoadBalancingPartition. This guarantees that any unhealthy LbChannelState will be removed from the LoadBalancingPartition, freeing up the additional memory. A rough sketch of this cleanup is shown below.
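
The following C# sketch illustrates only the cleanup idea, using the same simplified, hypothetical types as the earlier snippet plus an assumed Dispose() on LbChannelState. The real OpenConnectionAsync() in Cosmos.Direct has a different signature, locking, and disposal path; the point here is the removal pass over unhealthy entries in openChannels.

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-ins; the Dispose() hook is an assumption used to show
// where the underlying Dispatcher/Channel would be released.
internal sealed class LbChannelState : IDisposable
{
    public bool IsHealthy { get; set; }

    public void Dispose()
    {
        // Close the underlying Dispatcher and Channel here.
    }
}

internal sealed class LoadBalancingPartition
{
    private readonly object channelLock = new();
    private readonly List<LbChannelState> openChannels = new();

    // Invoked from the replica-validation path of OpenConnectionAsync():
    // drop and dispose every unhealthy channel so that stale LbChannelState
    // wrappers no longer accumulate in openChannels.
    public void RemoveUnhealthyChannels()
    {
        lock (this.channelLock)
        {
            this.openChannels.RemoveAll(channel =>
            {
                if (channel.IsHealthy)
                {
                    return false;
                }

                channel.Dispose();
                return true;
            });
        }
    }
}
```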

Type of change


  • Bug fix (non-breaking change which fixes an issue)

Closing issues

To automatically close an issue: closes #4467

@kundadebdatta kundadebdatta changed the title from "[Internal] Direct Package Upgrade: Refactors Code to Bump Up Cosmos.Direct Package to 3.37.3" to "Upgrade Resiliency: Fixes Code to Clean-up Unhealthy Connection and LbChannelState Object." Dec 10, 2024
@kundadebdatta kundadebdatta marked this pull request as ready for review December 11, 2024 12:40
@kundadebdatta kundadebdatta self-assigned this Dec 11, 2024
@kundadebdatta kundadebdatta added the auto-merge Enables automation to merge PRs label Dec 11, 2024
@microsoft-github-policy-service microsoft-github-policy-service bot merged commit 1c070d3 into master Dec 11, 2024
24 checks passed
@microsoft-github-policy-service microsoft-github-policy-service bot deleted the users/kundadebdatta/upgrade_direct_package_to_3_37_3 branch December 11, 2024 18:51
kundadebdatta added a commit that referenced this pull request Dec 23, 2024
…bChannelState Object. (#4928)
