PartitionSupervisorCore.RunAsync leaks System.Threading.CancellationTokenSource+Linked1CancellationTokenSource #4208
Comments
@jahmai-ca Thanks for the report. I'm trying to follow the path you described, but I'm having a hard time.
One clarification: it is not a loop. Once the split is handled, the lease is released. The current or another ChangeFeedProcessor instance can Acquire it and start processing. Getting continuous …
Not sure we're looking at the same code here...
It's not readonly (the processorCancellation field): Line 22 in 07aa28e
It can't be, because it is assigned outside the constructor, by RunAsync: Lines 32 to 36 in 07aa28e
It also throws it up to the caller: Lines 57 to 61 in 07aa28e
The caller (Lines 149 to 158 in 07aa28e) handles that exception, and Lines 172 to 182 in 07aa28e together with Line 93 in 07aa28e bring it back around.
I might be reading it wrong, but it seems to me that there is definitely a code loop there. Not sure what precise conditions are required for it to go around and around, though.
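Roughly, the path I have in mind looks like this. This is a simplified sketch, not the actual SDK source: the type and member names follow the thread (PartitionSupervisorCore / PartitionControllerCore / processorCancellation), but the bodies are reduced to just the parts relevant to the leak.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class FeedRangeGoneException : Exception { }

class PartitionSupervisorSketch : IDisposable
{
    private readonly CancellationTokenSource shutdownCts = new CancellationTokenSource();
    private CancellationTokenSource processorCancellation;   // not readonly; assigned in RunAsync

    public Task RunAsync()
    {
        // Every call allocates a fresh linked CTS; only Dispose ever releases it.
        this.processorCancellation = CancellationTokenSource.CreateLinkedTokenSource(this.shutdownCts.Token);

        // ... start the processor/renewer; a partition split surfaces as a
        // FeedRangeGoneException, which is rethrown to the caller ...
        throw new FeedRangeGoneException();
    }

    public void Dispose()
    {
        this.processorCancellation?.Dispose();
        this.shutdownCts.Dispose();
    }
}

class PartitionControllerSketch
{
    // ProcessPartitionAsync -> HandlePartitionGoneAsync -> new supervisor -> ProcessPartitionAsync -> ...
    public async Task ProcessPartitionAsync(PartitionSupervisorSketch supervisor)
    {
        try
        {
            await supervisor.RunAsync();
        }
        catch (FeedRangeGoneException)
        {
            await this.HandlePartitionGoneAsync();
        }
    }

    private Task HandlePartitionGoneAsync()
    {
        // Splits the lease and, for each child lease, creates another supervisor and calls
        // ProcessPartitionAsync again, closing the loop. Nothing on this path disposes the
        // previous supervisor or its processorCancellation.
        return Task.CompletedTask;
    }
}
```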
@jahmai-ca You are right about the readonly. The loop, however, only creates new PartitionSupervisors, right? Let's assume there is 1 lease in the beginning.
So yes, there is a loop, but all the loop does is create new Supervisors; the old Supervisor is not used anymore and would eventually be Disposed. RunAsync is not called on the same Supervisor instance again. This is what I mean when I say I don't see a leak. The Controller instance always remains the same; it is the Controller that spawns Supervisors and lets them run independently. The only scenario where the CTs might leak is if the GC is not collecting/disposing the unreferenced Supervisors that are no longer used?
My guess is: only if the GC is not Disposing the PartitionSupervisor instances after they go out of scope. CFP has been around for 6+ years and this is the same code as in the V2 version: https://github.com/Azure/azure-documentdb-changefeedprocessor-dotnet/blob/master/src/DocumentDB.ChangeFeedProcessor/PartitionManagement/PartitionController.cs We have never had any reports of this behavior, even on containers with thousands of partitions (hence thousands of supervisors). That's why this is kind of weird. Like I said, we are not explicitly calling Dispose (which I guess is something we could do), but the normal behavior is that Dispose on the PartitionSupervisor is called by the GC. Unless, in your case, the GC is not running?
GC is definitely running. If it wasn't, our memory dump would contain many other allocations that our app makes. The GC isn't actually responsible for calling Dispose, though. It only calls Finalize (which may in turn call Dispose), and I don't see any finalizers defined for the types in question? We've been using the Cosmos DB SDK for years too. On reflection, this may have happened a few times before, but it is exceedingly rare. Normally we'd just restart the affected services and the problem would go away for a long time. On this occasion I took the time to capture a memory dump.
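For what it's worth, a linked CancellationTokenSource also registers itself with the token(s) it was created from, and that registration is only removed when the linked source is disposed, so a long-lived, never-cancelled parent token will keep otherwise-unreachable linked sources alive no matter how often the GC runs. A minimal, standalone illustration of that (nothing Cosmos-specific, just plain .NET):

```csharp
using System;
using System.Threading;

class LinkedCtsRetentionDemo
{
    static void Main()
    {
        // Long-lived "shutdown" token that is never cancelled, similar in spirit to the
        // supervisor's shutdown token.
        using var shutdownCts = new CancellationTokenSource();

        for (int i = 0; i < 1_000_000; i++)
        {
            // Never disposed: each linked source registers a callback on shutdownCts.Token,
            // so it stays reachable from shutdownCts even after we drop our reference to it.
            CancellationTokenSource.CreateLinkedTokenSource(shutdownCts.Token);
        }

        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();

        // A heap snapshot taken at this point should still show on the order of a million
        // CancellationTokenSource+Linked*CancellationTokenSource instances.
        Console.WriteLine("Done; inspect a memory dump here.");
    }
}
```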
What we can do is explicitly call Dispose on the Supervisor; that should at least take care of disposing the CTs no longer in use.
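Roughly the shape of that change (illustrative only, not the linked PR; ISupervisor here is just a stand-in for the SDK's PartitionSupervisor, which already exposes Dispose):

```csharp
using System;
using System.Threading.Tasks;

// Stand-in for the SDK's PartitionSupervisor.
interface ISupervisor : IDisposable
{
    Task RunAsync();
}

static class ControllerSketch
{
    public static async Task ProcessPartitionAsync(ISupervisor supervisor)
    {
        try
        {
            // existing body, including the FeedRangeGoneException handling, stays as-is
            await supervisor.RunAsync().ConfigureAwait(false);
        }
        finally
        {
            // new: dispose the supervisor (and with it the linked CancellationTokenSource
            // created in RunAsync) deterministically, instead of relying on the GC
            supervisor.Dispose();
        }
    }
}
```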
See linked PR.
Describe the bug

Instances of System.Threading.CancellationTokenSource+Linked1CancellationTokenSource are leaked by the PartitionSupervisorCore.RunAsync method via calls to CancellationTokenSource.CreateLinkedTokenSource, which are stored in the processorCancellation member.

The caller, PartitionControllerCore.ProcessPartitionAsync, has a catch clause that handles FeedRangeGoneException and calls PartitionControllerCore.HandlePartitionGoneAsync, which eventually loops back around to PartitionControllerCore.ProcessPartitionAsync again, which calls PartitionSupervisorCore.RunAsync, which in turn allocates another CancellationTokenSource without cleaning up the previous instance.

To Reproduce
I don't know how our application got into this state, but by the time we captured a dump there were 6.7 million instances of System.Threading.CancellationTokenSource+Linked1CancellationTokenSource allocated along the call stack mentioned above. I suspect the leak was a side effect of some kind of infinite loop caused by FeedRangeGoneException being thrown over and over again, but I have no idea why this happened or how to reproduce it.

Expected behavior
The SDK doesn't leak memory.
Actual behavior
The SDK leaks System.Threading.CancellationTokenSource+Linked1CancellationTokenSource instances in the extreme.

Environment summary
SDK Version: 3.32 (I know this isn't the latest, but there is no change to memory management that I can see in the most recent code)
OS: Windows Server 2022
.NET SDK: 7.0.304