[FIX] Dispose StatelessWorkerGrainContext when it contains no more workers#9636
[FIX] Dispose StatelessWorkerGrainContext when it contains no more workers#9636ReubenBond merged 3 commits intodotnet:mainfrom
StatelessWorkerGrainContext when it contains no more workers#9636Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR fixes a memory leak issue where StatelessWorkerGrainContext instances were not being properly disposed due to a timer keeping a reference to the context. The fix ensures disposal by explicitly enqueuing a DisposeAsync command when the last worker is destroyed.
Key Changes
- Adds explicit disposal command when worker count reaches zero
- Prevents memory leaks by ensuring proper cleanup of StatelessWorkerGrainContext instances
| var completion = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously); | ||
| EnqueueWorkItem(WorkItemType.DisposeAsync, new DisposeAsyncWorkItemState(completion)); |
There was a problem hiding this comment.
The completion variable is created but never used or awaited. Consider either using this TaskCompletionSource to wait for disposal completion or removing it if not needed.
| var completion = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously); | |
| EnqueueWorkItem(WorkItemType.DisposeAsync, new DisposeAsyncWorkItemState(completion)); | |
| EnqueueWorkItem(WorkItemType.DisposeAsync, new DisposeAsyncWorkItemState(null)); |
There was a problem hiding this comment.
Neither is correct, for one the DisposeAsyncWorkItemState expects a non-null TCS, and awaiting the tcs.Task would deadlock the loop. Eitherway I hooked a continuation to at least log any potential errors that might occur in the dispose internal method.
gfix
left a comment
There was a problem hiding this comment.
This looks fine to me! I am, however, not familiar enough with how activation/deactivation works to understand why DisposeAsync is not called when the SW grain is deactivated?
| catch | ||
| if (_inspectionTimer != null) | ||
| { | ||
| await _inspectionTimer.DisposeAsync().AsTask() |
There was a problem hiding this comment.
why is the change here needed? what is the benefit of AsTask().ConfigureAwait(...) against former DisposeAsync() -> only to suppress exceptions?
There was a problem hiding this comment.
only to suppress exceptions
Yes, this is an "just in case" thing so the rest of the method continues
| @@ -172,9 +172,20 @@ private async Task RunMessageLoop() | |||
|
|
|||
| if (_workers.Count == 0) | |||
There was a problem hiding this comment.
Can there be a scenario where _workers.Count reached 0, but in the queue there is an item to process WorkItemType.Message which will then create a new worker? In that case a new worker will be created right after.
There was a problem hiding this comment.
very unlikely since the destruction of the workers comes after the fact of a grain disposing (which means it wont receive more messages), but possible! If that happens, the dispose command will be processed after that message.
There was a problem hiding this comment.
The alternative would be to stop the disposal if a worker exists within the SW context, we should not do that, we should respect the call for disposal from outside!
|
And a dumb question: there is a Can u please elaborate on when a context should be disposed in a normal scenario? |
That is correct, there is one SW Context per grain GrainAddress (techically per GrainId-SiloAddress, but does not matter here). |
|
One thing we could do is to not dispose the SW context if the strategy does NOT specify idle worker removal so it would be aligned fully with previous behavior in all fronts |
…d be performed or not
did this! |
When the inspection timer is created, the SW context itself is passed as the
stateargument. When all workers are destroyed the reference tothisfrom the timer keeps the SW context itself alive from a GC perspective, so dispose wont get called.This PR enqueues a
DisposeAsynccommand when there are no more workers, as we can consider the SW to be destroyed as well.fix #9634
Microsoft Reviewers: Open in CodeFlow