[API Proposal]: NoGC callback #66039
Tagging subscribers to this area: @dotnet/gc

Issue Details

### Background and motivation

When people use the `TryStartNoGCRegion` method, they are generally trying to prevent a GC from happening. However, this only holds as long as allocation stays within the `totalSize` specified upfront. Ensuring the application does not allocate more than `totalSize` is difficult, if not impossible. Right now, when `totalSize` is exceeded, the GC has to perform a collection. This is okay for some applications, but not for others.

### API Proposal

Customers are asking for a callback when that happens, so that they can decide what should be done.
To do that, we can add a `NoGCRegionMode` enum that callers pass to `TryStartNoGCRegion`:

```c#
/**
 * This mode specifies what the GC should do when we run out of the totalBytes
 * specified for the NoGCRegion.
 */
enum NoGCRegionMode
{
    /**
     * When we run out of the totalBytes allocated earlier, we will perform a GC.
     * This is the default behavior.
     */
    GC = 0,
    /**
     * When we run out of the totalBytes allocated earlier, we will try to allocate
     * more memory from the operating system and let the application run without
     * further GC.
     *
     * This is not always feasible: with segments, if we run out of the segment
     * size, we will fail; with regions, if we run out of physical memory, we
     * will fail as well.
     */
    ContinueAllocating = 1,
    /**
     * This means we will let the application fail fast by throwing an OutOfMemoryException
     */
    Throws = 2,
}
```
### API Usage
```c#
GC.TryStartNoGCRegion(12345678, NoGCRegionMode.ContinueAllocating);
```

### Alternative Designs

Since the user asked for a callback, it would be natural for us to provide a managed callback when that happens. We decided not to do that because it probably won't do what the user wants. Most likely, the user would like to trim some cache at that point to give the application space to continue running without incurring a GC. We have considered several implementation possibilities (e.g. serving a callback on allocation, throwing an OOM and catching it, letting the user trim caches, etc.), and none of them would achieve what the user wants. That is why we decided against the callback and offer the enum instead.

### Risks

No response
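As a concrete illustration of the original proposal, here is a hedged sketch of how a caller might use the proposed overload. Note that the two-argument `NoGCRegionMode` overload is the proposal above, not a shipped API; the try/finally shape and the `GCSettings.LatencyMode` check are the usual pattern with the existing single-argument overload.

```c#
using System;
using System.Runtime;

class Program
{
    static void Main()
    {
        // Proposed overload (hypothetical): keep allocating from OS memory
        // instead of collecting when the 12345678-byte budget is exhausted.
        if (GC.TryStartNoGCRegion(12345678, NoGCRegionMode.ContinueAllocating))
        {
            try
            {
                // ... latency-sensitive work that allocates within the budget ...
            }
            finally
            {
                // EndNoGCRegion throws if a GC already ended the region,
                // so check the latency mode first.
                if (GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
                    GC.EndNoGCRegion();
            }
        }
    }
}
```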
@lucabol, this is meant to address #11733.

Yep, we have three requests now for basically the same thing. Let's hope it gets in.

This is the API proposal from the GC team :) Please do let us know if the proposed API addresses your concerns. We'll give it some time for discussion and then take it to the API review.
Sorry @cshung @Maoni0, I didn't realize this was the proposal :-) This is a good proposal. Can you give more details on why exactly the callback wouldn't work? As it is the more general solution, it would be good to know why not. BTW: could the user catch the `OutOfMemoryException`?
Let's consider case 1, where we attempt to save the current execution by allocating more memory. Remember, we discovered this situation because we have already run out of preallocated memory, so we really cannot run more code before actually allocating more. Even if we do allocate more just to serve the callback, and assuming that amount of memory fits the callback's needs, where would the callback be called? On the current thread? It would have a weird stack, with an allocation suddenly calling another method. On a thread pool thread? Then what do we do with the current thread right now? It (and any other thread that might allocate) cannot proceed. And what if we run out of memory again during the execution of the callback? This turns into a lot of technical issues that we simply don't have answers to.

Now let's consider case 2, where we give up the current execution and throw an OOM on the spot. One could catch the OOM, but what can that exception handler do? First of all, it cannot allocate more memory (it would fail right away again), which excludes a lot of useful things a user would like to do. One might be able to null out certain references (e.g. global caches) in an attempt to free up memory, but even in that case, that memory won't be usable until another GC kicks in. Last but not least, an OOM can be thrown virtually anywhere. After you catch the OOM, are you sure you can recover from every possible partial state the process was left in before the OOM happened? That would be a challenge for many people.

It is much easier to hard-code these options and let the native runtime do it. Much of the limitation above comes from the fact that handling the situation cannot involve allocating more managed memory, and this can easily be side-stepped by doing it in the native runtime. Of course, we lose generality, but do we really need it? Let's flip the question around: if you had the callback, what would you do with it, given that you know all these limitations?
In the most common case, I think, we can consider running out of `totalSize` ... I think that is solved by catching the OOM exception. That was the original intent of the callback, so I think I am good with it.
I think this is good, but I think we should clarify what `ContinueAllocating` does. And what if the allocation failed? GC or OOM? Or we can let it be 2 enum values.
JITing of methods, which is going to happen as a side effect of running the exception handlers, would likely cause some managed allocations too. For example, the JIT allocates string literals on the GC heap. These allocations would keep failing too; I don't think the `Throws` mode can work reliably.

If we want to allow the application to react to running out of the no GC region, I think a callback that is delivered on a finalizer or threadpool thread, in combination with allowing more allocations to succeed, is the best option.
I was thinking about ... We don't have a ... @WenceyWang, what does ...

(*) Exactly how much we ask the operating system for to serve the request is an unimportant implementation detail. All we need is to make sure the allocation succeeds, and therefore it is not necessary for us to expose that implementation detail.

(**) Without actually getting our hands on the implementation of this proposal, I am not sure exactly what that limit means. In the case of segments (which is the current mode right now), we cannot extend Gen0 beyond a single segment, so the limit would be the segment size.
I think there's only a little difference for me between the two modes. If we use ... If we have ... I think it's good enough for me to see a ...

P.S. I am reading https://devblogs.microsoft.com/dotnet/put-a-dpad-on-that-gc/ now.
While there is nothing stopping you from catching an OOM exception, the recommendation is that you don't. If you wanted to display UI, you would need memory, and you don't have any, so you would hit another OOM. It just won't do what you wanted; you might as well let the process crash so that the system handles it for you (e.g. talks to Watson on Windows, creates a coredump on Linux, ...).
Exactly. The problem is that code out there has finally blocks and other exception-handling processing for cleanup, etc. This code is also likely going to try to allocate memory somewhere; that allocation is going to fail again, and the cycle will repeat. It is not unusual for these situations to end up as a stack overflow or some other hard-to-diagnose situation.

The write-up above says "This means we will let the application fail fast by throwing an OutOfMemoryException". Fail fast (as in Environment.FailFast) is not the same as throwing OutOfMemoryException. Fail fast terminates the process without giving it any chance to react; OutOfMemoryException is a regular exception that can be handled. Could you please clarify which one of these two behaviors you plan to implement?
It's kinda common in a game to load these fail-safe assets into memory while on the loading screen. Displaying an error UI just means changing a bool flag somewhere, and the rendering thread will do its job of changing what to draw to the swapchain; this is done by unmanaged code, and I don't think it needs memory allocation. What we'll do in managed code is just catch the OOM, change a flag in a pinned object, call GC.EndNoGCRegion, and the control flow will continue to the next round of the main loop; then we can check a fast-resume checkpoint or show the menu, just like when the game process starts.

I am sorry, as I didn't realize that the ... If only FailFast() is possible, we will have to start a watchdog process for this, and it will be much more painful, as all these unmanaged assets will also need to be created again.
At the interface, ... Right now, in this particular case, the VM checks for that ... Suppose the VM can check whether it is in a NoGCRegion and throw a different exception instead. Then, without violating our usual advice, that exception can be caught. Also, suppose we terminate the NoGCRegion before returning ...

This is very optimistic; the true challenge is whether or not the VM is ready everywhere it allocates from the GC heap, which I am not sure about.
It is not just the VM; the problem affects all .NET libraries out there. The problem with injecting OutOfMemoryException in random places is that it has a good chance of leaving the process in an irrecoverable state, or the exception will be swallowed and your catch handler won't get notified. A simple example is when this happens inside a static constructor: the static constructor is going to be marked as failed, and it won't ever be rerun.
If you are going to automatically turn off the NoGCRegion anyway, I think it is better to let the allocation succeed and deliver the notification about running out of the no GC region budget as a callback. The callback can be passed in as an argument for `TryStartNoGCRegion`.
But we do have a mode of letting the allocation succeed: it is the `ContinueAllocating` mode.
OK. As I imagine it, there are 4 modes: ...

And a callback with event info about the status of ContinueAllocating, which happens ...

And I am also happy to see the current proposal catch the merge window if there's no time.
The second case is still problematic; in essence, we cannot do a callback when we really don't have memory. Let me make sure we understand by going through a simple example. To keep things simple, let's use tiny numbers. Let's say right now we have 100 bytes of memory. You have used 20 bytes already, and you just started a NoGC region with 10 bytes.

So you allocate object #1, succeed (remaining 7 bytes), and likewise objects #2 and #3, leaving 1 byte remaining. The default behavior today is that your 4th allocation will automatically leave the NoGCRegion and perform a GC, and this proposal is about providing alternatives to that.

If you chose ContinueAllocating, the GC will ask the operating system for more memory, let's say 15 bytes. This number is implementation-dependent, but now your allocation #4 will succeed. (*)

So you allocate object #4, succeed (remaining 13 bytes). At this point, we will continue to ask the operating system for even more memory, until eventually you reach the system's limit.

So you allocate object #n, succeed (remaining 1 byte). At this point, the GC can no longer allocate, and it will fail the allocation with an OOM. (**)

We advised you not to, but if you catch the OOM, the OOM handler may continue allocating, leading to allocating object #n+2, and you get an OOM again, and the program still crashes, because now you are crashing within your catch. Firing a callback at (**) time has exactly the same effect: your callback will allocate memory, and worse, the OOM in the callback will cause another callback, and you eventually run out of stack too. Another possibility is that we do a GC anyway at (**) time; doing a GC is probably better than a crash.

Our suggestion is that we offer you a callback at (*). The idea is, you still have 70 bytes at that point, more than sufficient for you to gracefully terminate the game, show your goodbye screen, do your logging, whatever your game needs. To make that work, your NoGCRegion setting must leave some room to maneuver.

Let's say your callback needs 10 bytes. If you set your NoGCRegion to preallocate 75 bytes, that leaves only 5 bytes remaining for your callback; then your callback will fail with an OutOfMemoryException, and that is no longer recoverable. If you wanted to do a GC instead, you can simply terminate the NoGCRegion in the callback; this will end the NoGCRegion and perform a GC on the spot. That would cover the case for ContinueAllocatingThenGC, so there is no need for a separate mode for that. If you chose Throws, then the #4 allocation would fail, and the application would crash.
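To make the arithmetic in this walkthrough easy to check, here is a small sketch in plain C# (no real GC involvement; the numbers and the (*) callback point mirror the example above):

```c#
using System;

class BudgetModel
{
    static void Main()
    {
        long system = 100;        // total memory in the example
        long used = 20;           // allocated before the region starts
        long noGcBudget = 10;     // TryStartNoGCRegion budget
        long remaining = noGcBudget;

        // Three 3-byte objects fit in the budget: 10 -> 7 -> 4 -> 1.
        for (int i = 1; i <= 3; i++) remaining -= 3;
        Console.WriteLine(remaining); // 1 byte left; the 4th 3-byte object won't fit

        // At (*): the budget is exhausted, the OS grant keeps allocation going,
        // and a callback fired here still has system - used - noGcBudget bytes
        // of headroom to work with.
        if (remaining < 3)
            Console.WriteLine(system - used - noGcBudget); // 70 bytes of headroom
    }
}
```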
I think we should automatically terminate the NoGCRegion before issuing the out-of-space callback. The whole point of NoGCRegion is to maintain certain SLAs. The moment you run out of space, the SLAs are not going to be maintained anyway, so trying to artificially stay in the NoGCRegion just makes it difficult to recover from the situation.
But that is something we already do: the default behavior is to terminate the NoGCRegion once the out-of-space situation happens.

OK, this is fair enough for me to have a callback while still in the NoGCRegion.

Agree. I am arguing that it is the right behavior and we should continue doing that even for the callback. I think managed callbacks that run with the NoGCRegion still in place would be fragile and unreliable.
... and ... are contradictory ...
I think the idea of both @jkotas and me is that if we have both ..., the GC will continue to try to allocate more after the original `totalSize` is exhausted.
The discussion in this thread is enlightening. I think I have a much better understanding of the problem space, and therefore I revised the proposal (in the very first comment of this thread). Please take a look and see if the new version addresses all the concerns.
Which thread is the callback invoked on in the proposed design?

The finalizer thread.

What is going to happen if the app exhausts the reserve before the callback gets delivered on the finalizer thread?
In normal cases, that should not happen: no GC occurred earlier, so the finalizer should have nothing to do. But it may happen, for example, if some buggy finalizer code deadlocked the finalizer thread, or for some other reason I am not aware of. In any case (i.e. whether or not the callback was delivered, and whether or not it completed), when all the reserves are exhausted (i.e. the larger limit), we will do a GC and exit the NoGCRegion automatically, just as it was without this change.
If the app is allocating at a high rate, it can easily eat the reserve just in the time window it takes the thread scheduler to invoke the callback on the finalizer thread.

Would it make sense to decouple this callback from the no GC regions? The callback is basically about providing a notification once a certain amount of memory has been allocated. It can be completely independent of the no-GC regions.
@jkotas, I am wondering if we have a misunderstanding about the scenario and the feature. As I understand it, the NoGCRegion feature is meant for games that have already preallocated most of their resources and would like the gameplay to be smooth. So my expectation is that the callback is meant for edge cases, and allocating at a high rate during a NoGCRegion is something that should not happen. That was what I meant by "in normal cases" above. Do you have an alternative scenario in mind? I wonder if you could share it. It is hard to think about decoupling the callback from NoGCRegion without knowing what problem that solves. From a caller's perspective, it might become harder to use, and from an implementation's perspective, it incurs the cost of checking whether or not we need to issue a callback for every allocation that reaches the slow path.
Yes, we use scenarios to motivate APIs. Once we arrive at a set of APIs that addresses the scenario, we look at what these APIs actually guarantee and whether the design can be simplified or generalized while providing the same guarantees.

The callback in the proposed API is only guaranteed to be delivered after a certain amount of memory gets allocated. There is no guarantee that the callback gets delivered before the no-GC region expires. That suggests it is unnecessary to couple the callback with no GC regions. I am sure somebody would find uses for a generalized callback like that, unrelated to the original motivating scenario.
We have 4 overloads of the TryStartNoGCRegion method already. Do we need to add callback overloads for each of them? It gets complex to pick the right overload to use from so many different ones.
Yes, this check would be on the slow path through the allocator. I do not see it as a problem.
I questioned the scenario mainly because lifting the restriction to NoGCRegion may make the design and implementation more complicated. Here are some design-level questions that we won't have in the NoGCRegion case.
Without a scenario, these questions have no good answers. And I agree, the number of overloads is concerning.
This new change is good enough for me. But about API design, I think I can say a little, as there might be a huge gap between you framework developers and the game developers. I talked with some of my indie game developer friends several days ago. They are mainly designers, not CS majors. When I told them the GC might cause pauses, they asked me: how do we get rid of this thing? They have no idea about GC at all. When I told them about GC.TryStartNoGCRegion, they asked me what to put as the argument. So I think the shape of the whole NoGCRegion API that would be easy for them is something simple, like
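A hypothetical sketch of that simple shape (the callback-only overload and the `onAboutToCollect` parameter below are invented for illustration; no such API exists in .NET):

```c#
// Invented, illustration-only shape: no sizes to estimate. The runtime
// decides everything and invokes the callback once, when it can no longer
// avoid a collection.
GC.TryStartNoGCRegion(onAboutToCollect: () =>
{
    // e.g. flip a flag that preloaded/unmanaged rendering code reacts to
    gameState.ShowLowMemoryScreen = true;
});
```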
The GC will try to do everything they don't know about to keep the NoGCRegion, and if the GC thinks there's no way to preserve it, the callback is called, once. After the callback is invoked, no matter whether the callback has actually executed or returned, a GC can happen if any new allocation is needed. Also, I think this is the idea of the ... And still, thank you for the enhancement.
This looks very close to what the existing ...

```c#
namespace System;

public partial class GC
{
    public static void RegisterNoGCRegionCallback(long totalSize, Action callback);
}
```
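A hedged sketch of how a game might combine this callback with the existing region APIs. The budget/reserve split and the flag-based shutdown follow the discussion above; the assumption (my reading of the proposal) is that the callback threshold must fit inside the region's budget, so the callback fires while reserve memory is still available:

```c#
using System;
using System.Runtime;

class Game
{
    // Flipped by the callback; polled by the main loop. The error UI itself
    // uses preloaded/unmanaged assets, so no managed allocation is needed.
    static volatile bool lowMemory;

    static void Main()
    {
        const long budget  = 48 * 1024 * 1024; // expected gameplay allocations
        const long reserve = 16 * 1024 * 1024; // headroom for a graceful exit

        // Reserve budget + reserve up front; the callback fires once
        // allocations in the region pass `budget`, leaving `reserve` bytes
        // of room to maneuver.
        if (GC.TryStartNoGCRegion(budget + reserve))
        {
            GC.RegisterNoGCRegionCallback(budget, () => lowMemory = true);
        }

        while (!lowMemory)
        {
            // ... run one frame; allocations come from the preallocated region ...
            break; // placeholder so this sketch terminates
        }

        // End the region if a GC hasn't already ended it for us.
        if (GCSettings.LatencyMode == GCLatencyMode.NoGCRegion)
            GC.EndNoGCRegion();
    }
}
```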
Just wondering what the status of this one is. I opened the first issue 4 years ago; there was a long discussion in March/April of last year, culminating in an API proposal that was approved. It looked promising at that point. But then it was put into the Future milestone, which looks less promising :-). So, what is the status? Do we have an agreement on the API and are just waiting for it to be prioritized for development, or have we encountered unforeseen conceptual issues?

@cshung is actually currently working on this. I think he should have something you can try soon (if you want to try a daily 8.0 build).

Great! Happy to try it out. Let me know when/how.
### Background and motivation

When people use the `TryStartNoGCRegion` method, they are generally trying to prevent a GC from happening. However, this only holds as long as allocation stays within the `totalSize` specified upfront. Ensuring the application does not allocate more than `totalSize` is difficult, if not impossible. Right now, when `totalSize` is exceeded, the GC has to perform a collection. This is okay for some applications, but not for others.

### API Proposal

Customers are asking for a callback when that happens, so that they can decide what should be done.
We had some discussions (see below) to explore the actual use case and the implementation constraints, and we discovered a key conflict between them:

- From the requirements perspective, we would like a callback when memory is exhausted so that the game can terminate gracefully without a GC, but
- From the implementation perspective, we cannot serve a callback that could potentially allocate more memory without a GC when we have already exhausted our preallocated memory.

To resolve this conflict, we noticed that exactly exhausting the memory is not necessary from the requirements side. As long as a callback is issued early enough that the game can terminate gracefully, that is good enough. We also noticed that nobody needs the process to be terminated right away. Therefore, I revised the proposal as follows:
### API Usage
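A sketch of the intended usage, pairing the two limits described under Risks below (the exact call shape and which call carries which limit are my assumptions based on this proposal, not a confirmed API):

```c#
// Reserve up to the larger limit for the region; the callback fires when the
// smaller budget is exhausted, while reserve memory is still available.
GC.TryStartNoGCRegion(64 * 1024 * 1024);               // larger limit
GC.RegisterNoGCRegionCallback(
    48 * 1024 * 1024,                                  // smaller callback threshold
    () => requestShutdown = true);                     // simple flag flip only
```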
### Alternative Designs

One alternative design is to expose an enum for what to do when the NoGCRegion is exhausted. The options were to terminate the NoGCRegion, to commit more memory, or to fail fast. This was decided to be not good enough, because what customers really want is customizable logic to show some screen and end the game gracefully, and none of those options allows them to do that safely.

Another alternative design is to add parameters to TryStartNoGCRegion. That would work, but there are two reasons why we wanted to have separate APIs for starting the NoGCRegion and registering the callback ...
### Risks

By asking the caller to provide two limits, the caller must estimate how much memory is necessary for serving the callback, which circles back to the initial problem that estimating memory usage is not easy. The key idea here is that the callback is supposed to be something simple; as @WenceyWang suggested below, it will simply switch a flag and use some preallocated resources. That hopefully eases the problem.

Implementation-wise, we will reserve up to the larger limit when beginning the NoGCRegion; therefore, there is no risk in issuing the callback without a GC. When we reach the larger limit, we will terminate the NoGCRegion regardless. That way we can maintain our implementation reliability.