Allocate string literals on frozen segments #49576
Conversation
I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.
Yes. From the ECMA spec:
It is a GC internal implementation detail. We are asking the GC to allocate a pinned object. It is up to the GC to do it as best as it can.
src/coreclr/vm/jitinterface.cpp
@@ -11798,7 +11798,17 @@ InfoAccessType CEEJitInfo::constructStringLiteral(CORINFO_MODULE_HANDLE scopeHnd
}
else
{
*ppValue = (LPVOID)ConstructStringLiteral(scopeHnd, metaTok); // throws
if (!GetModule(scopeHnd)->IsCollectible() && !IsCompilingForNGen() && !IsReadyToRunCompilation())
Suggested change:
if (!GetModule(scopeHnd)->IsCollectible() && !IsCompilingForNGen() && !IsReadyToRunCompilation())
if (!GetModule(scopeHnd)->IsCollectible())
This method is never called for those cases.
It may be a good idea to do a similar change for R2R. The change for that would be in crossgen2.
I am wondering if it makes sense for R2R. Since R2R code cannot access the string literal directly anyway, we probably cannot save anything in terms of code size.
We can save one indirection (ie one instruction) for R2R.
src/coreclr/vm/stringliteralmap.cpp
@@ -489,7 +489,7 @@ StringLiteralEntry *GlobalStringLiteralMap::AddStringLiteral(EEStringData *pStringData
{
PinnedHeapHandleBlockHolder pStrObj(&m_PinnedHeapHandleTable,1);
// Create the COM+ string object.
STRINGREF strObj = AllocateStringObject(pStringData);
STRINGREF strObj = AllocateStringObject(pStringData, true); // TODO: don't use POH for collectible assemblies?
Is it possible that by frequently loading and unloading collectible assemblies, these string literals turn into lots of small holes in the POH that we effectively cannot reuse? If so, I think it is better for us not to allocate them on the POH.
Is there a scenario where we can demonstrate how well this optimization works compared to what it was?
Will post a jit-diff. Initially I wanted to optimize
jit-diff is quite promising (--pmi -f):
It not only removes indirect loads but also folds some nullchecks we don't fold now, example: sharplab.io /cc @dotnet/jit-contrib There are regressions because indirect loads become constants and the JIT intentionally prefers not to do CSE for them (see

PS: We might want to lower this weight for
I'm wondering how the improvement shown by jit-diff translates to actual perf wins. this means you could potentially be pinning way more objects than we do now. even though putting these on their own heap means grouping them together, it still means that group cannot move, which prevents GC from compacting them away to form larger spaces around them (and this will affect regions more than it affects segments). the question is if this is worth the speed increase. of course this is scenario dependent. do we know of some customer scenarios where we could try this change to demonstrate the end result, ie, including the GC perf difference?
The string literals live till the end of the process, so there won't be any fragmentation caused by these string literals alone. There is going to be fragmentation only when these very long-lived string literals are mixed with short-lived objects. The string literals have the same characteristics as the statics arrays that Andrew has moved to the pinned heap as well. Are we worried about the POH fragmentation from the statics arrays? I think we need to have a simple rule about whether it is ok to put objects that live till the end of the process on the POH. If it is ok, we should take advantage of it anywhere we see an opportunity to gain something from it, no matter how small the gain is.
Would it help if we can tell the GC that the object being allocated is going to live till the end of the process?
the object Andrew moved to POH is just the spine. this is pinning both the spine and the objects the spine points to, no? the concern I have is about whether we need to allocate more of these as the process runs. if that's never the case then it's fine. if we don't allow collectible assemblies, is this purely during startup?
@Maoni0 Recently I needed to optimize some pretty heavy type conversion routines. If you can improve the Type as mentioned by @EgorBo, we have a very performance-sensitive scenario to try it.
Yes, this is pinning objects that the spine points to. The string literals are typically going to be allocated at the same time we are allocating new spines.
Yes, the string literals are allocated during warm up. Once the process is warmed up and there is no new code being JITed anymore, no new string literals are going to be allocated. The full warm up can take a very long time in large apps, even hours, until all codepaths through the app get exercised.
@jkotas no new jitting can occur as the process runs? even after that potentially long warmup stage?
Right, JITing can still happen even after warm up:
My point is that there are real world scenarios that warm up for a very long time, or that never fully warm up in practice. For example, some of the Bing services with a lot of code. There is a new version getting deployed before the process gets fully warmed up.
For my edification, what is the spine? |
Pinned

So this change generally avoids storing string literals in the pinned
As implemented right now, it is unconditionally enabled everywhere. We can change it to only kick in for non-unloadable code (that includes non-default non-collectible ALCs). I would like to understand better how much we should be worried about POH fragmentation. The POH allocations will typically have long or infinite lifetime. That's fine on its own. But we are likely going to have a problem if there are short-lived allocations mixed in. That makes me think - should our guidance be to avoid short-lived allocations on POH?
@ww898, if I am reading the code correctly,
I'm not aware of any specific tools that rely on strings being/not-being in the frozen object heap, but agreed with @ww898 that we need to preserve the invariants that already exist for other allocations. @davmason - we need to get this on the TODO list for .NET 8. EDIT: Another one we need to consider is how this interacts with EventSource AllocationTick events for sampling. We probably want to be consistent that if we consider this to be an allocation for ICorProfiler then we also need to consider it an allocation for EventSource.
I was thinking on this further and there are more diagnostics we need to go through on this... Anywhere that we enumerate contents of the managed heap or sizes of the heap, we need to include the Frozen Object Heap. We've got a few possible conceptual designs:

Whatever choice we make here could impact a variety of APIs, tools, and docs. For comparison, look at the diagnostics work involved when adding the Pinned Object Heap: dotnet/diagnostics#1408 I was chatting with @Maoni0 about whether we need diagnostic tools to have visibility into the Frozen Object Heap as a distinct region, and it was unclear how big we expect this heap to be. If the heap is relatively small it is easier to sweep it under the rug, but the larger it grows, the more people will care about the visibility.
Option (b) makes the most sense to me. FOH is very similar to POH, except that it cannot ever be collected (once an object is allocated, it stays there until the process exits). I think it is important for the diagnostic tools to be able to tell the difference between frozen objects and other objects, so they can exclude them from analysis. Otherwise, you may see inquiries like "I have a ton of string objects in my heap that are not referenced from anywhere. Why is the GC not collecting them?".
The size of the frozen object heap is roughly proportional to the app code size in the current usage pattern. If you have an app with a lot of code and very little actual data, FOH can be a large fraction of managed memory; and vice versa. I would expect the FOH to be a low single-digit percent of the total managed memory in the typical case.
Thanks @jkotas! From #75738 you were mentioning:
My take on option (b) is that all segments in the Frozen Object Heap, including those from read-only file-mappings, would count towards the GCHeapHardLimit. Is that what you were thinking as well now or did you have another way of framing the story that lets people reason about this proposed distinction? I want to ensure that if we decide to follow option (b) we all agree what it implies. If we all agree that FOH requires explicit visibility to avoid confusion (no pushback from me) the other question is "does the perf benefit of FOH justify the full costs of supporting it?" My guess is that fully supporting option (b) entails:
I feel like I should have paid this more attention when this started because the diagnostics part of the estimate was clearly missing. so that's my bad. I can see that FOH can be beneficial for other things long term to justify the fairly large amount of diag work needed, but that requires more support than just diag, for example to construct such a heap. does the string literal part justify the cost of the diag work? that's hard to say. I have not seen much data (if I have seen any data, I have since forgotten what it looked like). so it seems like a good step to show some perf data.
String literals themselves probably don't justify the effort, but I planned to move other types of objects there, e.g.

Other candidates for FOH (or, again, POH):
PS: NativeAOT uses FOH too, don't we want to have a first-class diag for it too?
I got convinced earlier that the FOH is the right solution for frozen string literals and the other types of objects that we plan to use this for. I expect that using POH for this would be conflicting with other uses of POH. POH fragmentation issues are hard to reason about, and using POH for frozen string literals would make it even harder. I think the code quality improvements that we are getting from this change have favorable ROI compared to our other investments, even when you include the extra diagnostic costs.
For the most part, this is just stamping what was required for POH. Do we expect that we will create more types of specialized heaps like this in the future? If yes, can we do something to make it cheaper?
Thanks all!
OK, that works fine for me. I just wanted to be sure the new information around costs didn't change the decision making.
Yeah, I think of it as POH work + these two additional areas:
I'm hoping we can stop adding more specialized heaps (at least ones that need public visibility) to prevent the conceptual overhead for .NET devs doing memory analysis from growing ever larger. Many of our users appear to struggle to understand the GC space and I assume every additional specialization raises the difficulty bar further. Does anyone anticipate needing more of them?
It might change in time, but at the moment NativeAOT diagnostics is low priority. @jkotas - did you have any thoughts on the GCHeapHardLimit part? You were proposing that file-mapped segments wouldn't count towards the limit earlier and I am not sure if you are still proposing that? @EgorBo - I'm guessing more work has emerged here than you were initially planning on? I'm not sure how much you are prepared to take on yourself vs. getting more devs involved. We may want to pull in @davmason and chat about what makes sense here.
I don't have anything in mind for the foreseeable future. note that FOH is not a new feature - it existed before I even started working on the GC. because it was used so rarely the support wasn't there (including diag, creation, etc). it's just that now we are expecting to use it much more prevalently so we are paying the (long overdue) debt.
I think we need to distinguish between what the .NET memory analysis tool writers need vs. what most .NET devs need. .NET memory analysis tool writers want to understand and have access to every relevant runtime detail. It is the reason for exposing details like POH or FOH in the profiler APIs. The people writing these memory analysis tools are experts and eager to take advantage of these details to make their tools shine. I agree it is important to keep the concept count low for .NET devs, ideally to just core ones like GC generations. I do not believe that .NET devs need to know about POH or FOH. It is a good-enough approximation to think about these as part of Gen2. The corollary of this is that it makes sense to me to expose FOH in profiler APIs that are targeted at .NET memory analysis tool writers, but I would not expose them in the (default set of) profiler counters that are targeted at .NET devs.
I think about the GC hard limit as a private working set limit. Do you agree? If you agree, the file-mapped segments are not private working set so they should not count against the limit. I do not have a strong opinion about these details and I would be ok with counting all frozen heap segments against the hard limit without exceptions. If somebody has a scenario with file-mapped heap segments, they can work around this by artificially bumping the hard limit up.
Are the frozen segments read-only or copy-on-write? |
Good point. The frozen segments are not read-only in the general case, even when file mapped, so that things like
they won't be for regions. GC will not touch these segments as they are completely out of the range that GC handles (there's actually not much of a point to even register these with the GC for regions anymore). if we disallowed things from user code that would modify these objects, like lock, we could really make these read-only.
I would tweak this into three groups:
I expect groups (1) and (2) will often want to gather this information (even though it won't be relevant to all investigations). Group (3) won't care, but might get inadvertently exposed to it via docs, blogs, APIs or performance counters.
Agreed in principle at least :) We already did expose LOH and POH in the System.Runtime counter set so it would feel odd to leave out FOH only. If I got a do-over I wish I could put less GC info there and instead put it in some advanced GC area. I'm trying to work with the OpenTelemetry team to create some default dashboards that are simpler and don't expose every counter in System.Runtime by default.
I think of the limit as being measured the same way as GC heap size or GC committed bytes. If GC includes private+shared in those calculations then I'd expect the limit to be private+shared as well. This is how I was thinking of the options:
Of those options I lean towards (4) at the moment, but if there are other pros/cons I am leaving out I'm all ears. It sounded like the COW issue may have driven everyone to option (4) as well?
I think that POH (and FOH) details are only relevant in a small fraction of the investigations. I would argue that the majority of people doing (2) do not need to know what they are and can just treat them as Gen2. And if these details are relevant for an advanced investigation, you are likely going to need much more than that. I agree with you that we need to be thinking more about basic and advanced categories. Do you expect that we will remove some perf counters from the default set (e.g. what dotnet-counters displays) as part of that?
I do not think that it is odd. We allocate the string literals and runtime types in the regular heap today and nobody asked for a counter that shows how much these items take. They just get included in Gen2 size (once the app runs for a while). So we can just continue doing that and avoid teaching everybody what this new number means. We moved around some internal implementation details, but it does not mean that this move needs to leak into what everybody sees.
Sounds good.
I am avoiding removing anything from System.Runtime, but I do think we have flexibility in the dotnet-counters UI not to show everything in System.Runtime by default. My plan was to first focus on the default dashboards with OpenTelemetry, then see what we can bring back to dotnet-counters from that exercise.
I am more worried that if we start disclosing the existence of FOH elsewhere (in VS, in SOS, in docs/blogs, in memory profilers) then people will become aware of it, look at the list of counters and ask why is every heap except FOH here? How is it being accounted for? They may also do things like compare the output of GC.GetMemoryInfo().GenerationInfo[2] or SOS !DumpHeap -stat to the reported value of the GC gen2 counter and become mistrustful of the data if it doesn't match. We have some accounting mechanisms that treat FOH, POH, and LOH as part of gen2 and others that treat none of them as gen2. This would create a 3rd way of accounting, used only for performance counters, where POH and LOH are distinguished but FOH has been folded into gen2. I worry that more accounting variations == more confusion. |
Should we make a breaking change so that FOH, POH, and LOH are treated as part of gen2 everywhere? It would eliminate the current confusion about the accounting, and create a path forward for FOH without raising the concept count for everybody. |
Only if we have counters that also have the breakdown. @noahfalk and I have been discussing another specific GC category for "advanced" counters.
POH and LOH are logically part of gen2 (meaning that they are only collected during gen2 GCs) and that has always been the case. that doesn't mean we will not distinguish them - they still have different perf characteristics so for more advanced diagnosis scenarios they should be displayed/analyzed separately. |
I agree that the breakdowns are useful for advanced analysis. I am not suggesting to remove them. My suggestion has been just to fix the confusion about what "Gen2" means, and make it mean the same thing everywhere. Ie make this formula universally true:
I am skeptical that we might make some breaking changes and still not have reduced confusion that much. Even if we are very diligent to use terminology only one way going forward, we can't prevent other people and tools from referring to

I was expecting that the cat is out of the bag on this one and the best we can practically do is acknowledge that "gen2" has different contextual interpretations and make the context as clear as possible when we use it.
UPD: This PR allocates all string literals (from non-collectible assemblies and dynamic contexts) on the Frozen Object Heap - it allows us to optimize away a redundant indirection:
Was:
Now:
Inspired by @jkotas's comment #49429 (comment)
We can do the same trick for other things: typeof(), static fields, etc.