More efficient MemoryAllocator #1596
Idea 2 is interesting. I wonder, though, if you can manage discontiguous buffers, is it much of a stretch from there to go fully streamed? For the record, I also hate idea 4, but that may be necessary if you really do have to materialize large images and care at all about 32-bit hosts. The machine I ran my tests on was only a quad core as well. I wonder what would happen on a 16+ core machine... I reckon even in 64-bit you'd start paging.
Implementing the ability to have discontiguous buffers was relatively easy, since our processing code usually goes row-by-row; we just had to upgrade our internals. Currently we have a very high limit for images being split into different buffers; what I'm considering is to check what happens if we aggressively lower that limit. I'm unable to assess whether it would make things worse or better.
That's why I want to avoid materializing them :) Will open an issue for tracking "decode into target size" stuff tomorrow.
I was wondering why you are using the basic
I don't think it would make things worse. If I understand right, your current mode of processing is to apply each transform to the entire image before moving on to the next? If so, you're kind of thrashing the cache at each stage, so it wouldn't matter whether the memory is contiguous or not. I'm not sure it would make things better either, though. You might want to look at your LOH fragmentation under those high memory conditions and see whether being able to stuff a smaller buffer into an LOH hole would gain you anything. Switching to unmanaged buffers may help a bit there, but heap fragmentation can be a problem in unmanaged as well. Of course being able to aggregate small buffers in place of a single larger one would be of benefit, provided you don't need them at the same time. I assume if that works for RecyclableMemoryStream, it should work for ImageSharp 🤷♂️
I must admit I didn't put that much thought into it. But yeah, the point was to make sure there were enough threads running to check whether the libraries were capable of true linear scale-up. As long as there are free processors to hide work on (as Vips does today -- and ImageSharp did before your windowed convolution implementation), you can't be sure. If you think about a high-load web server scenario, you would assume the managed thread pool would fully occupy all processors, so I want to know how a library will perform in that environment -- and to be sure that a large number of image processing operations doesn't make for a self-DoS. BTW, I suspect the anomalous CPU measurements I got for Vips running under the profiler might have actually come from an oversubscription problem. Could be the CPU overhead of the profiler was just enough to throw the whole system into chaos and drop the speed by several multiples. I plan on doing some followup testing to see if I can nail that down.
Couple of other interesting stress cases you might want to look at.
The reason I'm thinking about "discontiguous memory by default" is not cache coherence, but the fact that it may enable more efficient pooling, reducing the GC pressure (or OS allocation traffic in case of unmanaged buffers). With large images it is very likely that the requested buffer will outgrow the pool, deferring to the GC/OS to allocate new giant arrays of non-uniform sizes, which would probably result in heap fragmentation even if we are using unmanaged memory. I'm thinking of a solution that works as follows:
@saucecontrol any thoughts on this particular plan? I wish I could pull in a GC expert to assess it ...
That plan looks sound to me, but I'm no GC expert. I do expect that inability to grab a contiguous large block of memory due to LOH (or process heap) fragmentation is at play in the 32-bit OOM scenario, but it would be good to verify that before diving into the work if possible. It seems like if you end up doing the work to completely manage the allocations, you may as well go straight to unmanaged memory, so that when you say to free something you don't have to then wait for the GC to comply and then do its LOH compaction and all that. Plus, you'll probably always have to deal with some allocation requests that will be over whatever pooling threshold you set, and you'll want those to go straight to unmanaged so they are as short-lived as possible.
The problem is that I don't really know when to free up memory. With LOH + the
@JimBobSquarePants it just allocates / returns unmanaged memory without pooling. Maybe it's good enough (or even preferable) approach with unmanaged memory, not sure. My main concern about unmanaged memory is that I would be hesitant to use it as a default, because of the users who may implicitly depend on not disposing |
Instead of holding the raw pointer, you can wrap it in a `SafeHandle`:

```csharp
internal sealed class SafeHGlobalHandle : SafeHandle
{
    private readonly int byteCount;

    public override bool IsInvalid => handle == IntPtr.Zero;

    public SafeHGlobalHandle(int size) : base(IntPtr.Zero, true)
    {
        // Allocate unmanaged memory and inform the GC about it, so that
        // collection heuristics account for the off-heap bytes.
        SetHandle(Marshal.AllocHGlobal(size));
        GC.AddMemoryPressure(byteCount = size);
    }

    protected override bool ReleaseHandle()
    {
        if (IsInvalid)
            return false;

        Marshal.FreeHGlobal(handle);
        GC.RemoveMemoryPressure(byteCount);
        handle = IntPtr.Zero;
        return true;
    }
}
```

Since `SafeHandle` guarantees the memory is released by its finalizer even if `Dispose` is never called, it might be worth implementing something like that for any allocations larger than your max pool buffer size, just because you know those will happen and will have a large GC cost. It's also a lot easier to trial than the segmented buffer idea.
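A minimal usage sketch of the handle above (the buffer size and processing step are illustrative): access goes through a `Span<byte>` over the unmanaged block, the memory is freed deterministically at the end of the `using`, and the finalizer acts as a safety net if `Dispose` is skipped.

```csharp
using System;
using System.Runtime.InteropServices;

internal static class SafeHGlobalExample
{
    public static unsafe void ProcessLargeBuffer(int byteCount)
    {
        using (var buffer = new SafeHGlobalHandle(byteCount))
        {
            // Wrap the unmanaged block in a Span<byte> for bounds-checked access.
            var span = new Span<byte>(buffer.DangerousGetHandle().ToPointer(), byteCount);
            span.Clear(); // AllocHGlobal memory is not zero-initialized

            // ... row-by-row pixel processing would go here ...
        } // freed here, without waiting for a gen2 GC or LOH compaction
    }
}
```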
FYI dotnet/runtime#52098 is tracking some possible scenarios to improve in ArrayPool/GC interaction. I think Jan's scenarios 1 and 3 (and maybe 2 as well) are applicable to ImageSharp.
There are so many things I hate about ArrayPool. All the magic ArrayPool.Shared is doing should be a runtime/GC implementation detail hidden behind standard APIs.
Hi everybody! First of all, the current state of pooling already supports discontiguous buffers; it's just hidden behind a very high limit. First it checks if a single stride can fit in a single contiguous buffer:
Then it does strange things: ImageSharp/src/ImageSharp/Memory/DiscontiguousBuffers/MemoryGroup{T}.cs Lines 99 to 107 in a8cb711
Details aside - this alloc call either throws or pools a single contiguous buffer each time. So here's the first problem: most of the time the pool is stressed by huge arrays from the buckets at the higher end of the indices while the lower indices sit empty, which makes

Edit: huge mistake in my calculations, and the hardcoded upper bound of the pool max size is hex, not dec; this still applies, but only for certain pool setups, see the post below.

```csharp
const int MinimumArrayLength = 0x10, MaximumArrayLength = 0x40000000;

if (maxArrayLength > MaximumArrayLength)
{
    maxArrayLength = MaximumArrayLength;
}
```

Which leads to unexpected LOH allocations when working with images up to ~40mb of pixel data, and that isn't very rare: a Parallel.ForEach over the bee heads set can skyrocket memory usage (I've been counting gigabytes of working set during execution) with frequent Gen2 collections.
You've got a couple of translation errors in your analysis. First, the hard-coded

But you're right about the LOH allocations. The default memory pool is limited to 24MB per array, so images over ~8 megapixels will cause allocations, and there are definitely images in the bee heads set that exceed that limit.

ImageSharp/src/ImageSharp/Memory/Allocators/ArrayPoolMemoryAllocator.CommonFactoryMethods.cs Line 15 in a8cb711
The challenge is in achieving a balance between avoiding GC allocations and holding a high steady-state memory level when large arrays get pooled.
Oh gosh, you're right, math was not on my side tonight, thanks for pointing that out. The worst part about the built-in .NET pool is that it was developed for the general use case. Exponential growth of the internal buffers would waste a lot of space in certain conditions. Empty buckets are not a big deal, but they just sit there for no reason. Although
IMO all allocations should be divided into 2 categories: pixel data, which outlives any other allocation type in the context of this library, and everything else. The main memory consumers are pixel buffers, which cause massive hits on both the GC heaps and any theoretical custom memory pool based on unmanaged memory - it'll be much harder to support large buffers while also pooling small buffers of integers/small structs. This way we can delegate those small allocations to the GC (thanks to tiers this won't really affect its performance) while concentrating on storing pixel buffers with proper alignment. Another interesting concept mentioned by @antonfirsov - async/await for memory pooling (see the sketch below). This approach could help in environments where memory is crucial, but it might be really tricky not to fall into a deadlock when there isn't enough memory for any of the awaiters.
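A minimal sketch of that async pooling concept, assuming a fixed budget of outstanding large buffers guarded by a `SemaphoreSlim` (all names are illustrative): callers asynchronously wait for a permit instead of failing the allocation, which is also exactly where the deadlock caveat applies if every awaiter holds a permit while needing another.

```csharp
using System.Threading;
using System.Threading.Tasks;

internal sealed class ThrottledAllocator
{
    private readonly SemaphoreSlim budget;

    public ThrottledAllocator(int maxOutstandingLargeBuffers)
        => this.budget = new SemaphoreSlim(maxOutstandingLargeBuffers);

    // Completes when a permit is free instead of throwing OutOfMemoryException.
    public async Task<byte[]> RentAsync(int length, CancellationToken token = default)
    {
        await this.budget.WaitAsync(token).ConfigureAwait(false);
        return new byte[length];
    }

    public void Return(byte[] buffer) => this.budget.Release();
}
```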
@br3aker reacting to #1596 (comment) first:
If any of these are true, it's a bug. You should be able to limit the maximum contiguous buffer size by using the following full constructor:
Note that the default for

I believe lowering the capacity + setting a higher bucket size should help with the biggest "pooling doesn't work in certain situations" problem. If you have a repro proving this fails on the load+resize+save path (= a contiguous buffer is allocated or an exception is thrown), can you please open a new separate issue for that?
Regarding the rest of the comments: I believe all the mentioned issues can be addressed either by a very basic unmanaged allocator or by the "smart" allocator described in #1596 (comment). None of those would use the BCL `ArrayPool`.
I wish this was the case, but it is not true: #1597. This is our top-priority memory issue now, affecting the most common load+resize+save path. Even if we manage to fully optimize it, the "fancy" (blur, sharpen etc.) processors may still keep allocating all kinds of large buffers.
Oh yeah, I haven't actually checked all the processors. What I meant is that the allocator shouldn't be bothered with small allocations like here: ImageSharp/src/ImageSharp/Formats/Png/Zlib/DeflaterHuffman.cs Lines 497 to 507 in a8cb711
Those buffers are not actually big: ImageSharp/src/ImageSharp/Formats/Png/Zlib/DeflaterHuffman.cs Lines 64 to 66 in a8cb711
ImageSharp/src/ImageSharp/Formats/Png/Zlib/DeflaterHuffman.cs Lines 19 to 26 in a8cb711
Without proper investigation I thought several places had this kind of small allocation, but I was wrong; that's the only place I could find. Regarding #1596 (comment): I changed the bucket size prior to experimenting with non-contiguous buffers, sorry for not mentioning that. I'll try to reproduce and write a proper issue/comment today.
Hello, I saw this issue pop up in my twitter feed and I was curious about it, I hope you'll forgive me for butting in 👀

I didn't see it mentioned, but at least in the scenario of very large pixel buffers, an option could be to rely on memory-mapped files.

⚠ One possible drawback could be a performance hit associated with allocating/deallocating the memory pages for the memory-mapped file (this would need to be benchmarked), but this should still be worthwhile for very large buffers, as a heap-based mechanism would also need to allocate new pages. Of course, this way of allocating memory could also be used with pooling and discontiguous buffers. What it does at the core is give you more control over how much physical memory is allocated, or at least reserved in your address space. For the sake of completeness, you can find a rough example of usage of
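A minimal sketch of the approach described above (not ImageSharp code): an anonymous map backed by the system paging file, accessed through a raw pointer.

```csharp
using System;
using System.IO.MemoryMappedFiles;

internal static class MappedBufferExample
{
    public static unsafe void Run(long byteCount)
    {
        // A null map name creates an anonymous map: address space is
        // reserved up front, physical pages are committed lazily.
        using var mmf = MemoryMappedFile.CreateNew(null, byteCount);
        using var view = mmf.CreateViewAccessor(0, byteCount);

        byte* ptr = null;
        view.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);
        try
        {
            var span = new Span<byte>(ptr, (int)Math.Min(byteCount, int.MaxValue));
            span.Clear(); // touching the pages is what actually commits them
        }
        finally
        {
            view.SafeMemoryMappedViewHandle.ReleasePointer();
        }
    }
}
```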
Regarding discontiguous buffers. If we went that route we would likely not be able to pin anything. There's a couple of places we do so currently.
@JimBobSquarePants in the current design

@GoldenCrystal using memory-mapped files is a good idea, but it's the most complex option of all, needs a lot of research, and has to be implemented with care. If possible, I would prefer to go with managed LOH arrays, because anything unmanaged is a breaking change (an application leaking
In case anyone chooses to chime in but finds the thread TL;DR: I'm thinking about a custom pool of uniform (~2MB) LOH arrays that can be used to build larger discontiguous buffers to back ImageSharp's images; a rough sketch follows below.

Q1: Is this a viable strategy to fight fragmentation and OOMs in memory constrained environments, or are there some GC behaviors that make this inefficient?
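A rough sketch of that idea (hypothetical types, not actual ImageSharp code): because every block has the same size, any returned block can serve any later rental, and a large image is backed by several pooled blocks instead of one giant array.

```csharp
using System.Collections.Concurrent;

internal sealed class UniformBlockPool
{
    // ~2MB: large enough to land on the LOH, so pooled blocks are uniform and stable.
    public const int BlockLength = 2 * 1024 * 1024;

    private readonly ConcurrentBag<byte[]> freeBlocks = new ConcurrentBag<byte[]>();
    private readonly int capacity;

    public UniformBlockPool(int capacity) => this.capacity = capacity;

    // Rents enough uniform blocks to cover 'totalBytes'; a discontiguous
    // buffer (e.g. a MemoryGroup) would be built on top of them.
    public byte[][] Rent(long totalBytes)
    {
        int blockCount = (int)((totalBytes + BlockLength - 1) / BlockLength);
        var blocks = new byte[blockCount][];
        for (int i = 0; i < blockCount; i++)
        {
            blocks[i] = this.freeBlocks.TryTake(out byte[] block) ? block : new byte[BlockLength];
        }

        return blocks;
    }

    public void Return(byte[][] blocks)
    {
        foreach (byte[] block in blocks)
        {
            if (this.freeBlocks.Count < this.capacity)
            {
                this.freeBlocks.Add(block);
            }
        }
    }
}
```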
Yep, but this had me concerned.
Honestly, the current problems with OutOfMemory exceptions look like a bug rather than real memory management problems. I can't believe that even in parallel tests 8 (or even 16) cores can consume up to 8GB at peak on such a simple task (source: the bee heads demo by @saucecontrol).
One of the simplest yet most "controllable" things from the user's perspective you can do (especially good for bulk processing):

```csharp
using (var context = new Context())
{
    using Image img = Image.Load(..., context);
    using Image resized = img.Resize(..., context);
    resized.Save(..., context);
}
```

Unfortunately, this would lead to a lot of public API changes/additions, but it allows the user to manually control when processing is over and resources can be freed.

The problem with LOH allocations is that the LOH can't actually trigger GC as reliably as non-LOH allocations do. We have 2 rather rare scenarios of LOH collection:

(1) won't be caused by pooling if we use LOH buffers

What I wanted to state is that LOH collection is a really rare occasion unless triggered manually, which is not something library code can reliably control. Yes, we can try to force a gen2 GC collection when a pool buffer is returned:

```csharp
// somewhere in the pool class
public void Return(byte[] buffer)
{
    // buffer return logic ...
    usedBuffersCount--;

    // Note: the division must be floating point, otherwise it truncates to 0.
    if ((double)usedBuffersCount / allocatedBuffersCount < somePercentage)
    {
        GC.Collect(2);
    }
}
```

But this should be done with care, as a gen2 collection could cause a significant performance hit.
@br3aker They're the same thing if set, which is the default behavior. Locking during Rent/Return occurs in the underlying

I agree it would be much better to allow trimming but without completely reimplementing
@JimBobSquarePants I must be blind, thanks for clarifying about the locks, I didn't see that. Yep, a new solution would require an entirely new pool implementation, which I'm willing to try in the upcoming week(s). I'll open a draft if anything works out.
@br3aker on .NET Core there is a trick to detect if a gen2 GC is happening (see the sketch below). What I would do is to trim some of the retained buffers on every gen2 GC (eg. 1/2 or 3/4 or similar). I think this would handle the cases you pointed out in #1596 (comment) without complex statistics.

Note that the contextual batching API you referred to in #1596 (comment) already exists: you can create and pass around a
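The detection trick is finalizer re-registration; here is a minimal sketch of the pattern (the runtime uses a similar internal helper, also named `Gen2GcCallback`, to drive `ArrayPool.Shared` trimming):

```csharp
using System;

internal sealed class Gen2GcCallback
{
    private readonly Func<bool> callback;

    private Gen2GcCallback(Func<bool> callback) => this.callback = callback;

    public static void Register(Func<bool> callback) => new Gen2GcCallback(callback);

    ~Gen2GcCallback()
    {
        // Once this object has aged into gen2, its finalizer only runs after a
        // full (gen2) collection. Returning true resurrects it so the callback
        // fires again after the next gen2 GC; returning false stops it.
        if (this.callback())
        {
            GC.ReRegisterForFinalize(this);
        }
    }
}

// Usage sketch: trim half of the retained pool buffers on every full GC.
// Gen2GcCallback.Register(() => { pool.Trim(0.5); return true; });
```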
@antonfirsov yeah, I understand about "better memory handling by default"; the contextual batching API idea was more of a feature around the new built-in allocator. The current API is a bit misleading with creating a new allocator and then manually calling
@antonfirsov Would the 2MB uniform limit have any effect upon our ability to work with wrapped buffers? E.g.
ImageSharp/src/ImageSharp/Memory/DiscontiguousBuffers/MemoryGroup{T}.cs Lines 179 to 198 in 97dde7f
This would seem to be the way in the short term. The way it trims in response to

I think regardless, you'll end up with times you need to allocate something over the pool limit, and falling back to unmanaged memory (with a SafeHandle to protect against failure to Dispose) would be a big improvement there. Those extra-large allocations living until the next full GC can be a killer.
We should not use it; instead, we should always trim the pool by a certain percentage on every Gen2 GC.
Subscribed! Very curious to see what the issue is.
Very late to the party (super interesting conversation by the way!), just thought I'd throw this out there too as it seems somewhat relevant to the new issue Anton opened, and potentially useful?

This is essentially meant to be a more reliable and less hacky way to get memory-pressure-based callbacks than what is achievable today by using
@Sergio0694 yeah, seen that, cool stuff, thanks a lot for pushing it!
I made some progress with a prototype on a branch. As a next step, I need to start some benchmarking to determine the optimal parameters; however, I'm not sure what pool sizes are still considered "sane". It can be any value between a few megabytes and several gigabytes on many-core systems. Note that dotnet/runtime#55621 went very aggressive for

Maybe we could define it as a percentage of
Wow, the limit has been removed entirely! So what's your current plan then? An admittedly very quick look at your branch yields what seems to me to be 3 pools now?
Isn't using the

I feel like the
@JimBobSquarePants it's still in a prototype state. Trimming is not yet implemented; first I want to figure out the optimal parameters for the pool size and other things. The current WIP default is 256 MB; 48 MB is too low for stress cases.

I think we should hide the implementing type and expose a factory instead:

```csharp
public class MemoryAllocator
{
    public static MemoryAllocator CreateDefault(
        int maxContiguousBlockSizeInBytes = DefaultContiguousBlockSizeInBytes,
        int totalPoolSizeInBytes = DefaultTotalPoolSizeInBytes);
}
```

We can introduce utilities to help interop use cases, which we will likely break by scaling down the default contiguous buffer size (see #1307 and #1258):

```csharp
public static class SizeCalculator
{
    public static int GetImageSizeOf<T>(int width, int height);
}

// Customize the allocator for GPU interop use cases:
Configuration.Default.MemoryAllocator = MemoryAllocator.CreateDefault(
    maxContiguousBlockSizeInBytes: SizeCalculator.GetImageSizeOf<Rgba32>(4096, 4096));
```
Yep. I like all these ideas!
@JimBobSquarePants @saucecontrol I was already at the final steps of polishing my implementation when I discovered an issue with the

The following graph shows the committed memory for an experiment which attempts to free all the memory in the end:

For comparison, here is the same experiment with a slight alteration: setting

**What to do now?**

We can do a better job by getting rid of the GC for large buffers entirely, and refactoring

To deliver something to the users in the meanwhile, we can also consider PR-ing an unmanaged allocator without any pooling. From what I see now in large-scale stress benchmarks,
Oh wow! Interesting and slightly disappointing result. I'm glad you've shown your level of thoroughness. I would have likely missed something like this. A few more weeks is no problem. It gives us time to review and merge the WebP code and fix a few more issues. Plus I'm trying to focus on Fonts and Drawing right now to get them closer to RC status.
@jaddie there is a WIP PR #1730, and an API proposal in #1739; discussions have shifted there. The required change is so fundamental that we decided to do a major version bump, so our next release containing the memory fix will be ImageSharp 2.0. The memory PR is expected to be finished within 1 or 2 months (not that much work left, but I'm doing it in my free time); however, we also want to include WebP in that release, which may prolong things too. Hopefully we will still be able to deliver 2.0 before the end of 2021. If you are considering helping: I need users with production-like environments to test and validate #1730. Let me know if you are interested.
@antonfirsov With regard to the issue that the LOH cannot decommit memory, have you tried the GCSettings.LargeObjectHeapCompactionMode property?
Yes, the second graph in #1596 (comment) is made with that setting.

The problem is that I'm not sure if it's good practice to touch that property from library code. The compaction comes with extra GC cost, unexpected by the user, and we don't know when / how often to compact. I'm not excluding that it can be a lesser evil in our case than going unmanaged and sacrificing the simplicity of our pixel manipulation APIs as a result (see #1739), but I can't commit to such a decision without consulting a GC expert. @cshung if you think you have some time to chat about this, let me know, I would really appreciate the help!
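For reference, the experiment above boils down to standard BCL calls:

```csharp
using System;
using System.Runtime;

// Request a one-time LOH compaction on the next blocking full collection.
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();
```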
Not compacting the LOH automatically by default is a bit sad. Historically, we never compacted the LOH; some customers depended on that fact and assumed LOH-allocated objects are pinned, so we cannot automatically compact the LOH without some kind of user opt-in, otherwise we might break some users.

Those days are long gone, now we can automatically compact the LOH with minimal configuration. Starting with .NET 6, you can specify the

Unfortunately, the setting is not documented yet; it will be. For the time being, we can explore what the setting could do by searching for it in the code here(*).

This should alleviate the need to set that property - whether or not to compact the LOH is best left for the GC to decide.

(*) Whatever we actually do in the code is an implementation detail that is subject to change; we do need the flexibility to avoid painting ourselves into a corner, like what we did with the LOH not compacting.
I am more than happy to reach out to developers like you who care about garbage collection performance. My personal goal is to understand the needs and ideally come up with benchmarks that are representative to work on. What is the best way to reach you?
@cshung thanks a lot for the answers!
From our perspective the problem with an opt-in setting is that we can't configure it on behalf of our users. Even if we promoted it in our documentation, most users would still miss it and complain about poor scalability compared to unmanaged libraries like Skia(Sharp). It's much better if the library "just works" without any extra configuration thanks to good defaults. For now I decided to go on with switching to unmanaged memory for two reasons:
However, thinking in longer terms, this feels wrong to me. Ideally, a managed library should be able to meet all of its requirements using managed memory only. It would be cool to switch back to the GC in a future version. I wonder if ImageSharp is some sort of special animal here, or whether there are other memory-heavy libraries or apps facing similar issues.
That's great to hear, we can chat on Teams I think.
One way to address the concerns in #1590 is to come up with a new `MemoryAllocator` that meets the following two requirements even under very high load:

(A) Steady allocation patterns, instead of GC fluctuations
(B) Lower amount of memory being retained, at least "after some time"

Some ideas to explore:

1.1 Consider pooling unmanaged memory, especially if we implement the next point.

Point 1. seems to be very simple to prototype: we need an allocator that uses `Marshal.AllocHGlobal` over some threshold and an `ArrayPool` below it (a rough sketch follows below), and see how the memory timeline goes with the bee heads MemoryStress benchmark in comparison to `ArrayPoolMemoryAllocator`.

@saucecontrol any thoughts or further ideas? (Especially on point 2.)
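A minimal sketch of point 1 (names are illustrative, not a proposed API): small requests go to the shared `ArrayPool`, large ones go to unmanaged memory so they never touch the LOH.

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;

internal static class HybridAllocatorSketch
{
    private const int PoolThresholdBytes = 1 << 20; // 1MB; the real threshold needs benchmarking

    public static (byte[] Pooled, IntPtr Unmanaged) Allocate(int byteCount)
    {
        if (byteCount <= PoolThresholdBytes)
        {
            // May return a larger array than requested; callers must slice.
            return (ArrayPool<byte>.Shared.Rent(byteCount), IntPtr.Zero);
        }

        return (null, Marshal.AllocHGlobal(byteCount));
    }

    public static void Free((byte[] Pooled, IntPtr Unmanaged) buffer)
    {
        if (buffer.Pooled != null)
        {
            ArrayPool<byte>.Shared.Return(buffer.Pooled);
        }
        else
        {
            Marshal.FreeHGlobal(buffer.Unmanaged);
        }
    }
}
```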