More efficient MemoryAllocator #1596
Idea 2 is interesting. I wonder, though, if you can manage discontiguous buffers, is it much of a stretch from there to go fully streamed? For the record, I also hate idea 4, but that may be necessary if you really do have to materialize large images and care at all about 32-bit hosts. The machine I ran my tests on was only a quad core as well. I wonder what would happen on a 16+ core machine... I reckon even in 64-bit you'd start paging.
Implementing the ability to have discontiguous buffers was relatively easy, since our processing code usually goes row-by-row; we just had to upgrade our internals. Currently we have a very high limit for images being split into different buffers; what I'm considering is to check what happens if we aggressively lower that limit. I'm unable to assess whether it would make things worse or better.
That's why I want to avoid materializing them :) Will open an issue for tracking "decode into target size" stuff tomorrow.
I was wondering why you are using the basic
I don't think it would make things worse. If I understand right, your current mode of processing is to apply each transform to the entire image before moving on to the next? If so, you're kind of thrashing the cache at each stage, so it wouldn't matter whether the memory is contiguous or not. I'm not sure it would make things better either, though. You might want to look at your LOH fragmentation under those high memory conditions and see whether being able to stuff a smaller buffer into an LOH hole would gain you anything. Switching to unmanaged buffers may help a bit there, but heap fragmentation can be a problem in unmanaged as well. Of course being able to aggregate small buffers in place of a single larger one would be of benefit, provided you don't need them at the same time. I assume if that works for RecyclableMemoryStream, it should work for ImageSharp 🤷♂️
I must admit I didn't put that much thought into it. But yeah, the point was to make sure there were enough threads running to check whether the libraries were capable of true linear scale-up. As long as there are free processors to hide work on (as Vips does today -- and ImageSharp did before your windowed convolution implementation), you can't be sure. If you think about a high-load web server scenario, you would assume the managed thread pool would fully occupy all processors, so I want to know how a library will perform in that environment -- and to be sure that a large number of image processing operations doesn't make for a self-DoS. BTW, I suspect the anomalous CPU measurements I got for Vips running under the profiler might have actually come from an oversubscription problem. Could be the CPU overhead of the profiler was just enough to throw the whole system into chaos and drop the speed by several multiples. I plan on doing some followup testing to see if I can nail that down.
Couple of other interesting stress cases you might want to look at.
The reason I'm thinking about "discontiguous memory by default" is not cache coherence, but the fact that it may enable more efficient pooling, reducing the GC pressure (or OS allocation traffic in case of unmanaged buffers). With large images it is very likely that the requested buffer will outgrow the pool, deferring to the GC/OS to allocate new giant arrays of non-uniform sizes, which would probably result in heap fragmentation even if we are using unmanaged memory. I'm thinking of a solution that works as follows:
@saucecontrol any thoughts on this particular plan? I wish I could pull in a GC expert to assess it ...
That plan looks sound to me, but I'm no GC expert. I do expect that inability to grab a contiguous large block of memory due to LOH (or process heap) fragmentation is at play in the 32-bit OOM scenario, but it would be good to verify that before diving into the work if possible. It seems like if you end up doing the work to completely manage the allocations, you may as well go straight to unmanaged memory, so that when you say to free something you don't have to then wait for the GC to comply and then do its LOH compaction and all that. Plus, you'll probably always have to deal with some allocation requests that will be over whatever pooling threshold you set, and you'll want those to go straight to unmanaged so they are as short-lived as possible.
The problem is that I don't really know when to free up memory. With LOH + the
@JimBobSquarePants it just allocates / returns unmanaged memory without pooling. Maybe it's good enough (or even preferable) approach with unmanaged memory, not sure. My main concern about unmanaged memory is that I would be hesitant to use it as a default, because of the users who may implicitly depend on not disposing |
Instead of holding the raw pointer, you can wrap it in a `SafeHandle`:

```csharp
internal sealed class SafeHGlobalHandle : SafeHandle
{
    private readonly int byteCount;

    public override bool IsInvalid => handle == IntPtr.Zero;

    public SafeHGlobalHandle(int size) : base(IntPtr.Zero, true)
    {
        // Allocate unmanaged memory and inform the GC about it, so that
        // collection heuristics account for the off-heap bytes.
        SetHandle(Marshal.AllocHGlobal(size));
        GC.AddMemoryPressure(byteCount = size);
    }

    protected override bool ReleaseHandle()
    {
        if (IsInvalid)
            return false;

        Marshal.FreeHGlobal(handle);
        GC.RemoveMemoryPressure(byteCount);
        handle = IntPtr.Zero;
        return true;
    }
}
```

Since `SafeHandle` guarantees the memory is released by its finalizer even if `Dispose` is never called, it might be worth implementing something like that for any allocations larger than your max pool buffer size, just because you know those will happen and will have a large GC cost. It's also a lot easier to trial than the segmented buffer idea.
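A minimal usage sketch of the handle above (the buffer size and processing step are illustrative): access goes through a `Span<byte>` over the unmanaged block, the memory is freed deterministically at the end of the `using`, and the finalizer acts as a safety net if `Dispose` is skipped.

```csharp
using System;
using System.Runtime.InteropServices;

internal static class SafeHGlobalExample
{
    public static unsafe void ProcessLargeBuffer(int byteCount)
    {
        using (var buffer = new SafeHGlobalHandle(byteCount))
        {
            // Wrap the unmanaged block in a Span<byte> for bounds-checked access.
            var span = new Span<byte>(buffer.DangerousGetHandle().ToPointer(), byteCount);
            span.Clear(); // AllocHGlobal memory is not zero-initialized

            // ... row-by-row pixel processing would go here ...
        } // freed here, without waiting for a gen2 GC or LOH compaction
    }
}
```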
FYI dotnet/runtime#52098 is tracking some possible scenarios to improve in ArrayPool/GC interaction. I think Jan's scenarios 1 and 3 (and maybe 2 as well) are applicable to ImageSharp.
There are so many things I hate about ArrayPool. All the magic ArrayPool.Shared is doing should be a runtime/GC implementation detail hidden behind standard APIs.
Hi everybody! First of all, the current state of pooling already supports discontiguous buffers; it's just hidden behind a very high limit. First it checks if a single stride can fit in a single contiguous buffer:
Then it does strange things: ImageSharp/src/ImageSharp/Memory/DiscontiguousBuffers/MemoryGroup{T}.cs Lines 99 to 107 in a8cb711
Details aside - this alloc call either throws or pools a single contiguous buffer each time. So here's the first problem: most of the time the pool is stressed by huge arrays from the buckets at the higher end of the indices while the lower indices sit empty, which makes

Edit: huge mistake in my calculations, and the hardcoded upper bound of the pool max size is hex, not dec; this still applies, but only for certain pool setups, see the post below.

```csharp
const int MinimumArrayLength = 0x10, MaximumArrayLength = 0x40000000;

if (maxArrayLength > MaximumArrayLength)
{
    maxArrayLength = MaximumArrayLength;
}
```

Which leads to unexpected LOH allocations when working with images up to ~40mb of pixel data, and that isn't very rare: a Parallel.ForEach over the bee heads set can skyrocket memory usage (I've been counting gigabytes of working set during execution) with frequent Gen2 collections.
You've got a couple of translation errors in your analysis. First, the hard-coded

But you're right about the LOH allocations. The default memory pool is limited to 24MB per array, so images over ~8 megapixels will cause allocations, and there are definitely images in the bee heads set that exceed that limit.

ImageSharp/src/ImageSharp/Memory/Allocators/ArrayPoolMemoryAllocator.CommonFactoryMethods.cs Line 15 in a8cb711
The challenge is in achieving a balance between avoiding GC allocations and holding a high steady-state memory level when large arrays get pooled.
Oh gosh, you're right, math was not on my side tonight, thanks for pointing that out. The worst part about the built-in .NET pool is that it was developed for the general use case. Exponential growth of the internal buffers would waste a lot of space in certain conditions. Empty buckets are not a big deal, but they just sit there for no reason. Although
IMO all allocations should be divided into 2 categories: pixel data, which outlives any other allocation type in the context of this library, and everything else. The main memory consumers are pixel buffers, which cause massive hits on both the GC heaps and any theoretical custom memory pool based on unmanaged memory - it'll be much harder to support large buffers while also pooling small buffers of integers/small structs. This way we can delegate those small allocations to the GC (thanks to tiers this won't really affect its performance) while concentrating on storing pixel buffers with proper alignment. Another interesting concept mentioned by @antonfirsov - async/await for memory pooling (see the sketch below). This approach could help in environments where memory is crucial, but it might be really tricky not to fall into a deadlock when there isn't enough memory for any of the awaiters.
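A minimal sketch of that async pooling concept, assuming a fixed budget of outstanding large buffers guarded by a `SemaphoreSlim` (all names are illustrative): callers asynchronously wait for a permit instead of failing the allocation, which is also exactly where the deadlock caveat applies if every awaiter holds a permit while needing another.

```csharp
using System.Threading;
using System.Threading.Tasks;

internal sealed class ThrottledAllocator
{
    private readonly SemaphoreSlim budget;

    public ThrottledAllocator(int maxOutstandingLargeBuffers)
        => this.budget = new SemaphoreSlim(maxOutstandingLargeBuffers);

    // Completes when a permit is free instead of throwing OutOfMemoryException.
    public async Task<byte[]> RentAsync(int length, CancellationToken token = default)
    {
        await this.budget.WaitAsync(token).ConfigureAwait(false);
        return new byte[length];
    }

    public void Return(byte[] buffer) => this.budget.Release();
}
```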
@br3aker reacting to #1596 (comment) first:
If any of these are true, it's a bug. You should be able to limit the maximum contiguous buffer size by using the following full constructor:
Note that the default for

I believe lowering the capacity + setting a higher bucket size should help with the biggest "pooling doesn't work in certain situations" problem. If you have a repro proving this fails on the load+resize+save path (= a contiguous buffer is allocated or an exception is thrown), can you please open a new separate issue for that?
Regarding the rest of the comments: I believe all the mentioned issues can be addressed either by a very basic unmanaged allocator or by the "smart" allocator described in #1596 (comment). None of those would use the BCL `ArrayPool`.
I wish this was the case, but it is not true: #1597. This is our top-priority memory issue now, affecting the most common load+resize+save path. Even if we manage to fully optimize it, the "fancy" (blur, sharpen etc.) processors may still keep allocating all kinds of large buffers.
Oh yeah, I haven't actually checked all the processors. What I meant is that the allocator shouldn't be bothered with small allocations like here: ImageSharp/src/ImageSharp/Formats/Png/Zlib/DeflaterHuffman.cs Lines 497 to 507 in a8cb711
Those buffers are not actually big: ImageSharp/src/ImageSharp/Formats/Png/Zlib/DeflaterHuffman.cs Lines 64 to 66 in a8cb711
ImageSharp/src/ImageSharp/Formats/Png/Zlib/DeflaterHuffman.cs Lines 19 to 26 in a8cb711
Without proper investigation I thought several places had this kind of small allocation, but I was wrong; that's the only place I could find. Regarding #1596 (comment): I changed the bucket size prior to experimenting with non-contiguous buffers, sorry for not mentioning that. I'll try to reproduce and write a proper issue/comment today.
Hello, I saw this issue pop up in my twitter feed and I was curious about it, I hope you'll forgive me for butting in 👀

I didn't see it mentioned, but at least in the scenario of very large pixel buffers, an option could be to rely on memory-mapped files.

⚠ One possible drawback could be a performance hit associated with allocating/deallocating the memory pages for the memory-mapped file (this would need to be benchmarked), but this should still be worthwhile for very large buffers, as a heap-based mechanism would also need to allocate new pages. Of course, this way of allocating memory could also be used with pooling and discontiguous buffers. What it does at the core is give you more control over how much physical memory is allocated, or at least reserved in your address space. For the sake of completeness, you can find a rough example of usage of
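A minimal sketch of the approach described above (not ImageSharp code): an anonymous map backed by the system paging file, accessed through a raw pointer.

```csharp
using System;
using System.IO.MemoryMappedFiles;

internal static class MappedBufferExample
{
    public static unsafe void Run(long byteCount)
    {
        // A null map name creates an anonymous map: address space is
        // reserved up front, physical pages are committed lazily.
        using var mmf = MemoryMappedFile.CreateNew(null, byteCount);
        using var view = mmf.CreateViewAccessor(0, byteCount);

        byte* ptr = null;
        view.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);
        try
        {
            var span = new Span<byte>(ptr, (int)Math.Min(byteCount, int.MaxValue));
            span.Clear(); // touching the pages is what actually commits them
        }
        finally
        {
            view.SafeMemoryMappedViewHandle.ReleasePointer();
        }
    }
}
```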
Regarding discontiguous buffers. If we went that route we would likely not be able to pin anything. There's a couple of places we do so currently.
@JimBobSquarePants in the current design

@GoldenCrystal using memory-mapped files is a good idea, but it's the most complex option of all, needs a lot of research, and has to be implemented with care. If possible, I would prefer to go with managed LOH arrays, because anything unmanaged is a breaking change (an application leaking
In case anyone chooses to chime in but finds the thread TL;DR: I'm thinking about a custom pool of uniform (~2MB) LOH arrays that can be used to build larger discontiguous buffers to back ImageSharp's images; a rough sketch follows below.

Q1: Is this a viable strategy to fight fragmentation and OOMs in memory constrained environments, or are there some GC behaviors that make this inefficient?
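A rough sketch of that idea (hypothetical types, not actual ImageSharp code): because every block has the same size, any returned block can serve any later rental, and a large image is backed by several pooled blocks instead of one giant array.

```csharp
using System.Collections.Concurrent;

internal sealed class UniformBlockPool
{
    // ~2MB: large enough to land on the LOH, so pooled blocks are uniform and stable.
    public const int BlockLength = 2 * 1024 * 1024;

    private readonly ConcurrentBag<byte[]> freeBlocks = new ConcurrentBag<byte[]>();
    private readonly int capacity;

    public UniformBlockPool(int capacity) => this.capacity = capacity;

    // Rents enough uniform blocks to cover 'totalBytes'; a discontiguous
    // buffer (e.g. a MemoryGroup) would be built on top of them.
    public byte[][] Rent(long totalBytes)
    {
        int blockCount = (int)((totalBytes + BlockLength - 1) / BlockLength);
        var blocks = new byte[blockCount][];
        for (int i = 0; i < blockCount; i++)
        {
            blocks[i] = this.freeBlocks.TryTake(out byte[] block) ? block : new byte[BlockLength];
        }

        return blocks;
    }

    public void Return(byte[][] blocks)
    {
        foreach (byte[] block in blocks)
        {
            if (this.freeBlocks.Count < this.capacity)
            {
                this.freeBlocks.Add(block);
            }
        }
    }
}
```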
Yep, but this had me concerned.
Honestly, the current problems with OutOfMemory exceptions look like a bug rather than real memory management problems. I can't believe that even in parallel tests 8 (or even 16) cores can consume up to 8GB at peak on such a simple task (source: the bee heads demo by @saucecontrol).
One of the simplest yet most "controllable" things from the user's perspective you can do (especially good for bulk processing):

```csharp
using (var context = new Context())
{
    using Image img = Image.Load(..., context);
    using Image resized = img.Resize(..., context);
    resized.Save(..., context);
}
```

Unfortunately, this would lead to a lot of public API changes/additions, but it allows the user to manually control when processing is over and resources can be freed.

The problem with LOH allocations is that the LOH can't actually trigger GC as reliably as non-LOH allocations do. We have 2 rather rare scenarios of LOH collection:

(1) won't be caused by pooling if we use LOH buffers

What I wanted to state is that LOH collection is a really rare occasion unless triggered manually, which is not something library code can reliably control. Yes, we can try to force a gen2 GC collection when a pool buffer is returned:

```csharp
// somewhere in the pool class
public void Return(byte[] buffer)
{
    // buffer return logic ...
    usedBuffersCount--;

    // Note: the division must be floating point, otherwise it truncates to 0.
    if ((double)usedBuffersCount / allocatedBuffersCount < somePercentage)
    {
        GC.Collect(2);
    }
}
```

But this should be done with care, as a gen2 collection could cause a significant performance hit.
@br3aker They're the same thing if set, which is the default behavior. Locking during Rent/Return occurs in the underlying

I agree it would be much better to allow trimming but without completely reimplementing
@JimBobSquarePants I must be blind, thanks for clarifying about the locks, I didn't see that. Yep, a new solution would require an entirely new pool implementation, which I'm willing to try in the upcoming week(s). I'll open a draft if anything works out.
@br3aker on .NET Core there is a trick to detect if a gen2 GC is happening (see the sketch below). What I would do is to trim some of the retained buffers on every gen2 GC (eg. 1/2 or 3/4 or similar). I think this would handle the cases you pointed out in #1596 (comment) without complex statistics.

Note that the contextual batching API you referred to in #1596 (comment) already exists: you can create and pass around a
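The detection trick is finalizer re-registration; here is a minimal sketch of the pattern (the runtime uses a similar internal helper, also named `Gen2GcCallback`, to drive `ArrayPool.Shared` trimming):

```csharp
using System;

internal sealed class Gen2GcCallback
{
    private readonly Func<bool> callback;

    private Gen2GcCallback(Func<bool> callback) => this.callback = callback;

    public static void Register(Func<bool> callback) => new Gen2GcCallback(callback);

    ~Gen2GcCallback()
    {
        // Once this object has aged into gen2, its finalizer only runs after a
        // full (gen2) collection. Returning true resurrects it so the callback
        // fires again after the next gen2 GC; returning false stops it.
        if (this.callback())
        {
            GC.ReRegisterForFinalize(this);
        }
    }
}

// Usage sketch: trim half of the retained pool buffers on every full GC.
// Gen2GcCallback.Register(() => { pool.Trim(0.5); return true; });
```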
@antonfirsov yeah, I understand about "better memory handling by default"; the contextual batching API idea was more of a feature around the new built-in allocator. The current API is a bit misleading with creating a new allocator and then manually calling
@antonfirsov Would the 2MB uniform limit have any effect upon our ability to work with wrapped buffers? E.g.
ImageSharp/src/ImageSharp/Memory/DiscontiguousBuffers/MemoryGroup{T}.cs Lines 179 to 198 in 97dde7f
This would seem to be the way in the short term. The way it trims in response to

I think regardless, you'll end up with times you need to allocate something over the pool limit, and falling back to unmanaged memory (with a SafeHandle to protect against failure to Dispose) would be a big improvement there. Those extra-large allocations living until the next full GC can be a killer.
We should not use it; instead, we should always trim the pool by a certain percentage on every Gen2 GC.
Subscribed! Very curious to see what the issue is.
Very late to the party (super interesting conversation by the way!), just thought I'd throw this out there too as it seems somewhat relevant to the new issue Anton opened, and potentially useful?

This is essentially meant to be a more reliable and less hacky way to get memory-pressure-based callbacks than what is achievable today by using
@Sergio0694 yeah, seen that, cool stuff, thanks a lot for pushing it!
I made some progress with a prototype on a branch. As a next step, I need to start some benchmarking to determine the optimal parameters; however, I'm not sure what pool sizes are still considered "sane". It can be any value between a few megabytes and several gigabytes on many-core systems. Note that dotnet/runtime#55621 went very aggressive for

Maybe we could define it as a percentage of
Wow, the limit has been removed entirely! So what's your current plan then? An admittedly very quick look at your branch yields what seems to me to be 3 pools now?
Isn't using the

I feel like the
@JimBobSquarePants it's still in a prototype state. Trimming is not yet implemented; first I want to figure out the optimal parameters for the pool size and other things. The current WIP default is 256 MB; 48 MB is too low for stress cases.

I think we should hide the implementing type and expose a factory instead:

```csharp
public class MemoryAllocator
{
    public static MemoryAllocator CreateDefault(
        int maxContiguousBlockSizeInBytes = DefaultContiguousBlockSizeInBytes,
        int totalPoolSizeInBytes = DefaultTotalPoolSizeInBytes);
}
```

We can introduce utilities to help interop use cases, which we will likely break by scaling down the default contiguous buffer size (see #1307 and #1258):

```csharp
public static class SizeCalculator
{
    public static int GetImageSizeOf<T>(int width, int height);
}

// Customize the allocator for GPU interop use cases:
Configuration.Default.MemoryAllocator = MemoryAllocator.CreateDefault(
    maxContiguousBlockSizeInBytes: SizeCalculator.GetImageSizeOf<Rgba32>(4096, 4096));
```
Yep. I like all these ideas!
@JimBobSquarePants @saucecontrol I was already at the final steps of polishing my implementation when I discovered an issue with the

The following graph shows the committed memory for an experiment which attempts to free all the memory in the end:

For comparison, here is the same experiment with a slight alteration: setting

**What to do now?**

We can do a better job by getting rid of the GC for large buffers entirely, and refactoring

To deliver something to the users in the meanwhile, we can also consider PR-ing an unmanaged allocator without any pooling. From what I see now in large-scale stress benchmarks,
Oh wow! Interesting and slightly disappointing result. I'm glad you've shown your level of thoroughness. I would have likely missed something like this. A few more weeks is no problem. It gives us time to review and merge the WebP code and fix a few more issues. Plus I'm trying to focus on Fonts and Drawing right now to get them closer to RC status.
@jaddie there is a WIP PR #1730, and an API proposal in #1739; discussions have shifted there. The required change is so fundamental that we decided to do a major version bump, so our next release containing the memory fix will be ImageSharp 2.0. The memory PR is expected to be finished within 1 or 2 months (not that much work left, but I'm doing it in my free time); however, we also want to include WebP in that release, which may prolong things too. Hopefully we will still be able to deliver 2.0 before the end of 2021. If you are considering helping: I need users with production-like environments to test and validate #1730. Let me know if you are interested.
@antonfirsov With regard to the issue that the LOH cannot decommit memory, have you tried the GCSettings.LargeObjectHeapCompactionMode property?
Yes, the second graph in #1596 (comment) is made with that setting.

The problem is that I'm not sure if it's good practice to touch that property from library code. The compaction comes with extra GC cost, unexpected by the user, and we don't know when / how often to compact. I'm not excluding that it can be a lesser evil in our case than going unmanaged and sacrificing the simplicity of our pixel manipulation APIs as a result (see #1739), but I can't commit to such a decision without consulting a GC expert. @cshung if you think you have some time to chat about this, let me know, I would really appreciate the help!
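For reference, the experiment above boils down to standard BCL calls:

```csharp
using System;
using System.Runtime;

// Request a one-time LOH compaction on the next blocking full collection.
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();
```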
Not compacting the LOH automatically by default is a bit sad. Historically, we never compacted the LOH; some customers depended on that fact and assumed LOH-allocated objects are pinned, so we cannot automatically compact the LOH without some kind of user opt-in, otherwise we might break some users.

Those days are long gone, now we can automatically compact the LOH with minimal configuration. Starting with .NET 6, you can specify the

Unfortunately, the setting is not documented yet; it will be. For the time being, we can explore what the setting could do by searching for it in the code here(*).

This should alleviate the need to set that property - whether or not to compact the LOH is best left for the GC to decide.

(*) Whatever we actually do in the code is an implementation detail that is subject to change; we do need the flexibility to avoid painting ourselves into a corner, like what we did with the LOH not compacting.
I am more than happy to reach out to developers like you who care about garbage collection performance. My personal goal is to understand the needs and ideally come up with benchmarks that are representative to work on. What is the best way to reach you?
@cshung thanks a lot for the answers!
From our perspective the problem with an opt-in setting is that we can't configure it on behalf of our users. Even if we promoted it in our documentation, most users would still miss it and complain about poor scalability compared to unmanaged libraries like Skia(Sharp). It's much better if the library "just works" without any extra configuration thanks to good defaults. For now I decided to go on with switching to unmanaged memory for two reasons:
However, thinking in longer terms, this feels wrong to me. Ideally, a managed library should be able to meet all of its requirements using managed memory only. It would be cool to switch back to the GC in a future version. I wonder if ImageSharp is some sort of special animal here, or whether there are other memory-heavy libraries or apps facing similar issues.
That's great to hear, we can chat on Teams I think.
One way to address the concerns in #1590 is to come up with a new `MemoryAllocator` that meets the following two requirements even under very high load:

(A) Steady allocation patterns, instead of GC fluctuations
(B) Lower amount of memory being retained, at least "after some time"

Some ideas to explore:

1.1 Consider pooling unmanaged memory, especially if we implement the next point.

Point 1. seems to be very simple to prototype: we need an allocator that uses `Marshal.AllocHGlobal` over some threshold and an `ArrayPool` below it (a rough sketch follows below), and see how the memory timeline goes with the bee heads MemoryStress benchmark in comparison to `ArrayPoolMemoryAllocator`.

@saucecontrol any thoughts or further ideas? (Especially on point 2.)
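A minimal sketch of point 1 (names are illustrative, not a proposed API): small requests go to the shared `ArrayPool`, large ones go to unmanaged memory so they never touch the LOH.

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;

internal static class HybridAllocatorSketch
{
    private const int PoolThresholdBytes = 1 << 20; // 1MB; the real threshold needs benchmarking

    public static (byte[] Pooled, IntPtr Unmanaged) Allocate(int byteCount)
    {
        if (byteCount <= PoolThresholdBytes)
        {
            // May return a larger array than requested; callers must slice.
            return (ArrayPool<byte>.Shared.Rent(byteCount), IntPtr.Zero);
        }

        return (null, Marshal.AllocHGlobal(byteCount));
    }

    public static void Free((byte[] Pooled, IntPtr Unmanaged) buffer)
    {
        if (buffer.Pooled != null)
        {
            ArrayPool<byte>.Shared.Return(buffer.Pooled);
        }
        else
        {
            Marshal.FreeHGlobal(buffer.Unmanaged);
        }
    }
}
```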