
Conversation

@antonfirsov
Member

@antonfirsov antonfirsov commented Aug 8, 2021

UPDATE 2: Ready for review!

  • Although later than it used to, 32-bit still fails with OutOfMemoryException. Consider retrying Marshal.AllocHGlobal on OutOfMemoryException after a short wait. DONE: we now block the thread on OOM and retry the allocation. With 20 threads, 32-bit is 2x slower than 64-bit, but no longer OOMs. The retries alone are not responsible for the 2x slowdown; the 32-bit runtime also seems to run about 1.5x slower with 10 threads, where there are no OOMs.
  • Implement PreferContiguousImageBuffers, remove MemoryAllocator.MinimumContiguousBlockSizeBytes.
  • Implement PixelAccessor<T> and other Pixel processing breaking changes & API discussion #1739 stuff
  • Remove ArrayPoolMemoryAllocator
  • Implement some other minor API changes, e.g. MemoryAllocator.Default. -- Need to change MemoryAllocator.Default; it should be get-only.
  • Fix memory corruption / resource leak bugs
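The OOM-retry behavior described in the first item can be sketched as follows. This is a minimal illustration only; the helper name, retry count, and wait time are assumptions, not the values used in the PR:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

// Hypothetical sketch of retrying Marshal.AllocHGlobal on OutOfMemoryException.
// Retry count and delay are illustrative only.
public static class RetryAllocSketch
{
    public static IntPtr AllocWithRetry(int byteCount, int maxRetries = 5)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return Marshal.AllocHGlobal(byteCount);
            }
            catch (OutOfMemoryException) when (attempt < maxRetries)
            {
                // Block the thread for a short time, giving other threads
                // a chance to release memory before we retry.
                Thread.Sleep(25);
            }
        }
    }
}
```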

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

This PR introduces UniformUnmanagedMemoryPoolMemoryAllocator and sets it as default to fix #1596.

UniformUnmanagedMemoryPoolMemoryAllocator functional characteristics

  • For allocations below 1 MB, use ArrayPool<byte>.Shared
  • For allocations above 1 MB, use UniformUnmanagedMemoryPool, which allocates discontiguous 4 MB blocks of unmanaged memory, up to the pool's limit
  • Over the pool limit, allocate unmanaged buffers split into discontiguous 32 MB blocks. These are freed immediately on disposal
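The three tiers above can be summarized in a small decision sketch. The thresholds come from the list; the helper and its return strings are hypothetical:

```csharp
using System;

// Illustrative sketch of the allocation tiering described above.
// 'SelectTier' and its return values are hypothetical names.
public static class AllocationTierSketch
{
    private const long PoolThresholdBytes = 1024 * 1024; // 1 MB

    public static string SelectTier(long lengthInBytes, bool poolHasCapacity)
    {
        if (lengthInBytes < PoolThresholdBytes)
        {
            // Small allocations go to the shared managed pool.
            return "ArrayPool<byte>.Shared";
        }

        // 4 MB uniform blocks from the unmanaged pool while below its limit,
        // otherwise 32 MB unpooled unmanaged blocks freed immediately on disposal.
        return poolHasCapacity ? "UniformUnmanagedMemoryPool" : "UnpooledUnmanaged";
    }
}
```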

Pool size

According to my benchmarks, the pool should scale to the maximum desired size to achieve the best throughput. There is no point placing an artificial pool limit unless there is a physical limitation. I decided to set the maximum pool size to 1/8 of the available physical memory in 64-bit .NET Core processes. This means that on a 16 GB machine the pool can grow as large as 2 GB.

On 32-bit and other (non-testable) platforms, the pool limit is 128 MB.
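The limit heuristic can be expressed roughly like this. GC.GetGCMemoryInfo().TotalAvailableMemoryBytes is the real .NET Core API for querying available physical memory; the rest of the shape is an assumption for illustration:

```csharp
using System;

// Sketch of the pool-limit heuristic: 1/8 of physical memory in 64-bit
// .NET Core processes, a fixed 128 MB elsewhere.
public static class PoolLimitSketch
{
    public static long GetPoolLimitBytes()
    {
        if (Environment.Is64BitProcess)
        {
            long totalPhysical = GC.GetGCMemoryInfo().TotalAvailableMemoryBytes;
            return totalPhysical / 8; // e.g. 2 GB on a 16 GB machine
        }

        return 128L * 1024 * 1024; // 32-bit and other platforms
    }
}
```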

Trimming

Trimming of the pools is triggered both by Gen 2 GC collections and by a timer. (We need the timer since unmanaged allocations don't trigger GC.) On high load we trim the entire pool; on low load we trim 50% of the pool every minute.
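A minimal sketch of the two triggers (timer plus Gen 2 callback) and the load-dependent trim ratio; all names here are illustrative, not the PR's internals:

```csharp
using System;
using System.Threading;

// Illustrative trimming driver: the timer covers workloads where only
// unmanaged allocations happen (no GC runs), while a Gen 2 GC callback
// would invoke OnTrim for GC-heavy workloads.
public sealed class TrimSchedulerSketch : IDisposable
{
    private readonly Timer timer;
    private readonly Action<double> trim; // argument: fraction of the pool to release

    public TrimSchedulerSketch(Action<double> trim)
    {
        this.trim = trim;
        this.timer = new Timer(
            _ => this.OnTrim(highLoad: false),
            null,
            TimeSpan.FromMinutes(1),
            TimeSpan.FromMinutes(1));
    }

    // High load: trim the whole pool; low load: trim half of it.
    public static double TrimRatio(bool highLoad) => highLoad ? 1.0 : 0.5;

    public void OnTrim(bool highLoad) => this.trim(TrimRatio(highLoad));

    public void Dispose() => this.timer.Dispose();
}
```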

Finalizers

With ArrayPoolMemoryAllocator, if an image is GC-d without being disposed, buffers are never returned to the pool. This is not a hard memory leak, but the pools will eventually be exhausted, because the bucket's running index hits the bucket limit.

To avoid this, MemoryGroup<T>.Owned and UniformUnmanagedMemoryPool.FinalizableBuffer<T> have finalizers returning the UnmanagedMemoryHandle to the pool. This can get tricky, since UnmanagedMemoryHandle is also finalizable:

/// <summary>
/// UnmanagedMemoryHandle's finalizer would release the underlying handle, returning the memory to the OS.
/// We want to prevent this when a finalizable owner (buffer or MemoryGroup) is returning the handle to
/// <see cref="UniformUnmanagedMemoryPool"/> in its finalizer.
/// Since UnmanagedMemoryHandle is CriticalFinalizable, it is guaranteed that the owner's finalizer is called first.
/// </summary>
internal void Resurrect()
{
    GC.SuppressFinalize(this);
    this.resurrected = true;
}

internal void AssignedToNewOwner()
{
    if (this.resurrected)
    {
        // The handle has been resurrected
        GC.ReRegisterForFinalize(this);
        this.resurrected = false;
    }
}

I'm moderately concerned about CA2015, but I don't think it applies to us: Dispose will also free the memory backing a span. Touching a span or a pointer into a SkiaSharp image's memory would likewise be a bug if the image has been finalized.

API changes

Resolves #1739
Fixes #1675

Edit: API changes implemented according to #1739.

Benchmarking methodology

To determine these defaults I systematically compared results of LoadResizeSaveParallelMemoryStress runs, typically varying one parameter over a couple of runs while fixing all the others. I have a bunch of Excel documents comparing the tables; including all of them would be TL;DR, but I can present the data on request.

Benchmark results

I was benchmarking on a 10-core (20-thread) i9-10900X with 64 GB RAM, which allowed me to stress a highly parallel workload with very extensive allocation pressure.

Here is the median processing time (in seconds) of 40 runs of LoadResizeSaveParallelMemoryStress ("Classic" means ArrayPoolMemoryAllocator):

Classic    Nopool     Unmanaged Pool
9.038      8.988      8.306
100.56%    100.00%    92.41%

ImageSharp is about 8% faster with the new default memory allocator.

Results of the LoadResizeSaveStressBenchmarks BDN benchmark also show a 7.5% improvement:

Method maxDegreeOfParallelism Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated
SystemDrawing 20 3,007.7 ms 151.1 ms 8.28 ms 1.00 - - - 30 KB
ImageSharp_DefaultMemoryAllocator 20 682.6 ms 395.6 ms 21.69 ms 0.23 - - - 2,413 KB
ImageSharp_ArrayPoolMemoryAllocator 20 734.9 ms 174.4 ms 9.56 ms 0.24 1000.0000 1000.0000 1000.0000 1,237,298 KB
Magick 20 1,545.1 ms 624.1 ms 34.21 ms 0.51 - - - 106 KB
MagicScaler 20 389.5 ms 163.7 ms 8.97 ms 0.13 - - - 101 KB
SkiaBitmap 20 889.6 ms 182.9 ms 10.03 ms 0.30 - - - 118 KB
NetVips 20 441.4 ms 686.1 ms 37.61 ms 0.15 - - - 102 KB

VirtualAlloc commit lifetimes graph with ArrayPoolMemoryAllocator


VirtualAlloc commit lifetimes graph with the new allocator, demonstrating the trimming


VirtualAlloc commit lifetimes graph with pool size set to zero


I would be happy to see some expert feedback on this solution, especially for the finalizer tricks.

/cc @Sergio0694 @saucecontrol @br3aker

@br3aker
Contributor

br3aker commented Aug 8, 2021

(We need the timer since unmanaged allocations don't trigger GC)

Might be worth testing trimming solely via GC.AddMemoryPressure & gen2 callback?

@antonfirsov
Member Author

@br3aker I'm not sure whether AddMemoryPressure / RemoveMemoryPressure contributes to the Gen 2 allocation budget or not. Might make sense to try out, but I think the timer is not that expensive, and trimming is configured to run every minute (mimicking ArrayPool.Shared) anyway.
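For reference, the GC.AddMemoryPressure idea would look roughly like this around an unmanaged allocation. Both GC.AddMemoryPressure and GC.RemoveMemoryPressure are real .NET APIs; the wrapper type itself is hypothetical:

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper showing how unmanaged allocations could inform the GC,
// so that Gen 2 collections (and a Gen 2 callback) also fire under unmanaged pressure.
public sealed class PressureTrackedBuffer : IDisposable
{
    private readonly long byteCount;
    private IntPtr handle;

    public PressureTrackedBuffer(long byteCount)
    {
        this.byteCount = byteCount;
        this.handle = Marshal.AllocHGlobal((IntPtr)byteCount);
        GC.AddMemoryPressure(byteCount);
    }

    public void Dispose()
    {
        if (this.handle != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(this.handle);
            GC.RemoveMemoryPressure(this.byteCount);
            this.handle = IntPtr.Zero;
        }
    }
}
```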

@codecov

codecov bot commented Aug 8, 2021

Codecov Report

Merging #1730 (5eaa632) into master (4587649) will increase coverage by 0%.
The diff coverage is 89%.


@@           Coverage Diff           @@
##           master   #1730    +/-   ##
=======================================
  Coverage      87%     87%            
=======================================
  Files         935     944     +9     
  Lines       49300   49753   +453     
  Branches     6102    6165    +63     
=======================================
+ Hits        43175   43575   +400     
- Misses       5115    5157    +42     
- Partials     1010    1021    +11     
Flag Coverage Δ
unittests 87% <89%> (+<1%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/ImageSharp/Advanced/AdvancedImageExtensions.cs 84% <ø> (+24%) ⬆️
src/ImageSharp/Common/Helpers/DebugGuard.cs 0% <0%> (ø)
...eSharp/Common/Helpers/Shuffle/IComponentShuffle.cs 100% <ø> (ø)
src/ImageSharp/Common/Helpers/SimdUtils.Shuffle.cs 92% <ø> (ø)
...icInterpretation/WhiteIsZero24TiffColor{TPixel}.cs 0% <0%> (ø)
...rp/Memory/Allocators/Internals/BasicArrayBuffer.cs 100% <ø> (ø)
...Sharp/Memory/Allocators/SimpleGcMemoryAllocator.cs 66% <ø> (-14%) ⬇️
src/ImageSharp/Memory/UnmanagedMemoryManager{T}.cs 62% <0%> (ø)
...s/Convolution/Convolution2PassProcessor{TPixel}.cs 61% <40%> (ø)
...harp/Memory/Allocators/Internals/Gen2GcCallback.cs 44% <44%> (ø)
... and 118 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@JimBobSquarePants
Member

Cannot wait to get stuck into reading this! 🤩

@antonfirsov
Member Author

@JimBobSquarePants I've been thinking a lot & chatting with folks on the C#/lowlevel Discord channel, and I'm no longer sure that using unsafe memory is the right thing to do.

The problem is that an update to 1.1 may turn minor bugs and performance issues into security issues for users of GetRowSpan(y) and TryGetSinglePixelSpan(). These methods look completely safe, but will become inherently unsafe with this PR. (See the warnings added to the docs in c9d1396.)

On the other hand, SkiaSharp merged a similar API without spending a single minute on security concerns: mono/SkiaSharp#1242

@JimBobSquarePants
Member

@antonfirsov That's the IMemory<T> finalizer warning, yeah? That would require some impressively bad code to trigger, but you're right, we should do a careful review.

I need to do some reading there and maybe see if we can get someone from the runtime team who worked in that area to comment.

@antonfirsov
Member Author

antonfirsov commented Aug 9, 2021

Strictly speaking, the finalizer warning doesn't apply to us, since the SafeHandle is finalizable and returns the memory anyway.

That would require some impressively bad code to trigger

First I also thought so, but in fact it can be as simple as:

var image = new Image<Rgba32>(w, h); // no using
Span<Rgba32> span = image.GetPixelRowSpan(0); // last use of the object `image`, finalizers may run after this point

// some relatively long running code here to allow the finalizers to finish

span[0] = default;  // memory corruption

@JimBobSquarePants
Member

I wonder if we could write an analyzer?

@antonfirsov
Member Author

antonfirsov commented Aug 9, 2021

I need to do some reading there and maybe see if we can get someone from the runtime team who worked in that area to comment.

That would be awesome, but I'm afraid they would share the concerns around unsafe memory and push back.

I wonder if we could write an analyzer?

Would it work out of the box just by using the library? Note that it won't help existing users who update the package without recompiling and then run into a potential security issue.

@JimBobSquarePants
Member

The trick, I think, would be to make the analyzer a dependency of the main library, like xUnit does.

Have we made any breaking changes that require recompilation? Maybe we should just to ensure people should rebuild. 👿

@antonfirsov
Member Author

antonfirsov commented Aug 10, 2021

GrabYourPitchforks — Yesterday at 6:09 PM
The original design for Memory didn't have a span accessor; you had to provide a delegate and the span would be passed as a parameter to the delegate. This solved many (but not all) ownership problems.

There is a safe, breaking way to re-implement span accessors by using delegates, inspired by the comment above:

  public class Image<T>
  {
-     public Span<T> GetPixelRowSpan(int y);
-     public bool TryGetSinglePixelSpan(out Span<TPixel> span);
+     public void ProcessPixels(Action<PixelAccessor<T>> rowProcessor);
+     public bool TryProcessSinglePixelSpan(SpanAction<T> pixelProcessor);
  }

+ public ref struct PixelAccessor<T>
+ {
+     Span<T> GetRowSpan(int y);
+ }

The simplest thing would be to go ahead with this breaking change and bump ImageSharp version number to 2.0. The improvements will justify the change.
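The safety property of the sketch above (the span cannot outlive the callback) can be demonstrated with a minimal self-contained version. Everything here is illustrative, not the proposed API surface:

```csharp
using System;

// A ref struct accessor confines the row span to the callback's scope, so the
// backing memory cannot be finalized/freed while user code still holds a span.
public readonly ref struct RowAccessorSketch<T>
{
    private readonly Span<T> buffer;
    private readonly int width;

    public RowAccessorSketch(Span<T> buffer, int width)
    {
        this.buffer = buffer;
        this.width = width;
    }

    public Span<T> GetRowSpan(int y) => this.buffer.Slice(y * this.width, this.width);
}

public delegate void RowProcessorSketch<T>(RowAccessorSketch<T> accessor);

public static class ImageSketch
{
    // The span handed to 'processor' is only valid inside the callback.
    public static void ProcessPixels<T>(T[] pixels, int width, RowProcessorSketch<T> processor)
        => processor(new RowAccessorSketch<T>(pixels, width));
}
```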

@JimBobSquarePants
Member

bump ImageSharp version number to 2.0

2.0 was to be my kill all old target frameworks release. I want to ship a working V1 of Fonts and Drawing before starting work on it.

@antonfirsov
Member Author

antonfirsov commented Aug 10, 2021

That can be 3.0 then. We follow semantic versioning more or less, so there's no reason to be afraid of major version jumps when breaking changes land, IMO.

@antonfirsov
Member Author

antonfirsov commented Aug 10, 2021

But I'm also fine with a hard-breaking 1.1; this is more about PR and communication than anything else.

However, renaming the milestones seems like the better thing to do to me; we can even benefit from it.

@JimBobSquarePants
Member

JimBobSquarePants commented Aug 10, 2021

My only issue with jumping from 2.0 to 3.0 is that, in real terms, it would probably occur over a short timespan, which, in my opinion, does not reflect well on the quality. 1.1 would be, by far, my next desired target.

This is a massive breaking change though so I'm deeply conflicted. 🙁

@Sergio0694
Member

Disclaimer: I'm not home and with barely functioning internet, so might've missed some bits here.

I have a question regarding PixelAccessor<T>: isn't it quite similar to what we already offer with the ProcessPixelRows___ APIs? Given that those also allow working on arbitrary pixel types and offer batched and vectorized pixel format conversion, should we consider just pushing devs currently accessing individual pixels towards those APIs instead? If not, and we still want to offer a way to access individual pixel values, should we align the design of these new APIs with those, for consistency? Just some random thoughts here 😄

Btw amazing work on all this @antonfirsov, I'll also need to find some time to carefully go through all this like James said and have a proper read, as the whole investigation seems super interesting! 🚀

@JimBobSquarePants
Member

After careful consideration, I'm up for a V2 release. It's a good opportunity to fix a few things, plus we are already adding a significant amount of fixes/functionality to the release, so let's make a show of it.
