Revamp caching scheme in PoolingAsyncValueTaskMethodBuilder #55955
Conversation
Tagging subscribers to this area: @dotnet/area-system-threading-tasks

Issue Details

The current scheme caches one instance per thread in a ThreadStatic field, plus a locked stack that all threads contend on; to avoid blocking a thread while accessing that cache, locking is done with TryEnter rather than Enter, simply skipping the cache if there is any contention. The locked stack is capped by default at ProcessorCount * 4 objects.

The new scheme is simpler: one instance per thread, plus one instance per core. That means fewer objects may be cached, but it also almost entirely eliminates contention between threads renting and returning objects. As a result, under heavy load it can actually make better use of pooled objects, because it doesn't bail out of the cache in the face of contention. It also reduces the concern that larger machines are more negatively impacted by the caching. Under lighter load, since we don't cache as many objects, we may end up allocating a bit more, but generally not much more (and each object we do allocate is one reference field smaller).
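For illustration, here is a minimal sketch of the "one per thread, one per core" rent/return logic, using hypothetical type and member names (the actual implementation lives in PoolingAsyncValueTaskMethodBuilderT.cs):

// Sketch only: hypothetical names to illustrate the "one per thread, one per core"
// scheme; this is not the actual runtime implementation.
using System;
using System.Threading;

internal sealed class PooledBox
{
    // One instance per thread: rented and returned with no synchronization at all.
    [ThreadStatic]
    private static PooledBox? t_cached;

    // One instance per core: a single slot per processor replaces the locked stack.
    private static readonly PooledBox?[] s_perCore = new PooledBox?[Environment.ProcessorCount];

    public static PooledBox Rent()
    {
        PooledBox? box = t_cached;
        if (box is not null)
        {
            t_cached = null;
            return box;
        }

        // Fall back to the current core's slot; a single Interlocked.Exchange limits
        // any contention to threads that happen to share a core.
        int i = Thread.GetCurrentProcessorId() % s_perCore.Length;
        return Interlocked.Exchange(ref s_perCore[i], null) ?? new PooledBox();
    }

    public static void Return(PooledBox box)
    {
        if (t_cached is null)
        {
            t_cached = box;
            return;
        }

        // Overwriting an occupied slot just drops one cached object, which is acceptable for a cache.
        int i = Thread.GetCurrentProcessorId() % s_perCore.Length;
        Volatile.Write(ref s_perCore[i], box);
    }
}

The following benchmark compares the two schemes: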
using System.Linq;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class Program
{
    public static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    private const int Concurrency = 256;
    private const int Iters = 100_000;

    // Baseline: the default builder allocates a new state machine box each time a method suspends.
    [Benchmark]
    public Task NonPooling()
    {
        return Task.WhenAll(from i in Enumerable.Range(0, Concurrency)
                            select Task.Run(async delegate
                            {
                                for (int j = 0; j < Iters; j++)
                                    await A().ConfigureAwait(false);
                            }));

        static async ValueTask A() => await B().ConfigureAwait(false);
        static async ValueTask B() => await C().ConfigureAwait(false);
        static async ValueTask C() => await D().ConfigureAwait(false);
        static async ValueTask D() => await Task.Yield();
    }

    // Pooled: PoolingAsyncValueTaskMethodBuilder rents and returns boxes from the cache instead.
    [Benchmark]
    public Task Pooling()
    {
        return Task.WhenAll(from i in Enumerable.Range(0, Concurrency)
                            select Task.Run(async delegate
                            {
                                for (int j = 0; j < Iters; j++)
                                    await A().ConfigureAwait(false);
                            }));

        [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]
        static async ValueTask A() => await B().ConfigureAwait(false);
        [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]
        static async ValueTask B() => await C().ConfigureAwait(false);
        [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]
        static async ValueTask C() => await D().ConfigureAwait(false);
        [AsyncMethodBuilder(typeof(PoolingAsyncValueTaskMethodBuilder))]
        static async ValueTask D() => await Task.Yield();
    }
}
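With 256 concurrent tasks each awaiting a chain of four nested async ValueTask methods per iteration, every iteration of the Pooling benchmark rents and returns a builder box at each level of the chain, so the cache is exercised from many threads at once; NonPooling allocates a fresh state machine box at each level instead.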
LGTM!
> It also reduces concerns about larger machines being more negatively impacted by the caching

To validate that, you could use this template, modify it, and run the benchmarks with and without your changes on the AMD (32 cores), ARM (48 cores), and Mono (56 cores) machines.
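(With the BenchmarkSwitcher entry point above, that can be done with, e.g., dotnet run -c Release in the benchmark project, using BenchmarkDotNet's usual command-line arguments to filter which benchmarks run.)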
...m.Private.CoreLib/src/System/Runtime/CompilerServices/PoolingAsyncValueTaskMethodBuilderT.cs
This is on my 12-logical-core box: