
Conversation

@pedrobsaila
Contributor

Fixes #108333

@dotnet-policy-service bot added the community-contribution label Jun 1, 2025
@rhuijben
Contributor

rhuijben commented Jun 1, 2025

How is the stats object now collected?
With it still referenced, I don't think the finalizer will ever run?

@krwq
Member

krwq commented Dec 11, 2025

@pedrobsaila I agree with @rhuijben that it's not clear what will happen. Perhaps some simpler approach to fix this will work... just two static longs and Interlocked.Add - since we never increment both at the same time, we should be good with this. We should be able to get rid of the Stats class completely and the logic becomes much simpler. I don't think it will have much overhead either.
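A minimal sketch of that idea (illustrative only - the type and member names below are assumptions, not the actual MemoryCache code):

using System.Threading;

// Two shared counters updated with Interlocked; no per-thread Stats object at all.
internal sealed class SharedCounters
{
    private long _totalHits;
    private long _totalMisses;

    // Only one of the two counters is touched per cache lookup.
    public void RecordHit() => Interlocked.Increment(ref _totalHits);
    public void RecordMiss() => Interlocked.Increment(ref _totalMisses);

    // Interlocked.Read keeps the 64-bit reads atomic on 32-bit platforms as well.
    public long TotalHits => Interlocked.Read(ref _totalHits);
    public long TotalMisses => Interlocked.Read(ref _totalMisses);
}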

Member

@krwq left a comment


Please take a look at the simpler approach of Interlocked.Add with two longs (perf results recommended)

@krwq
Member

krwq commented Dec 11, 2025

In case Interlocked.Add is not suitable, also consider not having a Stats class at all; perhaps a struct would be more suitable - then you wouldn't have a reference that could block collection.

@krwq
Member

krwq commented Dec 11, 2025

IMO for the regular scenario this should be a win for x64 - as far as I understand this should be close to a single instruction. For 32-bit - well, they'll suffer a bit, but there are fewer allocations and it's not a super common scenario for perf anyway.

@jkotas
Member

jkotas commented Dec 11, 2025

I understand this should be close to a single instruction

Interlocked.Add is a single instruction on x64. This single instruction can take thousands of cycles if multiple threads try to increment the same memory location in parallel.

@pedrobsaila
Contributor Author

pedrobsaila commented Dec 14, 2025

@pedrobsaila I agree with @rhuijben that it's not clear what will happen. Perhaps some simpler approach to fix this will work... just two static longs and Interlocked.Add - since we never increment both at the same time, we should be good with this. We should be able to get rid of the Stats class completely and the logic becomes much simpler. I don't think it will have much overhead either.

See the comment #108333 (comment) explaining the solution I implemented:

Instead of having the ThreadLocal point to the Stats directly, have the ThreadLocal point to some StatsHandle that has a reference to Stats inside. Then change the _allStats to be List and a member of Stats is a WeakReference to the StatsHandle. This way when the thread exits the Stats object doesn't leave the list immediately. The WeakReference allows it to be detected as dead and pruned eventually at the same time it is copied over to the accumulated stats. Pruning the list whenever a new thread is added would probably be reasonable.
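A rough sketch of the shape being described there (the names and members are illustrative, not the PR's actual implementation):

// Per-thread counters; the WeakReference lets pruning detect that the owning
// thread's handle (and therefore the thread) is gone.
internal sealed class Stats
{
    public long Hits;
    public long Misses;
    public readonly WeakReference<StatsHandle> Handle;

    public Stats(StatsHandle handle) => Handle = new WeakReference<StatsHandle>(handle);
}

// Stored in the ThreadLocal; it dies with the thread, while the Stats it created
// stays reachable from _allStats until it is accumulated and pruned.
internal sealed class StatsHandle
{
    public readonly Stats Value;

    public StatsHandle() => Value = new Stats(this);
}

// The cache would then hold something like:
// private readonly ThreadLocal<StatsHandle> _stats;
// private readonly List<Stats> _allStats;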

@pedrobsaila
Contributor Author

Please take a look at the simpler approach of Interlocked.Add with two longs (perf results recommended)

I'll try to implement it

@pedrobsaila
Contributor Author

pedrobsaila commented Dec 14, 2025

Perf test:

py .\scripts\benchmarks_ci.py --frameworks net11.0 --filter Microsoft.Extensions.Caching.Memory.Tests* --corerun "C:\OSS\base-runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe" "C:\OSS\runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe"

Results:
BenchmarkDotNet v0.14.1-nightly.20250107.205, Windows 11 (10.0.26200.7462)
12th Gen Intel Core i7-12700H 2.30GHz, 1 CPU, 20 logical and 14 physical cores
.NET SDK 11.0.100-alpha.1.25613.101
[Host] : .NET 11.0.0 (11.0.25.61401), X64 RyuJIT AVX2
Job-BKTXKG : .NET 11.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-QLWEOI : .NET 11.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Method Job Toolchain Mean Error StdDev Median Min Max Ratio RatioSD Gen0 Gen1 Allocated Alloc Ratio
GetHit Job-BKTXKG \base-runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 23.492 ns 0.2171 ns 0.2031 ns 23.458 ns 23.196 ns 23.870 ns 1.00 0.01 - - - NA
GetHit Job-QLWEOI \runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 23.360 ns 0.1737 ns 0.1540 ns 23.291 ns 23.203 ns 23.704 ns 0.99 0.01 - - - NA
TryGetValueHit Job-BKTXKG \base-runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 23.965 ns 0.2176 ns 0.2035 ns 23.885 ns 23.694 ns 24.394 ns 1.00 0.01 - - - NA
TryGetValueHit Job-QLWEOI \runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 23.480 ns 0.0927 ns 0.0724 ns 23.484 ns 23.384 ns 23.609 ns 0.98 0.01 - - - NA
GetMiss Job-BKTXKG \base-runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 23.398 ns 0.1803 ns 0.1598 ns 23.403 ns 23.109 ns 23.662 ns 1.00 0.01 - - - NA
GetMiss Job-QLWEOI \runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 23.218 ns 0.1754 ns 0.1641 ns 23.191 ns 23.017 ns 23.580 ns 0.99 0.01 - - - NA
TryGetValueMiss Job-BKTXKG \base-runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 23.885 ns 0.2395 ns 0.2240 ns 23.811 ns 23.581 ns 24.318 ns 1.00 0.01 - - - NA
TryGetValueMiss Job-QLWEOI \runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 23.936 ns 0.3531 ns 0.3303 ns 23.792 ns 23.592 ns 24.541 ns 1.00 0.02 - - - NA
SetOverride Job-BKTXKG \base-runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 48.629 ns 1.2697 ns 1.4622 ns 48.260 ns 46.935 ns 51.116 ns 1.00 0.04 0.0083 - 104 B 1.00
SetOverride Job-QLWEOI \runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 48.140 ns 1.1506 ns 1.3250 ns 48.110 ns 46.560 ns 50.569 ns 0.99 0.04 0.0083 - 104 B 1.00
CreateEntry Job-BKTXKG \base-runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 7.219 ns 0.2026 ns 0.2252 ns 7.150 ns 6.917 ns 7.849 ns 1.00 0.04 0.0083 - 104 B 1.00
CreateEntry Job-QLWEOI \runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 7.260 ns 0.2296 ns 0.2644 ns 7.195 ns 6.977 ns 7.801 ns 1.01 0.05 0.0083 - 104 B 1.00
AddThenRemove_NoExpiration Job-BKTXKG \base-runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 12,094.067 ns 270.2794 ns 311.2542 ns 12,103.877 ns 11,654.985 ns 12,536.550 ns 1.00 0.04 1.9752 0.0470 25081 B 1.00
AddThenRemove_NoExpiration Job-QLWEOI \runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 11,978.743 ns 263.5200 ns 303.4701 ns 11,791.359 ns 11,686.858 ns 12,520.976 ns 0.99 0.03 1.9926 0.0486 25081 B 1.00
AddThenRemove_AbsoluteExpiration Job-BKTXKG \base-runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 12,047.811 ns 183.1483 ns 171.3170 ns 12,101.333 ns 11,707.895 ns 12,337.566 ns 1.00 0.02 1.9917 0.0474 25081 B 1.00
AddThenRemove_AbsoluteExpiration Job-QLWEOI \runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 12,006.855 ns 236.8868 ns 272.7992 ns 12,005.589 ns 11,533.584 ns 12,500.784 ns 1.00 0.03 1.9841 0.0472 25081 B 1.00
AddThenRemove_RelativeExpiration Job-BKTXKG \base-runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 11,919.416 ns 265.5337 ns 305.7890 ns 11,793.403 ns 11,565.048 ns 12,424.461 ns 1.00 0.04 1.9895 0.0485 25081 B 1.00
AddThenRemove_RelativeExpiration Job-QLWEOI \runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 11,964.243 ns 252.4622 ns 290.7359 ns 11,849.494 ns 11,617.839 ns 12,661.347 ns 1.00 0.03 1.9864 0.0484 25081 B 1.00
AddThenRemove_SlidingExpiration Job-BKTXKG \base-runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 11,886.069 ns 259.9496 ns 299.3584 ns 11,784.046 ns 11,511.457 ns 12,431.519 ns 1.00 0.03 1.9712 0.0481 25081 B 1.00
AddThenRemove_SlidingExpiration Job-QLWEOI \runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 11,914.906 ns 234.6409 ns 260.8028 ns 11,770.152 ns 11,597.445 ns 12,383.894 ns 1.00 0.03 1.9692 0.0469 25081 B 1.00
AddThenRemove_ExpirationTokens Job-BKTXKG \base-runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 17,458.902 ns 340.0463 ns 377.9606 ns 17,449.885 ns 16,885.478 ns 18,043.080 ns 1.00 0.03 2.9762 0.1353 37905 B 1.00
AddThenRemove_ExpirationTokens Job-QLWEOI \runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 17,330.767 ns 329.7917 ns 323.8996 ns 17,158.839 ns 17,062.514 ns 18,067.870 ns 0.99 0.03 2.9796 0.1419 37905 B 1.00

@jkotas
Member

jkotas commented Dec 14, 2025

These micro-benchmarks are insufficient to observe the impact on multithreaded workloads. Try to build a multi-threaded version of the GetHit micro-benchmark - you can use this ArrayPool micro-benchmark as a blueprint: https://github.com/dotnet/performance/blob/f7d7e64dcd2a799dcc2631ba787728c8ebc56141/src/benchmarks/micro/libraries/System.Buffers/ArrayPoolTests.cs#L62-L71

@pedrobsaila
Contributor Author

pedrobsaila commented Dec 14, 2025

Tried the following:

[GlobalSetup(Targets = new[] { nameof(GetHitParallel) })]
public void SetupBasic()
{
    _memCache = new MemoryCache(new MemoryCacheOptions() { TrackStatistics = true });
    for (var i = 0; i < 1024; i++)
    {
        _memCache.Set(i, i.ToString());
    }
}

[GlobalCleanup(Targets = new[] { nameof(GetHitParallel) })]
public void CleanupBasic() => _memCache.Dispose();


[Benchmark]
public void GetHitParallel()
{
    Parallel.For(0, 100_000_000, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount }, i => _memCache.Get("256"));
}

Got:

Method Job Toolchain Mean Error StdDev Median Min Max Ratio RatioSD Allocated Alloc Ratio
GetHitParallel Job-UGGEHX \base-runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 230.9 ms 5.08 ms 5.85 ms 232.2 ms 222.4 ms 242.7 ms 1.00 0.03 - NA
GetHitParallel Job-IXLPFM \runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe 247.3 ms 12.84 ms 14.79 ms 239.6 ms 228.0 ms 281.4 ms 1.07 0.07 - NA

With some warnings:

  • MultimodalDistribution
    MemoryCacheTests.GetHitParallel: PowerPlanMode=00000000-0000-0000-0000-000000000000, Toolchain=\base-runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe, IterationTime=250ms, MaxIterationCount=20, MinIterationCount=15, WarmupCount=1 -> It seems that the distribution can have several modes (mValue = 2.89)
  • MemoryCacheTests.GetHitParallel: PowerPlanMode=00000000-0000-0000-0000-000000000000, Toolchain=\runtime\artifacts\bin\testhost\net11.0-windows-Release-x64\shared\Microsoft.NETCore.App\11.0.0\corerun.exe, IterationTime=250ms, MaxIterationCount=20, MinIterationCount=15, WarmupCount=1 -> It seems that the distribution is bimodal (mValue = 3.64)

@jkotas
Member

jkotas commented Dec 17, 2025

@EgorBot -intel

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Microsoft.Extensions.Caching.Memory;

public class MemoryCacheTest
{
    MemoryCache _memCache;

    [GlobalSetup(Targets = new[] { nameof(GetHitParallel) })]
    public void SetupBasic()
    {
        _memCache = new MemoryCache(new MemoryCacheOptions() { TrackStatistics = true });
        for (var i = 0; i < 1024; i++)
        {
            _memCache.Set(i.ToString(), new object());
        }
    }

    [GlobalCleanup(Targets = new[] { nameof(GetHitParallel) })]
    public void CleanupBasic() => _memCache.Dispose();


    [Benchmark]
    public async Task GetHitParallel()
    {
        await Task.WhenAll(
            Enumerable
            .Range(0, Environment.ProcessorCount)
            .Select(_ =>
            Task.Run(async delegate
            {
                for (int i = 0; i < 1000000; i++)
                {
                    var x = _memCache.Get("256");
                    if (x == null) Environment.FailFast("Unexpected!");
                }
            })));

    }
}

@jkotas
Member

jkotas commented Dec 17, 2025

I have fixed the microbenchmark to actually measure GetHitParallel performance. It shows a 2+x regression with the change in this PR:

EgorBot/runtime-utils#571 (comment)

@pedrobsaila
Contributor Author

For my understanding, why use async for a CPU-bound workload?

@jkotas
Member

jkotas commented Dec 18, 2025

For my understanding, why use async for a CPU-bound workload?

Copy & paste of the ArrayPool microbenchmark that I linked to above. It is convenient to write it using async - it is certainly possible to write it without async.
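For reference, a non-async version of the same workload could look roughly like this (a sketch only, reusing the _memCache field from the benchmark above; it is not the benchmark that was actually run):

[Benchmark]
public void GetHitParallel_Threads()
{
    // Same per-thread loop as the async version, expressed with plain threads.
    var threads = new Thread[Environment.ProcessorCount];
    for (int t = 0; t < threads.Length; t++)
    {
        threads[t] = new Thread(() =>
        {
            for (int i = 0; i < 1000000; i++)
            {
                var x = _memCache.Get("256");
                if (x == null) Environment.FailFast("Unexpected!");
            }
        });
        threads[t].Start();
    }

    foreach (var thread in threads)
    {
        thread.Join();
    }
}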

@pedrobsaila
Contributor Author

pedrobsaila commented Dec 18, 2025

@krwq should I roll back the PR to the previous solution, or do you have in mind a different implementation that is simpler/more performant?

@rosebyte
Member

Hello @pedrobsaila, I like the simplicity and readability you achieved using Interlocked.Increment. However, we need every bit of performance here, so the ThreadLocal implementation (although admittedly less elegant) seems like a better fit.

What I found interesting is @noahfalk’s comment in the issue about using handlers to keep the numbers alive. I imagine the handler working something like this:

internal sealed class StatsHandler
{
    private readonly MemoryCache _cache;
    private readonly int _index;
    public Stats Value { get; }

    public StatsHandler(MemoryCache memoryCache)
    {
        _cache = memoryCache;
        Value = new Stats(this);
        _index = memoryCache.AddToStats(Value);
    }

    ~StatsHandler() => _cache.RemoveFromStats(Value, _index);
}

Its role is simply to control the Stats instance - creating it, adding it to _allStats, providing access, and eventually removing it from there. The Stats is then just a simple class to hold the numbers:

internal sealed class Stats
{
    public long Hits;
    public long Misses;
}

The _allStats and _stats fields then have different types:

private readonly ThreadLocal<StatsHandler>? _stats;
private readonly List<Stats>? _allStats;

And as a little bonus, we don't have to iterate the whole _allStats list to remove one: since we only add new instances to the end, the one to remove must be at the same or a lower index than the one it was added at:

private void RemoveFromStats(Stats current, int index)
{
    lock (_allStats!)
    {
        _accumulatedHits += Volatile.Read(ref current.Hits);
        _accumulatedMisses += Volatile.Read(ref current.Misses);

        for (var i = index; i >= 0; i--)
        {
            if (ReferenceEquals(_allStats[i], current))
            {
                _allStats.RemoveAt(i);
                break;
            }
        }

        _allStats.TrimExcess();
    }
}

There will be a few places in the class that need to be updated to account for these changes, but in the end the performance profile should stay roughly the same, just with the bug fixed. This is just an idea - what do you think, does it make sense?
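For completeness, the AddToStats side assumed by the StatsHandler above could look roughly like this (again only a sketch, not existing code):

private int AddToStats(Stats stats)
{
    lock (_allStats!)
    {
        _allStats.Add(stats);
        // New entries always go to the end, so this index is an upper bound
        // for the backwards scan done in RemoveFromStats.
        return _allStats.Count - 1;
    }
}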

@rosebyte
Member

rosebyte commented Jan 29, 2026

I see you actually had an initial implementation very close to the one I was thinking about, and it seems plausible to me. @jkotas, I guess you reviewed the Interlocked.Increment implementation only, but how about keeping the ThreadLocal with handlers so we keep the stats alive between when the thread is garbage collected and when its finalizer executes? Is there a catch I don't see?

@rosebyte added the needs-author-action label Jan 29, 2026
@jkotas
Member

jkotas commented Jan 29, 2026

#116193 (comment) suggested that the original implementation had a leak. I have not reviewed it in detail. I am sure you can make it work. We need to make sure that it does not leak (i.e. it does not keep the stats for dead threads around forever) and that it otherwise performs well.

@dotnet-policy-service bot removed the needs-author-action label Jan 31, 2026
@pedrobsaila
Contributor Author

pedrobsaila commented Jan 31, 2026

Results of the program below, run in Release configuration:

using Microsoft.Extensions.Caching.Memory;

MemoryCache cache = new(new MemoryCacheOptions { TrackStatistics = true });

void RunThread()
{
    Thread t = new(() =>
    {
        for (int j = 0; j < 10_000; j += 1)
        {
            _ = cache.Get("");
        }

        RunThread();
    })
    {
        Name = "Cache Worker",
    };
    t.Start();
}

for (int i = 0; i < Environment.ProcessorCount - 1; i += 1)
{
    RunThread();
}

Thread integrityThread = new(() =>
{
    long lastValue = -1;
    while (true)
    {
        long newValue = cache.GetCurrentStatistics()!.TotalMisses;
        if (newValue < lastValue)
        {
            Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} ERROR: total misses decreased from {lastValue} to {newValue} (-{lastValue - newValue})");
        }

        lastValue = newValue;
    }
})
{
    Name = "Stats Integrity Checker"
};
integrityThread.Start();
integrityThread.Join();
  • With Microsoft.Extensions.Caching.Memory nuget package Version 10.02: [screenshot of console output]
  • With local branch: [screenshot of console output]

Using PerfView, we see that StatsHandler objects do get finalized with the same frequency as Thread objects:

Finalized Object Counts for Process:

Type Count
System.Threading.Thread 263211
FinalizationHelper[Microsoft.Extensions.Caching.Memory.MemoryCache+StatsHandler] 263206
StatsHandler 262972
System.Gen2GcCallback 32
DestroyScout 28
System.Reflection.Emit.DynamicResolver 4
System.Threading.ThreadPoolWorkQueueThreadLocals 4
ThreadLocalNodeFinalizationHelper 4

@jeffhandley jeffhandley requested a review from krwq February 1, 2026 20:29

Labels

area-Extensions-Caching, community-contribution


Development

Successfully merging this pull request may close this issue: Inconsistent MemoryCache stats (#108333)
