Lock that uses congestion detection for self-tuning #93879

Closed · wants to merge 6 commits

Conversation

@VSadov (Member) commented Oct 23, 2023

This is basically the same as #87672, with some additions.

The main difference is that TryEnterSlow uses the same algorithm as the NativeAOT lock, so that congestion detection can be used for self-tuning.

The essence of the algorithm is:

  • while trying to acquire the lock in the slow-path loop, we also observe whether the lock is changing ownership.
  • if, while we were trying to acquire, the lock was passed between 2 other threads, we take that as a signal that there is too much competition and dial down the spin limit.
  • a shorter spin limit makes unsuccessful spinners sleep earlier, which statistically reduces the overall number of concurrent spinners and decreases competition.
  • conversely, if successful spinning gets close to the limit and no congestion is seen, the limit is increased.
  • every thread that acquires the lock via the slow path gets to cast a vote on how the acquisition went (too crowded, could allow more spinning, or nothing special); see the sketch after this list.

The goal of this scheme is to scale better in cases of heavy concurrent use of the lock (i.e. > 4-8 threads).
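
To make this concrete, below is a minimal, self-contained sketch of a spin loop that tunes a shared spin limit from such votes. This is not the actual TryEnterSlow code from this PR; the type and the names/thresholds in it (SelfTuningSpinLock, MinSpinLimit, the halving/doubling of the limit, the "close to the limit" margin) are hypothetical and only illustrate the shape of the algorithm described above.

using System;
using System.Threading;

// Hypothetical illustration of congestion voting -- not the Lock implementation from this PR.
internal sealed class SelfTuningSpinLock
{
    private const int MinSpinLimit = 16;
    private const int MaxSpinLimit = 4096;

    private int _owner;               // 0 = free, otherwise the managed thread id of the owner
    private int _spinLimit = 128;     // shared spin limit, adjusted by the "votes" below

    public void Enter()
    {
        int me = Environment.CurrentManagedThreadId;
        if (Interlocked.CompareExchange(ref _owner, me, 0) == 0)
            return;                   // fast path: uncontended acquire

        EnterSlow(me);
    }

    public void Exit() => Volatile.Write(ref _owner, 0);

    private void EnterSlow(int me)
    {
        int limit = Volatile.Read(ref _spinLimit);
        int lastSeenOwner = 0;
        int distinctOwnersSeen = 0;   // how many different owners we observed while waiting

        for (int spins = 0; ; spins++)
        {
            int owner = Volatile.Read(ref _owner);
            if (owner == 0 && Interlocked.CompareExchange(ref _owner, me, 0) == 0)
            {
                // Cast the vote for this acquisition:
                if (distinctOwnersSeen >= 2)
                {
                    // The lock was passed between two other threads while we waited:
                    // too crowded, dial the shared spin limit down.
                    SetLimit(limit / 2);
                }
                else if (spins >= limit - limit / 4)
                {
                    // We succeeded, but only barely within the limit, and saw no
                    // congestion: there may be room for more spinning.
                    SetLimit(limit * 2);
                }
                return;               // otherwise: nothing special, leave the limit alone
            }

            if (owner != 0 && owner != lastSeenOwner)
            {
                lastSeenOwner = owner;
                distinctOwnersSeen++; // the lock changed hands while we were waiting
            }

            if (spins >= limit)
            {
                // Unsuccessful spinner: sleep instead of adding to the contention,
                // then start over with the (possibly updated) limit.
                Thread.Sleep(1);
                spins = 0;
                distinctOwnersSeen = 0;
                limit = Volatile.Read(ref _spinLimit);
            }
            else
            {
                Thread.SpinWait(1 + spins % 32);
            }
        }
    }

    private void SetLimit(int value) =>
        Volatile.Write(ref _spinLimit, Math.Clamp(value, MinSpinLimit, MaxSpinLimit));
}

A real lock of course also has to manage waiters and wake-ups; the sketch only shows how owner-change observations turn into votes that shrink or grow the spin limit.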

@dotnet-issue-labeler

Note regarding the new-api-needs-documentation label:

This serves as a reminder: when your PR modifies a ref *.cs file and adds or modifies public APIs, please make sure the API implementation in the src *.cs file is documented with triple-slash comments, so the PR reviewers can sign off on that change.

@ghost commented Oct 23, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

This is basically the same as #87672

The difference is that TryEnterSlow uses the same algorithm as the NativeAOT lock, so that congestion detection can be used for self-tuning.

Author: VSadov
Assignees: VSadov
Labels:

new-api-needs-documentation, area-NativeAOT-coreclr

Milestone: -

@ghost commented Oct 23, 2023

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details

This is basically the same as #87672, with some additions.

The difference is that TryEnterSlow uses the same algorithm as the NativeAOT lock, so that congestion detection can be used for self-tuning.

The essence of the algorithm is:

  • while trying to acquire the lock in the slow-path loop, we also observe whether the lock is changing ownership.
  • if, while we were trying to acquire, the lock was passed between 2 other threads, we take that as a signal that there is too much competition and dial down the spin limit.
  • a shorter spin limit makes unsuccessful spinners sleep earlier, which statistically reduces the overall number of concurrent spinners and decreases competition.
  • conversely, if successful spinning gets close to the limit and no congestion is seen, the limit is increased.
  • every thread that acquires the lock via the slow path gets to cast a vote on how the acquisition went (too crowded, could allow more spinning, or nothing special).

The goal of this scheme is to scale better in cases of heavy concurrent use of the lock (i.e. > 4-8 threads).

Author: VSadov
Assignees: VSadov
Labels:

area-System.Threading, new-api-needs-documentation, area-NativeAOT-coreclr

Milestone: -

@VSadov (Member, Author) commented Oct 23, 2023

Here is an example of using this Lock in a short-held scenario.
"Short-held" here means that threads spend relatively little time inside the lock compared to the time outside of the lock.

Scenarios like 50/50 inside/outside can also be measured, but they are less interesting, as such a scenario on its own has little room to scale beyond 2 threads (3 threads can't all spend 50% of their total time inside the same lock).

The code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

namespace ConsoleApp12
{
    internal class Program
    {
        private const int iters = 10000000;

        static void Main(string[] args)
        {
            for (; ; )
            {
                for (int i = 0; i < 7; i++)
                {
                    int thrCount = 1 << i;
                    System.Console.WriteLine("Threads:" + thrCount);

                    for (int j = 0; j < 4; j++)
                    {
                        System.Console.Write("Fat Lock: ");
                        RunMany(() => FatLock(thrCount), thrCount);
                    }
                }

                System.Console.WriteLine();
            }
        }

        static void RunMany(Action f, int threadCount)
        {
            Thread[] threads = new Thread[threadCount];
            bool start = false;

            for (int i = 0; i < threads.Length; i++)
            {
                threads[i] = new Thread(
                    () =>
                    {
                        while (!start) Thread.SpinWait(1);
                        f();
                    }
                );

                threads[i].Start();
            }

            Thread.Sleep(10);
            start = true;

            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < threads.Length; i++)
            {
                threads[i].Join();
            }

            System.Console.WriteLine("Ops per msec: " + iters / sw.ElapsedMilliseconds);
        }

        private static int Fib(int n) => n < 2 ? 1 : Fib(n - 1) + Fib(n - 2);
        private static int ComputeSomething(Random random)
        {
            int delay = random.Next(4, 10);
            return Fib(delay);
        }

        static System.Threading.Lock fatLock = new System.Threading.Lock();

        public static int sharedCounter = 0;
        public static Dictionary<int, int> sharedDictionary = new Dictionary<int, int>();

        static void FatLock(int thrCount)
        {
            Random random = new Random();
            for (int i = 0; i < iters / thrCount; i++)
            {
                // do some computation
                int value = ComputeSomething(random);
                var scope = fatLock.EnterScope();
                {
                    // update shared state
                    sharedCounter += value;
                    sharedDictionary[i] = sharedCounter;
                }
                scope.Dispose();
            }
        }
    }
}

Results on:

Windows 10 x64
AMD Ryzen 9 5950X 16-Core Processor
Logical processors: 32

Higher numbers are better.

=== Lock with congestion sensing (in this PR).

Threads:1
Fat Lock: Ops per msec: 22988
Fat Lock: Ops per msec: 23201
Fat Lock: Ops per msec: 23310
Fat Lock: Ops per msec: 23201
Threads:2
Fat Lock: Ops per msec: 20833
Fat Lock: Ops per msec: 20703
Fat Lock: Ops per msec: 20283
Fat Lock: Ops per msec: 20618
Threads:4
Fat Lock: Ops per msec: 16103
Fat Lock: Ops per msec: 16181
Fat Lock: Ops per msec: 16051
Fat Lock: Ops per msec: 16129
Threads:8
Fat Lock: Ops per msec: 16835
Fat Lock: Ops per msec: 16750
Fat Lock: Ops per msec: 16863
Fat Lock: Ops per msec: 16977
Threads:16
Fat Lock: Ops per msec: 16891
Fat Lock: Ops per msec: 17035
Fat Lock: Ops per msec: 16863
Fat Lock: Ops per msec: 16920
Threads:32
Fat Lock: Ops per msec: 16694
Fat Lock: Ops per msec: 16778
Fat Lock: Ops per msec: 16778
Fat Lock: Ops per msec: 16835
Threads:64
Fat Lock: Ops per msec: 16835
Fat Lock: Ops per msec: 16863
Fat Lock: Ops per msec: 16863
Fat Lock: Ops per msec: 16778

=== Results for #87672 (the base PR)

Threads:1
Fat Lock: Ops per msec: 23041
Fat Lock: Ops per msec: 23529
Fat Lock: Ops per msec: 23255
Fat Lock: Ops per msec: 23474
Threads:2
Fat Lock: Ops per msec: 20920
Fat Lock: Ops per msec: 20790
Fat Lock: Ops per msec: 20920
Fat Lock: Ops per msec: 20920
Threads:4
Fat Lock: Ops per msec: 11507
Fat Lock: Ops per msec: 11534
Fat Lock: Ops per msec: 11148
Fat Lock: Ops per msec: 11025
Threads:8
Fat Lock: Ops per msec: 9199
Fat Lock: Ops per msec: 9225
Fat Lock: Ops per msec: 9174
Fat Lock: Ops per msec: 9208
Threads:16
Fat Lock: Ops per msec: 9208
Fat Lock: Ops per msec: 9157
Fat Lock: Ops per msec: 9157
Fat Lock: Ops per msec: 9157
Threads:32
Fat Lock: Ops per msec: 9165
Fat Lock: Ops per msec: 9149
Fat Lock: Ops per msec: 9140
Fat Lock: Ops per msec: 9165
Threads:64
Fat Lock: Ops per msec: 9115
Fat Lock: Ops per msec: 9140
Fat Lock: Ops per msec: 9099
Fat Lock: Ops per msec: 9090

@VSadov (Member, Author) commented Oct 23, 2023

Here are the results for the same benchmark as above, but with 2 locks. Sometimes a program has more than one lock.

It is the same driver code; only this part is different:

. . . 
. . . 
        static System.Threading.Lock fatLock1 = new System.Threading.Lock();
        static System.Threading.Lock fatLock2 = new System.Threading.Lock();

        public static int sharedCounter1 = 0;
        public static int sharedCounter2 = 0;

        public static Dictionary<int, int> sharedDictionary1 = new Dictionary<int, int>();
        public static Dictionary<int, int> sharedDictionary2 = new Dictionary<int, int>();

        static void FatLock(int thrCount)
        {
            Random random = new Random();
            for (int i = 0; i < iters / thrCount; i++)
            {
                // do some computation
                int value = ComputeSomething(random);

                if (i % 2 == 0)
                {
                    var scope = fatLock1.EnterScope();
                    {
                        // update shared state
                        sharedCounter1 += value;
                        sharedDictionary1[i] = sharedCounter1;
                    }
                    scope.Dispose();

                }
                else
                {
                    var scope = fatLock2.EnterScope();
                    {
                        // update shared state
                        sharedCounter2 += value;
                        sharedDictionary2[i] = sharedCounter2;
                    }
                    scope.Dispose();
                }
            }
        }

=== Congestion-sensing (this PR)

Threads:1
Fat Lock: Ops per msec: 22471
Fat Lock: Ops per msec: 22779
Fat Lock: Ops per msec: 22779
Fat Lock: Ops per msec: 22779
Threads:2
Fat Lock: Ops per msec: 23640
Fat Lock: Ops per msec: 23640
Fat Lock: Ops per msec: 23584
Fat Lock: Ops per msec: 23529
Threads:4
Fat Lock: Ops per msec: 17241
Fat Lock: Ops per msec: 17152
Fat Lock: Ops per msec: 17182
Fat Lock: Ops per msec: 17182
Threads:8
Fat Lock: Ops per msec: 15015
Fat Lock: Ops per msec: 14792
Fat Lock: Ops per msec: 15015
Fat Lock: Ops per msec: 15037
Threads:16
Fat Lock: Ops per msec: 14347
Fat Lock: Ops per msec: 14347
Fat Lock: Ops per msec: 14285
Fat Lock: Ops per msec: 14245
Threads:32
Fat Lock: Ops per msec: 13869
Fat Lock: Ops per msec: 13888
Fat Lock: Ops per msec: 13831
Fat Lock: Ops per msec: 14025
Threads:64
Fat Lock: Ops per msec: 13736
Fat Lock: Ops per msec: 13605
Fat Lock: Ops per msec: 13661
Fat Lock: Ops per msec: 13736

=== Base PR

Threads:1
Fat Lock: Ops per msec: 22471
Fat Lock: Ops per msec: 22321
Fat Lock: Ops per msec: 22421
Fat Lock: Ops per msec: 22371
Threads:2
Fat Lock: Ops per msec: 27700
Fat Lock: Ops per msec: 27548
Fat Lock: Ops per msec: 27173
Fat Lock: Ops per msec: 27548
Threads:4
Fat Lock: Ops per msec: 14598
Fat Lock: Ops per msec: 14577
Fat Lock: Ops per msec: 14534
Fat Lock: Ops per msec: 14556
Threads:8
Fat Lock: Ops per msec: 7806
Fat Lock: Ops per msec: 7782
Fat Lock: Ops per msec: 7733
Fat Lock: Ops per msec: 7824
Threads:16
Fat Lock: Ops per msec: 4844
Fat Lock: Ops per msec: 4847
Fat Lock: Ops per msec: 4821
Fat Lock: Ops per msec: 4840
Threads:32
Fat Lock: Ops per msec: 4791
Fat Lock: Ops per msec: 4793
Fat Lock: Ops per msec: 4789
Fat Lock: Ops per msec: 4844
Threads:64
Fat Lock: Ops per msec: 4759
Fat Lock: Ops per msec: 4761
Fat Lock: Ops per msec: 4761
Fat Lock: Ops per msec: 4752

@VSadov (Member, Author) commented Oct 23, 2023

In the last example, the base PR ends up with roughly 3X worse throughput than the congestion-sensing approach (e.g. at 64 threads: ~13,700 vs. ~4,760 ops per msec).

The reason is excessive spinning on a heavily contested lock. When threads can't acquire the lock in one shot, they retry in a loop, which eventually succeeds - but at great cost, since spinning on highly contested state is expensive (many cache misses, unnecessary transfers of cache line ownership between cores, possibly some thermal effects, ...).
The implementation then uses that "success" as a signal to allow even more spinning...

Excessive spinning also takes resources that could be used by threads not involved in this lock - even threads contending on another lock under similar conditions. Thus two locks end up behaving much worse than just one.

While the system still makes progress, it is more fruitful for some of the contestants of a busy lock to leave the "traffic jam" and take a nap, while the remaining contestants do the same work with less overhead.

NOTE: this should not be read as simply "spinning is bad". Spinning is good when it is cheap; it is expensive spinning that should be avoided, as the sketch below illustrates.
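
For illustration only - this is not code from either PR - the sketch below contrasts an "expensive" spin, which hammers the contended word with an interlocked operation on every iteration, with a cheaper spin, which mostly reads the (shared) cache line, backs off, and only attempts the interlocked operation when the lock looks free:

using System.Threading;

internal static class SpinStyles
{
    // Expensive: every iteration is an interlocked read-modify-write, so every spinner
    // keeps pulling the cache line into exclusive state on its own core.
    internal static void AcquireExpensive(ref int lockWord)
    {
        while (Interlocked.CompareExchange(ref lockWord, 1, 0) != 0)
        {
        }
    }

    // Cheaper: spin on plain reads (the line can stay shared between spinners),
    // back off progressively, and only do the read-modify-write when an acquire
    // might actually succeed.
    internal static void AcquireCheaper(ref int lockWord)
    {
        int backoff = 1;
        while (true)
        {
            if (Volatile.Read(ref lockWord) == 0 &&
                Interlocked.CompareExchange(ref lockWord, 1, 0) == 0)
            {
                return;
            }

            Thread.SpinWait(backoff);
            if (backoff < 1024)
                backoff *= 2;
        }
    }
}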

@kouvel (Member) commented Oct 24, 2023

This PR is premature. Please review the considerable amount of information shared in #87672 and take it into consideration before raising PRs.

@kouvel closed this Oct 24, 2023

@kouvel (Member) commented Oct 26, 2023

Let's go ahead and reopen this PR. I should probably let you manage your own PRs, and we're continuing to discuss next steps.

@VSadov (Member, Author) commented Oct 27, 2023

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@VSadov (Member, Author) commented Nov 3, 2023

/azp run runtime-extra-platforms


Azure Pipelines successfully started running 1 pipeline(s).

@VSadov (Member, Author) commented Nov 23, 2023

/azp run runtime-nativeaot-outerloop


Azure Pipelines successfully started running 1 pipeline(s).

@VSadov (Member, Author) commented Dec 18, 2023

/azp run runtime-nativeaot-outerloop


Azure Pipelines successfully started running 1 pipeline(s).

@ghost commented Jan 17, 2024

Draft pull request was automatically closed after 30 days of inactivity. Please let us know if you'd like to reopen it.

@github-actions bot locked and limited conversation to collaborators Feb 17, 2024
@VSadov (Member, Author) commented on this code:

s_minSpinCount = DefaultMinSpinCount << SpinCountScaleShift;

// we can now use the slow path of the lock.
Volatile.Write(ref s_staticsInitializationStage, (int)StaticsInitializationStage.Usable);

The lock is now functional, and the initialization up to this point did not do anything that could take locks. The rest of the initialization is optional and just needs to happen eventually.
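
As an aside, here is a minimal sketch of the staged-initialization pattern described above (hypothetical names; not the Lock statics from this PR): the "Usable" stage is published with a volatile write as soon as the lock can function, and the remaining, optional tuning work may finish later.

using System;
using System.Threading;

// Hypothetical illustration of publishing a "Usable" stage early -- not the actual Lock statics.
internal static class StagedStatics
{
    private enum Stage { NotStarted = 0, Usable = 1, Complete = 2 }

    private static int s_stage = (int)Stage.NotStarted;
    private static int s_minSpinCount;
    private static int s_processorCount;

    internal static void EnsureInitialized()
    {
        if (Volatile.Read(ref s_stage) == (int)Stage.Complete)
            return;

        if (Volatile.Read(ref s_stage) == (int)Stage.NotStarted)
        {
            // Required part: nothing here takes any locks, and after the volatile
            // write below the slow path of the lock is allowed to run.
            s_minSpinCount = 16;
            Volatile.Write(ref s_stage, (int)Stage.Usable);
        }

        // Optional part: it only refines tuning parameters, so it is fine if it
        // happens later (or is repeated by another thread).
        s_processorCount = Environment.ProcessorCount;
        Volatile.Write(ref s_stage, (int)Stage.Complete);
    }
}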

This pull request was closed.