Add normalized equivalent of YieldProcessor, retune some spin loops #13670
Conversation
Numbers from Thread.SpinWait divide count by 7 experiment: Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):
Core i7-6700 (Skylake, 4-core, 8-thread):
Numbers from this PR: Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):
Core i7-6700 (Skylake, 4-core, 8-thread):
/// when the resource becomes available.
/// </summary>
internal static readonly int SpinCountforSpinBeforeWait = PlatformHelper.IsSingleProcessor ? 1 : 35;
internal const int Sleep1ThresholdForSpinBeforeWait = 40; // should be greater than MaxSpinCountBeforeWait
What is MaxSpinCountBeforeWait?
Oops renamed that one, will fix
// (_count - YieldThreshold) % 2 == 0: The purpose of this check is to interleave Thread.Yield/Sleep(0) with
// Thread.SpinWait. Otherwise, the following issues occur:
// - When there are no threads to switch to, Yield and Sleep(0) become no-op and it turns the spin loop into a
//   busy -spin that may quickly reach the max spin count and cause the thread to enter a wait state, or may
Nit: extra space in "busy -spin"
// contention), they may switch between one another, delaying work that can make progress.
if ((
    _count >= YieldThreshold &&
    (_count >= sleep1Threshold || (_count - YieldThreshold) % 2 == 0)
Nit: the formatting here reads strangely to me
It's formatted similarly to:
if (a ||
    b)

where

a ==
    (
        c &&
        d
    )
This is how I typically format multi-line expressions, trying to align parentheses and putting each type of expression (&& or ||) separately, one condition per line unless the whole expression fits on one line. What would you suggest instead? I can separate parts of it into locals if you prefer.
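For concreteness, here is roughly how that convention looks when applied to the condition in the quoted diff. This is a formatting illustration only; `isSingleProcessor` is a stand-in for the second operand of the outer ||, which is not shown in the quoted hunk.

```cs
// Formatting illustration only, using the names from the quoted diff.
// 'isSingleProcessor' is a stand-in for the outer ||'s second operand.
static bool ShouldYieldOrSleep(int _count, int YieldThreshold, int sleep1Threshold, bool isSingleProcessor)
{
    return
        (
            _count >= YieldThreshold &&
            (_count >= sleep1Threshold || (_count - YieldThreshold) % 2 == 0)
        ) ||
        isSingleProcessor;
}
```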
Thanks, @kouvel. Do you have any throughput numbers on the thread pool with this change?
The only use of Thread.SpinWait I found in the thread pool is in RegisteredWaitHandleSafe.Unregister, which I don't think is interesting. I have not measured the perf for Task.SpinWait; I can do that if you would like.
Code used for Spin perf:
Code used for ReaderWriterLockSlim perf:

```cs
// 'rw' is a ReaderWriterLockSlim, and readerThreadCount/writerThreadCount are
// supplied by the surrounding test harness (not shown).
var sw = new Stopwatch();
var scores = new double[16];
var startThreads = new ManualResetEvent(false);
bool stop = false;
var counts = new int[64]; // reads tallied at [16], writes at [32]; indices spread out, presumably to keep the counters on separate cache lines
var readerThreads = new Thread[readerThreadCount];
ThreadStart readThreadStart =
() =>
{
startThreads.WaitOne();
while (!stop)
{
rw.EnterReadLock();
rw.ExitReadLock();
Interlocked.Increment(ref counts[16]);
}
};
for (int i = 0; i < readerThreadCount; ++i)
{
readerThreads[i] = new Thread(readThreadStart);
readerThreads[i].IsBackground = true;
readerThreads[i].Start();
}
var writeLockAcquireAndReleasedInnerIterationCountTimes = new AutoResetEvent(false);
var writerThreads = new Thread[writerThreadCount];
ThreadStart writeThreadStart =
() =>
{
startThreads.WaitOne();
while (!stop)
{
rw.EnterWriteLock();
rw.ExitWriteLock();
Interlocked.Increment(ref counts[32]);
}
};
for (int i = 0; i < writerThreadCount; ++i)
{
writerThreads[i] = new Thread(writeThreadStart);
writerThreads[i].IsBackground = true;
writerThreads[i].Start();
}
startThreads.Set();
// Warmup
Thread.Sleep(4000);
// Actual run
for(int i = 0; i < scores.Length; ++i)
{
counts[16] = 0;
counts[32] = 0;
Interlocked.MemoryBarrier();
sw.Restart();
Thread.Sleep(500);
sw.Stop();
int readCount = counts[16];
int writeCount = counts[32];
double elapsedMs = sw.Elapsed.TotalMilliseconds;
scores[i] =
new double[]
{
Math.Max(1, (readCount + writeCount) / elapsedMs),
Math.Max(1, writeCount / elapsedMs)
}.GeometricMean(readerThreadCount, writerThreadCount);
}
return scores;
```
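The GeometricMean extension used above isn't included in the snippet. A minimal stand-in, assuming it computes a geometric mean of the per-category rates weighted by the reader and writer thread counts, could look like this (hypothetical helper, not necessarily the one used for the numbers above):

```cs
using System;

internal static class ScoreExtensions
{
    // Hypothetical stand-in for the GeometricMean extension used above:
    // geometric mean of 'values', where weights[i] is the weight of values[i].
    public static double GeometricMean(this double[] values, params int[] weights)
    {
        double weightedLogSum = 0, totalWeight = 0;
        for (int i = 0; i < values.Length; ++i)
        {
            weightedLogSum += weights[i] * Math.Log(values[i]);
            totalWeight += weights[i];
        }
        return Math.Exp(weightedLogSum / totalWeight);
    }
}
```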
ThreadPool's global queue is a ConcurrentQueue, and CQ uses System.Threading.SpinWait when there is contention on various operations, including dequeues.
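For reference, the caller-side pattern for System.Threading.SpinWait (the struct whose spin behavior this PR retunes) generally has the shape below; this is a generic sketch of a contended retry loop, not ConcurrentQueue's actual code.

```cs
using System;
using System.Threading;

static class SpinUntilSketch
{
    // General shape of a contended retry loop over SpinWait. SpinOnce()
    // escalates from Thread.SpinWait to Yield/Sleep(0)/Sleep(1) as the
    // iteration count grows, which is the behavior retuned in this PR.
    public static void SpinUntil(Func<bool> condition)
    {
        var spinner = new SpinWait();
        while (!condition())
        {
            spinner.SpinOnce();
        }
    }
}
```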
Ah ok, I included ConcurrentQueue, I'll add a test for thread pool as well |
Updated code above with the added thread pool throughput test. Looks like there's no change: Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):
Core i7-6700 (Skylake, 4-core, 8-thread):
@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness
@dotnet-bot test Windows_NT x86 full_opt legacy_backend CoreCLR Perf Tests Correctness
@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness
@dotnet-bot test Windows_NT x86 full_opt ryujit CoreCLR Perf Tests Correctness
/// A suggested number of spin iterations before doing a proper wait, such as waiting on an event that becomes signaled
/// when the resource becomes available.
/// </summary>
internal static readonly int SpinCountforSpinBeforeWait = PlatformHelper.IsSingleProcessor ? 1 : 35;
35

Did we get this number from experimenting with different scenarios? Just curious how we came up with it. And does the number of processors not matter?
I experimented with ManualResetEventSlim to get an initial number, applied the same number to other similar situations, and then tweaked it up and down to see what was working. Spinning less can lead to early waiting and more context switching; spinning more can decrease latency but may use up some CPU time unnecessarily. It also depends on the situation; for SemaphoreSlim, for example, I had to double the spin iterations because the waiting there is a lot more expensive. The likelihood of the spin being successful and the length of the eventual wait matter too, but those are not accounted for here.
I don't think including the number of processors (N) works well. Multiplying by N increases spinning on each thread by N, so total spinning across N threads is increased by N^2. When there are more processors contending on a resource, it may even be better to spin less and wait sooner to reduce contention, since with more processors something like a mutex naturally has the potential for more contention.
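To make the scaling concrete, here is the arithmetic behind that point with illustrative numbers only:

```cs
using System;

static class SpinScalingExample
{
    static void Main()
    {
        const int baseSpin = 35; // illustrative per-thread base spin count
        foreach (int n in new[] { 2, 4, 8, 16, 32 })
        {
            long perThread = (long)baseSpin * n; // spin count multiplied by proc count
            long total = perThread * n;          // summed across N contending threads: baseSpin * N^2
            Console.WriteLine($"N={n}: per-thread={perThread}, total={total}");
        }
    }
}
```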
// usually better for that.
//
int n = RuntimeThread.OptimalMaxSpinWaitsPerSpinIteration;
if (_count <= 30 && (1 << _count) < n)
30

It would be nice to comment on how we chose this number.
I'll add a reference to Thread::InitializeYieldProcessorNormalized that describes and calculates it
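For readers following along, here is a sketch of the capped doubling the quoted condition implements, under assumed surrounding code (not the exact source). The `count <= 30` guard keeps `1 << count` a positive 32-bit value until the normalized cap takes over; `optimalMaxSpinWaits` stands in for RuntimeThread.OptimalMaxSpinWaitsPerSpinIteration.

```cs
static class SpinBackoffSketch
{
    // Spin 2^count iterations while that stays below the measured per-iteration
    // optimum; otherwise spin the capped, normalized amount.
    public static int SpinIterations(int count, int optimalMaxSpinWaits)
    {
        return (count <= 30 && (1 << count) < optimalMaxSpinWaits)
            ? (1 << count)
            : optimalMaxSpinWaits;
    }
}
```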
{
    get
    {
        if (s_optimalMaxSpinWaitsPerSpinIteration != 0)
s_optimalMaxSpinWaitsPerSpinIteration

Looks like this one can be converted to a readonly field initialized with GetOptimalMaxSpinWaitsPerSpinIterationInternal(), so we can avoid checking for the 0 value.
I didn't want to do that since the first call would trigger the measurement, which takes about 10 ms. Static construction of RuntimeThread probably happens during startup for most apps.
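A minimal sketch of the lazy pattern being described, using the GetOptimalMaxSpinWaitsPerSpinIterationInternal() name mentioned above; simplified, not the exact RuntimeThread source:

```cs
// Simplified sketch of the lazy initialization described above; the ~10 ms
// measurement runs on first use instead of in a static/readonly initializer.
private static int s_optimalMaxSpinWaitsPerSpinIteration;

internal static int OptimalMaxSpinWaitsPerSpinIteration
{
    get
    {
        if (s_optimalMaxSpinWaitsPerSpinIteration != 0)
        {
            return s_optimalMaxSpinWaitsPerSpinIteration;
        }

        // Racing first calls just repeat the measurement; the last writer wins.
        s_optimalMaxSpinWaitsPerSpinIteration = GetOptimalMaxSpinWaitsPerSpinIterationInternal();
        return s_optimalMaxSpinWaitsPerSpinIteration;
    }
}
```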
}

-    return IsCompleted;
+    return false;
return false;

Is it possible that, between exiting the loop and executing the return, the task can get into the completed state? I'm asking to know whether we should keep returning IsCompleted.
Functionally it doesn't make any difference; the caller will do the right thing. Previously it made sense to check IsCompleted before returning because the loop could stop immediately after a wait. But it was also redundant to check IsCompleted first in the loop because it was already checked immediately before the loop. So I changed the loop to wait first and check later; now the loop exits right after checking IsCompleted, and it would be redundant to check it again before returning.
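A simplified before/after sketch of the loop shape being described (illustrative only, not the exact Task code):

```cs
using System;
using System.Threading;

static class SpinThenReturnSketch
{
    // Before: check first, then spin; a final check is needed because the loop
    // can exit right after a spin/wait.
    static bool CheckThenSpin(Func<bool> isCompleted, int spins)
    {
        var sw = new SpinWait();
        for (int i = 0; i < spins; i++)
        {
            if (isCompleted()) return true;
            sw.SpinOnce();
        }
        return isCompleted(); // may have completed during the last SpinOnce
    }

    // After: spin first, then check; the loop only exits immediately after a
    // check, so re-checking before returning would be redundant.
    static bool SpinThenCheck(Func<bool> isCompleted, int spins)
    {
        var sw = new SpinWait();
        for (int i = 0; i < spins; i++)
        {
            sw.SpinOnce();
            if (isCompleted()) return true;
        }
        return false;
    }
}
```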
In #13670, by mistake I made the spin loop infinite; that is now fixed. As a result, the numbers I had provided in that PR for SemaphoreSlim were skewed, and fixing it caused the throughput to get even lower. To compensate, I have found and fixed one culprit for the low throughput problem:
- Every release wakes up a waiter. Effectively, when there is a thread acquiring and releasing the semaphore, waiters don't get to remain in a wait state.
- Added a field to keep track of how many waiters were pulsed to wake but have not yet woken, and took that into account in Release() to not wake up more waiters than necessary.
- Retuned and increased the number of spin iterations. The total spin delay is still less than before the above PR.
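A heavily simplified illustration of the accounting described above, written as a Monitor-based sketch; this is not SemaphoreSlim's actual code, just the shape of the idea.

```cs
using System.Threading;

internal sealed class PulseAccountingSemaphoreSketch
{
    private readonly object _lock = new object();
    private int _currentCount;               // available semaphore count
    private int _waitCount;                  // threads currently in Wait()
    private int _countOfWaitersPulsedToWake; // pulsed to wake but not yet satisfied

    public void Wait()
    {
        lock (_lock)
        {
            _waitCount++;
            while (_currentCount == 0)
            {
                Monitor.Wait(_lock);
            }
            // This acquisition consumes one of the wakeups Release accounted for.
            if (_countOfWaitersPulsedToWake > 0)
            {
                _countOfWaitersPulsedToWake--;
            }
            _waitCount--;
            _currentCount--;
        }
    }

    public void Release()
    {
        lock (_lock)
        {
            _currentCount++;
            // Only pulse when there are waiters beyond those already signaled to
            // wake; otherwise a thread that keeps acquiring and releasing would
            // keep kicking waiters out of their wait state for nothing.
            if (_waitCount > _countOfWaitersPulsedToWake)
            {
                _countOfWaitersPulsedToWake++;
                Monitor.Pulse(_lock);
            }
        }
    }
}
```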
Closes https://github.com/dotnet/coreclr/issues/5928

Replaced UnfairSemaphore with a new implementation in CLRLifoSemaphore.
- UnfairSemaphore had some benefits:
  - It tracked the number of spinners and avoided waking up waiters as long as the signal count could be satisfied by spinners
  - Since spinners get priority over waiters, that's the main "unfair" part of it that allows hot threads to remain hot and cold threads to remain cold. However, waiters are still released in FIFO order.
  - Spinning helps with throughput when incoming work is bursty
- All of the above benefits were retained in CLRLifoSemaphore and some were improved:
  - Similarly to UnfairSemaphore, the number of spinners is tracked, and spinners are preferred, to avoid waking up waiters
  - For waiting, on Windows, an I/O completion port is used since it releases waiters in LIFO order. For Unix, added a prioritized wait function to the PAL to register waiters in reverse order for LIFO release behavior. This allows cold waiters to time out more easily since they will be used less frequently.
  - Similarly to SemaphoreSlim, the number of waiters that were signaled to wake but have not yet woken is tracked to help avoid waking up an excessive number of waiters
  - Added some YieldProcessorNormalized() calls to the spin loop. This avoids thrashing on Sleep(0) by adding a delay to the spin loop to allow it to be more effective when there are no threads to switch to, or the only other threads to switch to are other similar spinners.
  - Removed the processor count multiplier on the max spin count and retuned the default max spin count. The processor count multiplier was causing excessive CPU usage on machines with many processors.

Perf results

For the test case in https://github.com/dotnet/coreclr/issues/5928, CPU time spent in UnfairSemaphore::Wait was halved. CPU time % spent in UnfairSemaphore::Wait relative to time spent in WorkerThreadStart reduced from about 88% to 78%.

Updated spin perf code here: dotnet#13670
- NPc = (N * proc count) threads
- MPcWi = (M * proc count) work items
- BurstWorkThroughput queues that many work items in a burst, then releases the thread pool threads to process all of them, and once all are processed, repeats
- SustainedWorkThroughput has work items queue another of itself with some initial number of work items such that the work item count never reaches zero

```
Spin                                           Left score       Right score      ∆ Score %
--------------------------------------------   --------------   --------------   ---------
ThreadPoolBurstWorkThroughput 1Pc 000.25PcWi    276.10 ±1.09%    268.90 ±1.36%      -2.61%
ThreadPoolBurstWorkThroughput 1Pc 000.50PcWi    362.63 ±0.47%    388.82 ±0.33%       7.22%
ThreadPoolBurstWorkThroughput 1Pc 001.00PcWi    498.33 ±0.32%    797.01 ±0.29%      59.94%
ThreadPoolBurstWorkThroughput 1Pc 004.00PcWi   1222.52 ±0.42%   1348.78 ±0.47%      10.33%
ThreadPoolBurstWorkThroughput 1Pc 016.00PcWi   1672.72 ±0.48%   1724.06 ±0.47%       3.07%
ThreadPoolBurstWorkThroughput 1Pc 064.00PcWi   1853.94 ±0.25%   1868.36 ±0.45%       0.78%
ThreadPoolBurstWorkThroughput 1Pc 256.00PcWi   1849.30 ±0.24%   1902.58 ±0.48%       2.88%
ThreadPoolSustainedWorkThroughput 1Pc          1495.62 ±0.78%   1505.89 ±0.20%       0.69%
--------------------------------------------   --------------   --------------   ---------
Total                                           922.22 ±0.51%   1004.59 ±0.51%       8.93%
```

Numbers on Linux were similar, with a slightly different spread and no regressions. I also tried the plaintext benchmark from https://github.com/aspnet/benchmarks on Windows (couldn't get it to build on Linux at the time). No noticeable change to throughput or latency, and the ~2% of CPU time previously spent in UnfairSemaphore::Wait decreased a little, to ~0.5% in CLRLifoSemaphore::Wait.
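Since YieldProcessorNormalized() comes up above and in the PR title, here is a managed sketch of the normalization idea: measure once how fast spin-wait iterations run on the current CPU, then scale spin counts so a normalized spin costs roughly constant wall time across processors with very different pause latencies. This is an illustration under assumed details, not the native Thread::InitializeYieldProcessorNormalized code.

```cs
using System;
using System.Diagnostics;
using System.Threading;

internal static class YieldNormalizationSketch
{
    private static double s_spinWaitsPerNanosecond = 1; // fallback before Initialize()

    public static void Initialize()
    {
        const int MeasureCount = 1_000_000; // assumption: large enough to average out noise
        var sw = Stopwatch.StartNew();
        Thread.SpinWait(MeasureCount);
        sw.Stop();

        double elapsedNs = sw.Elapsed.TotalMilliseconds * 1_000_000;
        if (elapsedNs > 0)
        {
            s_spinWaitsPerNanosecond = MeasureCount / elapsedNs;
        }
    }

    // Spin for approximately 'nanoseconds' of wall time regardless of how long
    // an individual pause takes on the current CPU.
    public static void NormalizedSpin(int nanoseconds)
    {
        int count = Math.Max(1, (int)(nanoseconds * s_spinWaitsPerNanosecond));
        Thread.SpinWait(count);
    }
}
```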
- Removed asm helpers on Windows and used portable C++ helpers instead
- Rearranged fast path code to improve them a bit and match the asm more closely

Perf:
- The asm helpers are a bit faster. The code generated for the portable helpers is almost the same now; the remaining differences are:
  - There were some layout issues where hot paths were in the wrong place and return paths were not cloned. Instrumenting some of the tests below with PGO on x64 resolved all of the layout issues. I couldn't get PGO instrumentation to work on x86 but I imagine it would be the same there.
  - Register usage
    - x64: All of the Enter functions are using one or two (TryEnter is using two) callee-saved registers for no apparent reason, forcing them to be saved and restored. r10 and r11 seem to be available but they're not being used.
    - x86: Similarly to x64, the compiled functions are pushing and popping 2-3 additional registers in the hottest fast paths.
    - I believe this is the main remaining gap and PGO is not helping with this
- On Linux, perf is >= before for the most part
- Perf tests used for below are updated in PR dotnet#13670

My guess is that these regressions are small and unlikely to materialize into real-world regressions. It would simplify and ease maintenance a bit to remove the asm, but since it looks like the register allocation issues would not be resolved easily, I'm not sure if we want to remove the asm code at this time. @jkotas and @vancem, thoughts?

Numbers (no PGO):

Windows x64
```
Spin                                              Left score        Right score       ∆ Score %
------------------------------------------------  ----------------  ----------------  ---------
MonitorEnterExitLatency 2T                          800.56 ±0.33%     821.97 ±0.30%       2.67%
MonitorEnterExitLatency 4T                         1533.25 ±0.34%    1553.82 ±0.13%       1.34%
MonitorEnterExitLatency 7T                         1676.14 ±0.26%    1678.14 ±0.18%       0.12%
MonitorEnterExitThroughput Delay 1T                5174.77 ±0.25%    5125.56 ±0.27%      -0.95%
MonitorEnterExitThroughput Delay 2T                4982.38 ±0.22%    4937.79 ±0.19%      -0.90%
MonitorEnterExitThroughput Delay 4T                4720.41 ±0.37%    4694.09 ±0.24%      -0.56%
MonitorEnterExitThroughput Delay 7T                3741.20 ±0.33%    3778.06 ±0.20%       0.99%
MonitorEnterExitThroughput_AwareLock 1T           63445.04 ±0.20%   61540.28 ±0.23%      -3.00%
MonitorEnterExitThroughput_ThinLock 1T            59720.83 ±0.20%   59754.62 ±0.12%       0.06%
MonitorReliableEnterExitLatency 2T                  809.31 ±0.23%     809.58 ±0.41%       0.03%
MonitorReliableEnterExitLatency 4T                 1569.47 ±0.45%    1577.43 ±0.71%       0.51%
MonitorReliableEnterExitLatency 7T                 1681.65 ±0.25%    1678.01 ±0.20%      -0.22%
MonitorReliableEnterExitThroughput Delay 1T        4956.40 ±0.41%    4957.46 ±0.24%       0.02%
MonitorReliableEnterExitThroughput Delay 2T        4794.52 ±0.18%    4756.23 ±0.25%      -0.80%
MonitorReliableEnterExitThroughput Delay 4T        4560.22 ±0.25%    4522.03 ±0.35%      -0.84%
MonitorReliableEnterExitThroughput Delay 7T        3902.19 ±0.55%    3875.81 ±0.13%      -0.68%
MonitorReliableEnterExitThroughput_AwareLock 1T   61944.11 ±0.20%   58083.95 ±0.08%      -6.23%
MonitorReliableEnterExitThroughput_ThinLock 1T    59632.31 ±0.25%   58972.48 ±0.07%      -1.11%
MonitorTryEnterExitThroughput_AwareLock 1T        62345.13 ±0.14%   57159.99 ±0.14%      -8.32%
MonitorTryEnterExitThroughput_ThinLock 1T         59725.76 ±0.15%   58050.35 ±0.16%      -2.81%
------------------------------------------------  ----------------  ----------------  ---------
Total                                              6795.49 ±0.28%    6723.21 ±0.23%      -1.06%
```

Windows x86
```
Spin                                              Left score        Right score       ∆ Score %
------------------------------------------------  ----------------  ----------------  ---------
MonitorEnterExitLatency 2T                          958.97 ±0.37%     987.28 ±0.32%       2.95%
MonitorEnterExitLatency 4T                         1675.18 ±0.41%    1704.64 ±0.08%       1.76%
MonitorEnterExitLatency 7T                         1825.49 ±0.09%    1769.50 ±0.12%      -3.07%
MonitorEnterExitThroughput Delay 1T                5083.01 ±0.27%    5047.10 ±0.37%      -0.71%
MonitorEnterExitThroughput Delay 2T                4854.54 ±0.13%    4825.31 ±0.14%      -0.60%
MonitorEnterExitThroughput Delay 4T                4628.89 ±0.17%    4579.92 ±0.56%      -1.06%
MonitorEnterExitThroughput Delay 7T                4125.52 ±0.48%    4096.78 ±0.20%      -0.70%
MonitorEnterExitThroughput_AwareLock 1T           61841.28 ±0.13%   57429.31 ±0.44%      -7.13%
MonitorEnterExitThroughput_ThinLock 1T            59746.69 ±0.19%   57971.43 ±0.10%      -2.97%
MonitorReliableEnterExitLatency 2T                  983.26 ±0.22%     998.25 ±0.33%       1.52%
MonitorReliableEnterExitLatency 4T                 1758.10 ±0.14%    1723.63 ±0.19%      -1.96%
MonitorReliableEnterExitLatency 7T                 1832.24 ±0.08%    1776.61 ±0.10%      -3.04%
MonitorReliableEnterExitThroughput Delay 1T        5023.19 ±0.05%    4980.49 ±0.08%      -0.85%
MonitorReliableEnterExitThroughput Delay 2T        4846.04 ±0.03%    4792.58 ±0.11%      -1.10%
MonitorReliableEnterExitThroughput Delay 4T        4608.14 ±0.09%    4574.90 ±0.06%      -0.72%
MonitorReliableEnterExitThroughput Delay 7T        4123.20 ±0.10%    4075.92 ±0.11%      -1.15%
MonitorReliableEnterExitThroughput_AwareLock 1T   57951.11 ±0.11%   57006.12 ±0.21%      -1.63%
MonitorReliableEnterExitThroughput_ThinLock 1T    58006.06 ±0.18%   58018.28 ±0.07%       0.02%
MonitorTryEnterExitThroughput_AwareLock 1T        60701.63 ±0.04%   53374.77 ±0.15%     -12.07%
MonitorTryEnterExitThroughput_ThinLock 1T         58169.82 ±0.05%   56023.58 ±0.69%      -3.69%
------------------------------------------------  ----------------  ----------------  ---------
Total                                              7037.46 ±0.17%    6906.42 ±0.22%      -1.86%
```

Linux x64
```
Spin repeater                                     Left score        Right score       ∆ Score %
-----------------------------------------------   ----------------  ----------------  ---------
MonitorEnterExitLatency 2T                         3755.92 ±1.51%    3802.80 ±0.62%       1.25%
MonitorEnterExitLatency 4T                         3448.14 ±1.69%    3493.84 ±1.58%       1.33%
MonitorEnterExitLatency 7T                         2593.97 ±0.13%    2655.21 ±0.15%       2.36%
MonitorEnterExitThroughput Delay 1T                4854.52 ±0.12%    4873.08 ±0.11%       0.38%
MonitorEnterExitThroughput Delay 2T                4659.19 ±0.85%    4695.61 ±0.38%       0.78%
MonitorEnterExitThroughput Delay 4T                4163.01 ±1.46%    4190.94 ±1.37%       0.67%
MonitorEnterExitThroughput Delay 7T                3012.69 ±0.45%    3123.75 ±0.32%       3.69%
MonitorEnterExitThroughput_AwareLock 1T           56665.09 ±0.16%   58524.86 ±0.24%       3.28%
MonitorEnterExitThroughput_ThinLock 1T            57476.36 ±0.68%   57573.08 ±0.61%       0.17%
MonitorReliableEnterExitLatency 2T                 3952.35 ±0.45%    3937.80 ±0.49%      -0.37%
MonitorReliableEnterExitLatency 4T                 3001.75 ±1.02%    3008.55 ±0.76%       0.23%
MonitorReliableEnterExitLatency 7T                 2456.20 ±0.65%    2479.78 ±0.09%       0.96%
MonitorReliableEnterExitThroughput Delay 1T        4907.10 ±0.85%    4940.83 ±0.23%       0.69%
MonitorReliableEnterExitThroughput Delay 2T        4750.81 ±0.62%    4725.81 ±0.87%      -0.53%
MonitorReliableEnterExitThroughput Delay 4T        4329.93 ±1.18%    4360.67 ±1.04%       0.71%
MonitorReliableEnterExitThroughput Delay 7T        3180.52 ±0.27%    3255.88 ±0.51%       2.37%
MonitorReliableEnterExitThroughput_AwareLock 1T   54559.89 ±0.09%   55785.74 ±0.20%       2.25%
MonitorReliableEnterExitThroughput_ThinLock 1T    55936.06 ±0.36%   55519.74 ±0.80%      -0.74%
MonitorTryEnterExitThroughput_AwareLock 1T        52694.96 ±0.18%   54282.77 ±0.12%       3.01%
MonitorTryEnterExitThroughput_ThinLock 1T         54942.18 ±0.24%   55031.84 ±0.38%       0.16%
-----------------------------------------------   ----------------  ----------------  ---------
Total                                              8326.45 ±0.65%    8420.07 ±0.54%       1.12%
```
Part of fix for https://github.com/dotnet/coreclr/issues/13388
Normalized equivalent of YieldProcessor
Thread.SpinWait divide count by 7 experiment
Spin tuning