Add normalized equivalent of YieldProcessor, retune some spin loops #13670
Conversation
Numbers from Thread.SpinWait divide count by 7 experiment: Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):
Core i7-6700 (Skylake, 4-core, 8-thread):
Numbers from this PR: Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):
Core i7-6700 (Skylake, 4-core, 8-thread):
/// when the resource becomes available.
/// </summary>
internal static readonly int SpinCountforSpinBeforeWait = PlatformHelper.IsSingleProcessor ? 1 : 35;
internal const int Sleep1ThresholdForSpinBeforeWait = 40; // should be greater than MaxSpinCountBeforeWait
What is MaxSpinCountBeforeWait?
Oops renamed that one, will fix
// (_count - YieldThreshold) % 2 == 0: The purpose of this check is to interleave Thread.Yield/Sleep(0) with
// Thread.SpinWait. Otherwise, the following issues occur:
// - When there are no threads to switch to, Yield and Sleep(0) become no-op and it turns the spin loop into a
//   busy -spin that may quickly reach the max spin count and cause the thread to enter a wait state, or may
Nit: extra space in "busy -spin"
// contention), they may switch between one another, delaying work that can make progress.
if ((
    _count >= YieldThreshold &&
    (_count >= sleep1Threshold || (_count - YieldThreshold) % 2 == 0)
Nit: the formatting here reads strangely to me
It's formatted similarly to:
if (a ||
    b)

where

a ==
    (
        c &&
        d
    )
This is how I typically format multi-line expressions, trying to align parentheses and putting each type of expression (&& or ||) separately, one condition per line unless the whole expression fits on one line. What would you suggest instead? I can separate parts of it into locals if you prefer.
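For concreteness, here is roughly how that convention looks when applied to the condition in the quoted diff. This is a formatting illustration only; `isSingleProcessor` is a stand-in for the second operand of the outer ||, which is not shown in the quoted hunk.

```cs
// Formatting illustration only, using the names from the quoted diff.
// 'isSingleProcessor' is a stand-in for the outer ||'s second operand.
static bool ShouldYieldOrSleep(int _count, int YieldThreshold, int sleep1Threshold, bool isSingleProcessor)
{
    return
        (
            _count >= YieldThreshold &&
            (_count >= sleep1Threshold || (_count - YieldThreshold) % 2 == 0)
        ) ||
        isSingleProcessor;
}
```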
Thanks, @kouvel. Do you have any throughput numbers on the thread pool with this change?
The only use of Thread.SpinWait I found in the thread pool is in RegisteredWaitHandleSafe.Unregister, which I don't think is interesting. I have not measured the perf for Task.SpinWait; I can do that if you would like.
Code used for Spin perf:
Code used for ReaderWriterLockSlim perf:

```cs
// 'rw' is a ReaderWriterLockSlim, and readerThreadCount/writerThreadCount are
// supplied by the surrounding test harness (not shown).
var sw = new Stopwatch();
var scores = new double[16];
var startThreads = new ManualResetEvent(false);
bool stop = false;
var counts = new int[64]; // reads tallied at [16], writes at [32]; indices spread out, presumably to keep the counters on separate cache lines
var readerThreads = new Thread[readerThreadCount];
ThreadStart readThreadStart =
() =>
{
startThreads.WaitOne();
while (!stop)
{
rw.EnterReadLock();
rw.ExitReadLock();
Interlocked.Increment(ref counts[16]);
}
};
for (int i = 0; i < readerThreadCount; ++i)
{
readerThreads[i] = new Thread(readThreadStart);
readerThreads[i].IsBackground = true;
readerThreads[i].Start();
}
var writeLockAcquireAndReleasedInnerIterationCountTimes = new AutoResetEvent(false);
var writerThreads = new Thread[writerThreadCount];
ThreadStart writeThreadStart =
() =>
{
startThreads.WaitOne();
while (!stop)
{
rw.EnterWriteLock();
rw.ExitWriteLock();
Interlocked.Increment(ref counts[32]);
}
};
for (int i = 0; i < writerThreadCount; ++i)
{
writerThreads[i] = new Thread(writeThreadStart);
writerThreads[i].IsBackground = true;
writerThreads[i].Start();
}
startThreads.Set();
// Warmup
Thread.Sleep(4000);
// Actual run
for(int i = 0; i < scores.Length; ++i)
{
counts[16] = 0;
counts[32] = 0;
Interlocked.MemoryBarrier();
sw.Restart();
Thread.Sleep(500);
sw.Stop();
int readCount = counts[16];
int writeCount = counts[32];
double elapsedMs = sw.Elapsed.TotalMilliseconds;
scores[i] =
new double[]
{
Math.Max(1, (readCount + writeCount) / elapsedMs),
Math.Max(1, writeCount / elapsedMs)
}.GeometricMean(readerThreadCount, writerThreadCount);
}
return scores;
```
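The GeometricMean extension used above isn't included in the snippet. A minimal stand-in, assuming it computes a geometric mean of the per-category rates weighted by the reader and writer thread counts, could look like this (hypothetical helper, not necessarily the one used for the numbers above):

```cs
using System;

internal static class ScoreExtensions
{
    // Hypothetical stand-in for the GeometricMean extension used above:
    // geometric mean of 'values', where weights[i] is the weight of values[i].
    public static double GeometricMean(this double[] values, params int[] weights)
    {
        double weightedLogSum = 0, totalWeight = 0;
        for (int i = 0; i < values.Length; ++i)
        {
            weightedLogSum += weights[i] * Math.Log(values[i]);
            totalWeight += weights[i];
        }
        return Math.Exp(weightedLogSum / totalWeight);
    }
}
```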
ThreadPool's global queue is a ConcurrentQueue, and CQ uses System.Threading.SpinWait when there is contention on various operations, including dequeues.
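For reference, the caller-side pattern for System.Threading.SpinWait (the struct whose spin behavior this PR retunes) generally has the shape below; this is a generic sketch of a contended retry loop, not ConcurrentQueue's actual code.

```cs
using System;
using System.Threading;

static class SpinUntilSketch
{
    // General shape of a contended retry loop over SpinWait. SpinOnce()
    // escalates from Thread.SpinWait to Yield/Sleep(0)/Sleep(1) as the
    // iteration count grows, which is the behavior retuned in this PR.
    public static void SpinUntil(Func<bool> condition)
    {
        var spinner = new SpinWait();
        while (!condition())
        {
            spinner.SpinOnce();
        }
    }
}
```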
Ah ok, I included ConcurrentQueue, I'll add a test for thread pool as well |
Updated code above with the added thread pool throughput test. Looks like there's no change: Xeon E5-1650 (Sandy Bridge, 6-core, 12-thread):
Core i7-6700 (Skylake, 4-core, 8-thread):
@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness
@dotnet-bot test Windows_NT x86 full_opt legacy_backend CoreCLR Perf Tests Correctness
@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness
@dotnet-bot test Windows_NT x86 full_opt ryujit CoreCLR Perf Tests Correctness
/// A suggested number of spin iterations before doing a proper wait, such as waiting on an event that becomes signaled
/// when the resource becomes available.
/// </summary>
internal static readonly int SpinCountforSpinBeforeWait = PlatformHelper.IsSingleProcessor ? 1 : 35;
35

Did we get this number from experimenting with different scenarios? Just curious how we came up with it. And does the number of processors not matter?
I experimented with ManualResetEventSlim to get an initial number, applied the same number to other similar situations, and then tweaked it up and down to see what was working. Spinning less can lead to early waiting and more context switching; spinning more can decrease latency but may use up some CPU time unnecessarily. It also depends on the situation; for SemaphoreSlim, for example, I had to double the spin iterations because the waiting there is a lot more expensive. The likelihood of the spin being successful and the length of the eventual wait matter too, but those are not accounted for here.
I don't think including the number of processors (N) works well. Multiplying by N increases spinning on each thread by N, so total spinning across N threads is increased by N^2. When there are more processors contending on a resource, it may even be better to spin less and wait sooner to reduce contention, since with more processors something like a mutex naturally has the potential for more contention.
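To make the scaling concrete, here is the arithmetic behind that point with illustrative numbers only:

```cs
using System;

static class SpinScalingExample
{
    static void Main()
    {
        const int baseSpin = 35; // illustrative per-thread base spin count
        foreach (int n in new[] { 2, 4, 8, 16, 32 })
        {
            long perThread = (long)baseSpin * n; // spin count multiplied by proc count
            long total = perThread * n;          // summed across N contending threads: baseSpin * N^2
            Console.WriteLine($"N={n}: per-thread={perThread}, total={total}");
        }
    }
}
```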
// usually better for that.
//
int n = RuntimeThread.OptimalMaxSpinWaitsPerSpinIteration;
if (_count <= 30 && (1 << _count) < n)
30

It would be nice to comment on how we chose this number.
I'll add a reference to Thread::InitializeYieldProcessorNormalized that describes and calculates it
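For readers following along, here is a sketch of the capped doubling the quoted condition implements, under assumed surrounding code (not the exact source). The `count <= 30` guard keeps `1 << count` a positive 32-bit value until the normalized cap takes over; `optimalMaxSpinWaits` stands in for RuntimeThread.OptimalMaxSpinWaitsPerSpinIteration.

```cs
static class SpinBackoffSketch
{
    // Spin 2^count iterations while that stays below the measured per-iteration
    // optimum; otherwise spin the capped, normalized amount.
    public static int SpinIterations(int count, int optimalMaxSpinWaits)
    {
        return (count <= 30 && (1 << count) < optimalMaxSpinWaits)
            ? (1 << count)
            : optimalMaxSpinWaits;
    }
}
```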
{
    get
    {
        if (s_optimalMaxSpinWaitsPerSpinIteration != 0)
s_optimalMaxSpinWaitsPerSpinIteration

Looks like this one can be converted to a readonly field initialized with GetOptimalMaxSpinWaitsPerSpinIterationInternal(), so we can avoid checking for the 0 value.
I didn't want to do that since the first call would trigger the measurement, which takes about 10 ms. Static construction of RuntimeThread probably happens during startup for most apps.
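A minimal sketch of the lazy pattern being described, using the GetOptimalMaxSpinWaitsPerSpinIterationInternal() name mentioned above; simplified, not the exact RuntimeThread source:

```cs
// Simplified sketch of the lazy initialization described above; the ~10 ms
// measurement runs on first use instead of in a static/readonly initializer.
private static int s_optimalMaxSpinWaitsPerSpinIteration;

internal static int OptimalMaxSpinWaitsPerSpinIteration
{
    get
    {
        if (s_optimalMaxSpinWaitsPerSpinIteration != 0)
        {
            return s_optimalMaxSpinWaitsPerSpinIteration;
        }

        // Racing first calls just repeat the measurement; the last writer wins.
        s_optimalMaxSpinWaitsPerSpinIteration = GetOptimalMaxSpinWaitsPerSpinIterationInternal();
        return s_optimalMaxSpinWaitsPerSpinIteration;
    }
}
```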
}

-    return IsCompleted;
+    return false;
return false;

Is it possible that, between exiting the loop and executing the return, the task can get into the completed state? I'm asking to know whether we should keep returning IsCompleted.
Functionally it doesn't make any difference; the caller will do the right thing. Previously it made sense to check IsCompleted before returning because the loop could stop immediately after a wait. But it was also redundant to check IsCompleted first in the loop because it was already checked immediately before the loop. So I changed the loop to wait first and check later; now the loop exits right after checking IsCompleted, and it would be redundant to check it again before returning.
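A simplified before/after sketch of the loop shape being described (illustrative only, not the exact Task code):

```cs
using System;
using System.Threading;

static class SpinThenReturnSketch
{
    // Before: check first, then spin; a final check is needed because the loop
    // can exit right after a spin/wait.
    static bool CheckThenSpin(Func<bool> isCompleted, int spins)
    {
        var sw = new SpinWait();
        for (int i = 0; i < spins; i++)
        {
            if (isCompleted()) return true;
            sw.SpinOnce();
        }
        return isCompleted(); // may have completed during the last SpinOnce
    }

    // After: spin first, then check; the loop only exits immediately after a
    // check, so re-checking before returning would be redundant.
    static bool SpinThenCheck(Func<bool> isCompleted, int spins)
    {
        var sw = new SpinWait();
        for (int i = 0; i < spins; i++)
        {
            sw.SpinOnce();
            if (isCompleted()) return true;
        }
        return false;
    }
}
```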
In #13670, by mistake I made the spin loop infinite; that is now fixed. As a result, the numbers I had provided in that PR for SemaphoreSlim were skewed, and fixing it caused the throughput to get even lower. To compensate, I have found and fixed one culprit for the low throughput problem:
- Every release wakes up a waiter. Effectively, when there is a thread acquiring and releasing the semaphore, waiters don't get to remain in a wait state.
- Added a field to keep track of how many waiters were pulsed to wake but have not yet woken, and took that into account in Release() to not wake up more waiters than necessary.
- Retuned and increased the number of spin iterations. The total spin delay is still less than before the above PR.
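A heavily simplified illustration of the accounting described above, written as a Monitor-based sketch; this is not SemaphoreSlim's actual code, just the shape of the idea.

```cs
using System.Threading;

internal sealed class PulseAccountingSemaphoreSketch
{
    private readonly object _lock = new object();
    private int _currentCount;               // available semaphore count
    private int _waitCount;                  // threads currently in Wait()
    private int _countOfWaitersPulsedToWake; // pulsed to wake but not yet satisfied

    public void Wait()
    {
        lock (_lock)
        {
            _waitCount++;
            while (_currentCount == 0)
            {
                Monitor.Wait(_lock);
            }
            // This acquisition consumes one of the wakeups Release accounted for.
            if (_countOfWaitersPulsedToWake > 0)
            {
                _countOfWaitersPulsedToWake--;
            }
            _waitCount--;
            _currentCount--;
        }
    }

    public void Release()
    {
        lock (_lock)
        {
            _currentCount++;
            // Only pulse when there are waiters beyond those already signaled to
            // wake; otherwise a thread that keeps acquiring and releasing would
            // keep kicking waiters out of their wait state for nothing.
            if (_waitCount > _countOfWaitersPulsedToWake)
            {
                _countOfWaitersPulsedToWake++;
                Monitor.Pulse(_lock);
            }
        }
    }
}
```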
Closes https://github.com/dotnet/coreclr/issues/5928

Replaced UnfairSemaphore with a new implementation in CLRLifoSemaphore.
- UnfairSemaphore had some benefits:
  - It tracked the number of spinners and avoided waking up waiters as long as the signal count could be satisfied by spinners
  - Since spinners get priority over waiters, that's the main "unfair" part of it that allows hot threads to remain hot and cold threads to remain cold. However, waiters are still released in FIFO order.
  - Spinning helps with throughput when incoming work is bursty
- All of the above benefits were retained in CLRLifoSemaphore and some were improved:
  - Similarly to UnfairSemaphore, the number of spinners is tracked, and spinners are preferred, to avoid waking up waiters
  - For waiting, on Windows, an I/O completion port is used since it releases waiters in LIFO order. For Unix, added a prioritized wait function to the PAL to register waiters in reverse order for LIFO release behavior. This allows cold waiters to time out more easily since they will be used less frequently.
  - Similarly to SemaphoreSlim, the number of waiters that were signaled to wake but have not yet woken is tracked to help avoid waking up an excessive number of waiters
  - Added some YieldProcessorNormalized() calls to the spin loop. This avoids thrashing on Sleep(0) by adding a delay to the spin loop to allow it to be more effective when there are no threads to switch to, or the only other threads to switch to are other similar spinners.
  - Removed the processor count multiplier on the max spin count and retuned the default max spin count. The processor count multiplier was causing excessive CPU usage on machines with many processors.

Perf results

For the test case in https://github.com/dotnet/coreclr/issues/5928, CPU time spent in UnfairSemaphore::Wait was halved. CPU time % spent in UnfairSemaphore::Wait relative to time spent in WorkerThreadStart reduced from about 88% to 78%.

Updated spin perf code here: dotnet#13670
- NPc = (N * proc count) threads
- MPcWi = (M * proc count) work items
- BurstWorkThroughput queues that many work items in a burst, then releases the thread pool threads to process all of them, and once all are processed, repeats
- SustainedWorkThroughput has work items queue another of itself with some initial number of work items such that the work item count never reaches zero

```
Spin                                           Left score       Right score      ∆ Score %
--------------------------------------------   --------------   --------------   ---------
ThreadPoolBurstWorkThroughput 1Pc 000.25PcWi    276.10 ±1.09%    268.90 ±1.36%      -2.61%
ThreadPoolBurstWorkThroughput 1Pc 000.50PcWi    362.63 ±0.47%    388.82 ±0.33%       7.22%
ThreadPoolBurstWorkThroughput 1Pc 001.00PcWi    498.33 ±0.32%    797.01 ±0.29%      59.94%
ThreadPoolBurstWorkThroughput 1Pc 004.00PcWi   1222.52 ±0.42%   1348.78 ±0.47%      10.33%
ThreadPoolBurstWorkThroughput 1Pc 016.00PcWi   1672.72 ±0.48%   1724.06 ±0.47%       3.07%
ThreadPoolBurstWorkThroughput 1Pc 064.00PcWi   1853.94 ±0.25%   1868.36 ±0.45%       0.78%
ThreadPoolBurstWorkThroughput 1Pc 256.00PcWi   1849.30 ±0.24%   1902.58 ±0.48%       2.88%
ThreadPoolSustainedWorkThroughput 1Pc          1495.62 ±0.78%   1505.89 ±0.20%       0.69%
--------------------------------------------   --------------   --------------   ---------
Total                                           922.22 ±0.51%   1004.59 ±0.51%       8.93%
```

Numbers on Linux were similar, with a slightly different spread and no regressions. I also tried the plaintext benchmark from https://github.com/aspnet/benchmarks on Windows (couldn't get it to build on Linux at the time). No noticeable change to throughput or latency, and the ~2% of CPU time previously spent in UnfairSemaphore::Wait decreased a little, to ~0.5% in CLRLifoSemaphore::Wait.
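Since YieldProcessorNormalized() comes up above and in the PR title, here is a managed sketch of the normalization idea: measure once how fast spin-wait iterations run on the current CPU, then scale spin counts so a normalized spin costs roughly constant wall time across processors with very different pause latencies. This is an illustration under assumed details, not the native Thread::InitializeYieldProcessorNormalized code.

```cs
using System;
using System.Diagnostics;
using System.Threading;

internal static class YieldNormalizationSketch
{
    private static double s_spinWaitsPerNanosecond = 1; // fallback before Initialize()

    public static void Initialize()
    {
        const int MeasureCount = 1_000_000; // assumption: large enough to average out noise
        var sw = Stopwatch.StartNew();
        Thread.SpinWait(MeasureCount);
        sw.Stop();

        double elapsedNs = sw.Elapsed.TotalMilliseconds * 1_000_000;
        if (elapsedNs > 0)
        {
            s_spinWaitsPerNanosecond = MeasureCount / elapsedNs;
        }
    }

    // Spin for approximately 'nanoseconds' of wall time regardless of how long
    // an individual pause takes on the current CPU.
    public static void NormalizedSpin(int nanoseconds)
    {
        int count = Math.Max(1, (int)(nanoseconds * s_spinWaitsPerNanosecond));
        Thread.SpinWait(count);
    }
}
```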
- Removed asm helpers on Windows and used portable C++ helpers instead
- Rearranged fast path code to improve them a bit and match the asm more closely

Perf:
- The asm helpers are a bit faster. The code generated for the portable helpers is almost the same now; the remaining differences are:
  - There were some layout issues where hot paths were in the wrong place and return paths were not cloned. Instrumenting some of the tests below with PGO on x64 resolved all of the layout issues. I couldn't get PGO instrumentation to work on x86 but I imagine it would be the same there.
  - Register usage
    - x64: All of the Enter functions are using one or two (TryEnter is using two) callee-saved registers for no apparent reason, forcing them to be saved and restored. r10 and r11 seem to be available but they're not being used.
    - x86: Similarly to x64, the compiled functions are pushing and popping 2-3 additional registers in the hottest fast paths.
    - I believe this is the main remaining gap and PGO is not helping with this
- On Linux, perf is >= before for the most part
- Perf tests used for below are updated in PR dotnet#13670

My guess is that these regressions are small and unlikely to materialize into real-world regressions. It would simplify and ease maintenance a bit to remove the asm, but since it looks like the register allocation issues would not be resolved easily, I'm not sure if we want to remove the asm code at this time. @jkotas and @vancem, thoughts?

Numbers (no PGO):

Windows x64
```
Spin                                              Left score        Right score       ∆ Score %
------------------------------------------------  ----------------  ----------------  ---------
MonitorEnterExitLatency 2T                          800.56 ±0.33%     821.97 ±0.30%       2.67%
MonitorEnterExitLatency 4T                         1533.25 ±0.34%    1553.82 ±0.13%       1.34%
MonitorEnterExitLatency 7T                         1676.14 ±0.26%    1678.14 ±0.18%       0.12%
MonitorEnterExitThroughput Delay 1T                5174.77 ±0.25%    5125.56 ±0.27%      -0.95%
MonitorEnterExitThroughput Delay 2T                4982.38 ±0.22%    4937.79 ±0.19%      -0.90%
MonitorEnterExitThroughput Delay 4T                4720.41 ±0.37%    4694.09 ±0.24%      -0.56%
MonitorEnterExitThroughput Delay 7T                3741.20 ±0.33%    3778.06 ±0.20%       0.99%
MonitorEnterExitThroughput_AwareLock 1T           63445.04 ±0.20%   61540.28 ±0.23%      -3.00%
MonitorEnterExitThroughput_ThinLock 1T            59720.83 ±0.20%   59754.62 ±0.12%       0.06%
MonitorReliableEnterExitLatency 2T                  809.31 ±0.23%     809.58 ±0.41%       0.03%
MonitorReliableEnterExitLatency 4T                 1569.47 ±0.45%    1577.43 ±0.71%       0.51%
MonitorReliableEnterExitLatency 7T                 1681.65 ±0.25%    1678.01 ±0.20%      -0.22%
MonitorReliableEnterExitThroughput Delay 1T        4956.40 ±0.41%    4957.46 ±0.24%       0.02%
MonitorReliableEnterExitThroughput Delay 2T        4794.52 ±0.18%    4756.23 ±0.25%      -0.80%
MonitorReliableEnterExitThroughput Delay 4T        4560.22 ±0.25%    4522.03 ±0.35%      -0.84%
MonitorReliableEnterExitThroughput Delay 7T        3902.19 ±0.55%    3875.81 ±0.13%      -0.68%
MonitorReliableEnterExitThroughput_AwareLock 1T   61944.11 ±0.20%   58083.95 ±0.08%      -6.23%
MonitorReliableEnterExitThroughput_ThinLock 1T    59632.31 ±0.25%   58972.48 ±0.07%      -1.11%
MonitorTryEnterExitThroughput_AwareLock 1T        62345.13 ±0.14%   57159.99 ±0.14%      -8.32%
MonitorTryEnterExitThroughput_ThinLock 1T         59725.76 ±0.15%   58050.35 ±0.16%      -2.81%
------------------------------------------------  ----------------  ----------------  ---------
Total                                              6795.49 ±0.28%    6723.21 ±0.23%      -1.06%
```

Windows x86
```
Spin                                              Left score        Right score       ∆ Score %
------------------------------------------------  ----------------  ----------------  ---------
MonitorEnterExitLatency 2T                          958.97 ±0.37%     987.28 ±0.32%       2.95%
MonitorEnterExitLatency 4T                         1675.18 ±0.41%    1704.64 ±0.08%       1.76%
MonitorEnterExitLatency 7T                         1825.49 ±0.09%    1769.50 ±0.12%      -3.07%
MonitorEnterExitThroughput Delay 1T                5083.01 ±0.27%    5047.10 ±0.37%      -0.71%
MonitorEnterExitThroughput Delay 2T                4854.54 ±0.13%    4825.31 ±0.14%      -0.60%
MonitorEnterExitThroughput Delay 4T                4628.89 ±0.17%    4579.92 ±0.56%      -1.06%
MonitorEnterExitThroughput Delay 7T                4125.52 ±0.48%    4096.78 ±0.20%      -0.70%
MonitorEnterExitThroughput_AwareLock 1T           61841.28 ±0.13%   57429.31 ±0.44%      -7.13%
MonitorEnterExitThroughput_ThinLock 1T            59746.69 ±0.19%   57971.43 ±0.10%      -2.97%
MonitorReliableEnterExitLatency 2T                  983.26 ±0.22%     998.25 ±0.33%       1.52%
MonitorReliableEnterExitLatency 4T                 1758.10 ±0.14%    1723.63 ±0.19%      -1.96%
MonitorReliableEnterExitLatency 7T                 1832.24 ±0.08%    1776.61 ±0.10%      -3.04%
MonitorReliableEnterExitThroughput Delay 1T        5023.19 ±0.05%    4980.49 ±0.08%      -0.85%
MonitorReliableEnterExitThroughput Delay 2T        4846.04 ±0.03%    4792.58 ±0.11%      -1.10%
MonitorReliableEnterExitThroughput Delay 4T        4608.14 ±0.09%    4574.90 ±0.06%      -0.72%
MonitorReliableEnterExitThroughput Delay 7T        4123.20 ±0.10%    4075.92 ±0.11%      -1.15%
MonitorReliableEnterExitThroughput_AwareLock 1T   57951.11 ±0.11%   57006.12 ±0.21%      -1.63%
MonitorReliableEnterExitThroughput_ThinLock 1T    58006.06 ±0.18%   58018.28 ±0.07%       0.02%
MonitorTryEnterExitThroughput_AwareLock 1T        60701.63 ±0.04%   53374.77 ±0.15%     -12.07%
MonitorTryEnterExitThroughput_ThinLock 1T         58169.82 ±0.05%   56023.58 ±0.69%      -3.69%
------------------------------------------------  ----------------  ----------------  ---------
Total                                              7037.46 ±0.17%    6906.42 ±0.22%      -1.86%
```

Linux x64
```
Spin repeater                                     Left score        Right score       ∆ Score %
-----------------------------------------------   ----------------  ----------------  ---------
MonitorEnterExitLatency 2T                         3755.92 ±1.51%    3802.80 ±0.62%       1.25%
MonitorEnterExitLatency 4T                         3448.14 ±1.69%    3493.84 ±1.58%       1.33%
MonitorEnterExitLatency 7T                         2593.97 ±0.13%    2655.21 ±0.15%       2.36%
MonitorEnterExitThroughput Delay 1T                4854.52 ±0.12%    4873.08 ±0.11%       0.38%
MonitorEnterExitThroughput Delay 2T                4659.19 ±0.85%    4695.61 ±0.38%       0.78%
MonitorEnterExitThroughput Delay 4T                4163.01 ±1.46%    4190.94 ±1.37%       0.67%
MonitorEnterExitThroughput Delay 7T                3012.69 ±0.45%    3123.75 ±0.32%       3.69%
MonitorEnterExitThroughput_AwareLock 1T           56665.09 ±0.16%   58524.86 ±0.24%       3.28%
MonitorEnterExitThroughput_ThinLock 1T            57476.36 ±0.68%   57573.08 ±0.61%       0.17%
MonitorReliableEnterExitLatency 2T                 3952.35 ±0.45%    3937.80 ±0.49%      -0.37%
MonitorReliableEnterExitLatency 4T                 3001.75 ±1.02%    3008.55 ±0.76%       0.23%
MonitorReliableEnterExitLatency 7T                 2456.20 ±0.65%    2479.78 ±0.09%       0.96%
MonitorReliableEnterExitThroughput Delay 1T        4907.10 ±0.85%    4940.83 ±0.23%       0.69%
MonitorReliableEnterExitThroughput Delay 2T        4750.81 ±0.62%    4725.81 ±0.87%      -0.53%
MonitorReliableEnterExitThroughput Delay 4T        4329.93 ±1.18%    4360.67 ±1.04%       0.71%
MonitorReliableEnterExitThroughput Delay 7T        3180.52 ±0.27%    3255.88 ±0.51%       2.37%
MonitorReliableEnterExitThroughput_AwareLock 1T   54559.89 ±0.09%   55785.74 ±0.20%       2.25%
MonitorReliableEnterExitThroughput_ThinLock 1T    55936.06 ±0.36%   55519.74 ±0.80%      -0.74%
MonitorTryEnterExitThroughput_AwareLock 1T        52694.96 ±0.18%   54282.77 ±0.12%       3.01%
MonitorTryEnterExitThroughput_ThinLock 1T         54942.18 ±0.24%   55031.84 ±0.38%       0.16%
-----------------------------------------------   ----------------  ----------------  ---------
Total                                              8326.45 ±0.65%    8420.07 ±0.54%       1.12%
```
Part of fix for https://github.com/dotnet/coreclr/issues/13388
Normalized equivalent of YieldProcessor
Thread.SpinWait divide count by 7 experiment
Spin tuning