Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spin wait tuning #8744

Closed
helloguo opened this issue Aug 15, 2017 · 49 comments
Closed

Spin wait tuning #8744

helloguo opened this issue Aug 15, 2017 · 49 comments

Comments

@helloguo
Copy link

Spin-Wait can have a significant cost on power and performance if not done properly. The PAUSE instruction on Skylake changed from previous generations such as Haswell and Broadwell, from reference "Intel® 64 and IA-32 Architectures Optimization Reference Manual" section 8.4.7, "The latency of PAUSE instruction in prior generation microarchitecture is about 10 cycles, whereas on Skylake microarchitecture it has been extended to as many as 140 cycles." "As the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will suffer some performance loss." We are seeing issues with spinning on Skylake Server with TechEmpower Plaintext with excessive amount of time being spent in spin code. It’s beneficial to tune the spin count or spin cycles and we have a proposal below for fix:

For example, the original spin code form j_join::join is shown below.

                int spin_count = 4096 * g_num_processors;
                for (int j = 0; j < spin_count; j++)
                {
                    if (color != join_struct.lock_color)
                    {
                        break;
                    }
                    YieldProcessor();           // indicate to the processor that we are spinning
                }

Assume YieldProcessor() took 10 cycles when the above code was first written and tuned. We could use spin cycles instead of spin count like this:

                int spin_count = 4096 * g_num_processors;

                // Assume YieldProcessor() took 10 cycles 
                // when the above code was first written and tuned
                ptrdiff_t endTime = get_cycle_count() + spin_count * 10;

                while (get_cycle_count() < endTime)
                {
                    if (color != join_struct.lock_color)
                    {
                        break;
                    }
                    YieldProcessor();           // indicate to the processor that we are spinning
                }
@helloguo
Copy link
Author

/cc @jkotas

@jkotas
Copy link
Member

jkotas commented Aug 15, 2017

cc @vancem

@vancem
Copy link
Contributor

vancem commented Aug 15, 2017

cc @kouvel

@benaadams
Copy link
Member

@redknightlois
Copy link

redknightlois commented Aug 16, 2017

I have been tracking an unusual MAJOR regression (as in capital letters major) from 60K req/sec (running the loading and the target process at the same machine) to 3K req/sec that may be related to this. We have only seen this after upgrading the solution to CoreCLR 2.0. Under profiler the only suspect for the odd behavior is the ManualResetEventSlim. We see an inordinate amount of CPU time being assigned when that should have been very low.

image

It is fair to say that we know our current bottleneck pre 2.0 was thread state handoff, so we expected to be the same on 2.0. But, this behavior took me off guard. If you think it could be related, just let me know and we can connect with Skype to help you build an environment for investigation.

@vancem
Copy link
Contributor

vancem commented Aug 16, 2017

I have been in conversation with @kouvel on spinning, and I would like us to move to make a small modification to spin loops such that they remember the spin count that worked in the past (or more importantly the fact that it did NOT work in the past), and use that to determine how much spinning to do this time (or to bail immediately). This change was motivated to simply reduce spin overhead, but it should also help with this issue.

This does not completely fix the problem, because although you will bail quickly when you bail, the spinning you do do is 14X longer. It seems to me that what we should do is make it so that SpinWait(count) normalizes it argument (that is a count always represents a certain number of machine instruction units (cycles), so on machines where pause is large). This approach may have problems, but it seems like something that should be on the table...

@benaadams
Copy link
Member

/cc @stephentoub FYI

@redknightlois
Copy link

@vancem just point us to the private build for it and we can test it.

@helloguo
Copy link
Author

@redknightlois what's your hardware?

@redknightlois
Copy link

Skylake & Kaby Lake, probably by tomorrow we will have the same test on pre-Skylake.

@helloguo
Copy link
Author

@redknightlois Do you see the same issue on pre-Skylake platforms? If yes, it may suggest something else. If you only see the issue on Skylake, it may suggest the spin loop with long PAUSE inst needs to be tuned accordingly. And some test cases would be helpful to narrow down the problem.

@redknightlois
Copy link

redknightlois commented Aug 18, 2017

@helloguo haven't go to the office yet. Will be there in a few hours.

@redknightlois
Copy link

redknightlois commented Aug 19, 2017

@helloguo @vancem I can confirm with absolute certainty (I believe this is the very first time in my life I can say that) that what we are looking at on our side is this issue at work. We run the same thing on Skylake i7-6700K, Kaby Lake i7-7700 and Ivy Bridge i5-3470.

Results:

  • i7-6700K = 3.5K request/sec
  • i7-7700 = 3.5K request/sec
  • i5-3470 = 12K request/sec

Moreover, we can also confirm with high confidence that CoreCLR 1.1 also suffers from this on a lower degree. We were able to workaround the Wait CPU consumption manually yielding the thread, waiting until we hit the 5th round to call the primitive Wait. That diminish the CPU cost of wait from 30% to 8% in the exact same scenario and allow us to jump from 30K to 43K request/sec on i7-7700.

Hope we can fix this soon, because we cannot ship a Release Candidate with this issue there.

@helloguo
Copy link
Author

@redknightlois Thanks for your detailed numbers. Can you share how to set up and run your workload? So that I could run it against my prototype.

@kouvel
Copy link
Member

kouvel commented Aug 21, 2017

@redknightlois, thanks for the information. One other difference between the above procs could be that the i7s have hyperthreading and the i5 doesn't, and the pause count is multiplied by number of procs, so the i7s could be issuing twice as many pauses. Do you have hyperthreading enabled on those?

Probably the multiplying should be removed anyway.

Also @redknightlois, what spin counts are you using for the problematic ManualResetEventSlims? I don't think it matters but just to test similarly on my side.

@redknightlois
Copy link

@kouvel Yes they both have hyperthreading enabled and the ManualResetEventSlim is created with default parameters. https://github.com/ravendb/ravendb/blob/v4.0/src/Voron/GlobalFlushingBehavior.cs#L80

@kouvel
Copy link
Member

kouvel commented Aug 22, 2017

@helloguo, is there by any chance a way to determine from the processor the cycle count delay incurred by pause? It seems like using the cycle count could be error-prone and I'm inclined to just try reducing the quantity of pauses.

@helloguo
Copy link
Author

@kouvel Off the top of my head, we could use CPUID to check the processors' code name and then return the latency of PAUSE inst based on the processors' code name.

@kouvel @redknightlois I put a prototype here dotnet/coreclr#13527, which uses the recommended method by optimization guide.

@redknightlois
Copy link

@helloguo @kouvel Is there any way you can push to MyGet a version that I could use to link it for CoreCLR 2.0. I am not building it locally.

@helloguo
Copy link
Author

@redknightlois Can you replace path\to\coreclr\packages\microsoft.dotnet.buildtools\2.0.0-prerelease-01914-02\lib\tool-runtime\project.csproj

with

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <TargetFrameworks>netcoreapp2.0;net46</TargetFrameworks>
    <EnableDefaultItems>false</EnableDefaultItems>
    <!--<RuntimeFrameworkVersion Condition=" '$(TargetFramework)' == 'netcoreapp2.0' ">2.0.0-beta-001776-00</RuntimeFrameworkVersion>-->
    <!--<PackageTargetFallback Condition=" '$(TargetFramework)' == 'netcoreapp2.0' ">$(PackageTargetFallback);dnxcore50;portable-net45+win8</PackageTargetFallback>-->
    <OutputType>Exe</OutputType>
  </PropertyGroup>

  <ItemGroup>
    <Compile Include="Program.cs" />
  </ItemGroup>

  <ItemGroup Condition=" '$(TargetFramework)' == 'netcoreapp2.0' ">
    <PackageReference Include="Microsoft.Build.Framework" Version="15.3.0-preview-000388-01" />
    <PackageReference Include="Microsoft.Build.Runtime" Version="15.3.0-preview-000388-01" />
    <PackageReference Include="Microsoft.Build.Tasks.Core" Version="15.3.0-preview-000388-01" />
    <PackageReference Include="Microsoft.Build.Utilities.Core" Version="15.3.0-preview-000388-01" />
    <PackageReference Include="Microsoft.Build" Version="15.3.0-preview-000388-01" />
    <PackageReference Include="Microsoft.DotNet.PlatformAbstractions" Version="2.0.0-preview2-25407-01" />
    <PackageReference Include="Microsoft.Net.Compilers.netcore" Version="2.0.0-rc4" />
    <PackageReference Include="Microsoft.Net.Compilers.Targets.NetCore" Version="0.1.5-dev" />
    <PackageReference Include="Microsoft.Cci" Version="4.0.0-rc4-24217-00" />
    <PackageReference Include="System.Composition" Version="1.1.0-preview2-25405-01" />
    <PackageReference Include="Newtonsoft.Json" Version="10.0.3" />
  </ItemGroup>

  <ItemGroup Condition=" '$(TargetFramework)' == 'net46' ">
    <PackageReference Include="Microsoft.Cci" Version="4.0.0-rc4-24217-00" />
    <PackageReference Include="Microsoft.NETCore.Windows.ApiSets" Version="1.0.1" />
    <PackageReference Include="System.IO.FileSystem" Version="4.3.0" />
    <PackageReference Include="System.Diagnostics.FileVersionInfo" Version="4.3.0" />
    <PackageReference Include="System.Diagnostics.TraceSource" Version="4.3.0" />
  </ItemGroup>

</Project>

I cannot build it without this change. see https://github.com/dotnet/coreclr/issues/13025

@kouvel
Copy link
Member

kouvel commented Aug 22, 2017

I'm planning on doing two separate PRs for this, one mainly to remove the "multiply by processor count" which IMO is a low-risk easy-to-explain change, and another to tune the spin counts based on a measure of the delay. I don't think we would want to keep track of processor code names, that would be a new maintenance issue.

@kouvel
Copy link
Member

kouvel commented Aug 22, 2017

@redknightlois:

Is there any way you can push to MyGet a version that I could use to link it for CoreCLR 2.0. I am not building it locally.

I'm hoping the first PR I mentioned above can be merged soon, and maybe you can test with the daily build that contains the change.

@helloguo
Copy link
Author

one mainly to remove the "multiply by processor count" which IMO is a low-risk easy-to-explain change

I assume you are referring the spin in GC part, right? Any plan for the spin in MonitorEnter and other places?

Do you have any benchmarks that can be shared with community? It would be great if we can measure it as well :)

I don't think we would want to keep track of processor code names, that would be a new maintenance issue.

I agree. If we can find a reasonable spin cycles by measuring benchmarks, we can use the "Dynamic Pause Loop Example" https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf to get rid of processor code names.

capture-example

@redknightlois
Copy link

@kouvel I was going to ask the same thing, there has been many changes on sync primitives in CoreCLR 2.0 this should be reviewed all over the place. If I remember correctly @benaadams have done some assembler work too in between 1.1 and 2.0 (correct me if I am wrong).

@kouvel
Copy link
Member

kouvel commented Aug 23, 2017

Regarding removing the multiply by proc count, I was referring to ManualResetEventSlim and another place I found in managed code where it's multiplying the pause count per iteration by proc count, I'm planning on removing those. The multiply by proc count in GC is tuning the spin count with no back-off scheme, so it could probably be changed to use the method you suggested. For the second PR I was going to start with Thread.SpinWait, measuring the delay incurred by the pause and calibrating the delay per iteration issued based on that, and spin loops that use a back-off scheme could use the calibrated version to hopefully get somewhat equivalent behavior on different procs.

It'll have to be reviewed in a bunch of places and expanded, probably one at a time though along with testing perf for what is being changed. I don't have benchmarks yet, was just planning on writing them as I go.

@redknightlois
Copy link

redknightlois commented Aug 23, 2017

@kouvel I created a repro package with some instructions if you want to try your changes on our codebase.

You need to have Bash for Windows (or run on Linux) with wrk installed.

Download: http://chilp.it/6c4620c
Run RavenDB on a SSD or RamDisk
Open the browser on http://locahost:8080
Create a database named: BenchmarkDB (dont forget to do this or you are not going to hit the proper codepath).
Run the content of writes.cmd on bash (will run WRK localhost to generate contention against your local database).

Let me know how it went. If you need further assistance, let me know and we can setup a shared screen call to work on it.

EDIT: Updated the link because for some reason it wasnt working.

kouvel referenced this issue in kouvel/coreclr Aug 23, 2017
Related to issue mentioned in https://github.com/dotnet/coreclr/issues/13388
- Multipying the YieldProcessor count by proc count can cause excessive delays that are not fruitful on machines with a large number of procs. Even on a 12-proc machine (6-core), the heuristics as they are without the multiply seem to perform much better.
- The issue above also mentions that the delay of PAUSE on Intel Skylake+ processors have a significantly larger delay (140 cycles vs 10 cycles). Simulating that by multiplying the YieldProcessor count by 14 shows that in both tests tested, it begins crawling at low thread counts.
- I did most of the testing on ManualResetEventSlim, and since Task is using the same spin heuristics, applied the same change there as well.
kouvel referenced this issue in dotnet/coreclr Aug 25, 2017
Related to issue mentioned in https://github.com/dotnet/coreclr/issues/13388
- Multipying the YieldProcessor count by proc count can cause excessive delays that are not fruitful on machines with a large number of procs. Even on a 12-proc machine (6-core), the heuristics as they are without the multiply seem to perform much better.
- The issue above also mentions that the delay of PAUSE on Intel Skylake+ processors have a significantly larger delay (140 cycles vs 10 cycles). Simulating that by multiplying the YieldProcessor count by 14 shows that in both tests tested, it begins crawling at low thread counts.
- I did most of the testing on ManualResetEventSlim, and since Task is using the same spin heuristics, applied the same change there as well.
kouvel referenced this issue in kouvel/coreclr Aug 28, 2017
Related to https://github.com/dotnet/coreclr/issues/13388:
- SpinWait - If SwitchToThread does not switch, Sleep(0) instead since otherwise the iteration is wasted
- Changed ManualResetEventSlim and Task to use SpinWait instead of their custom spinning that was very similar to SpinWait anyway
  - On a single-proc machine, YieldProcessor is ineffective and it will SwitchToThread instead
  - This also removes multiplying the number of YieldProcessor calls per iteration by the number of processors. The multiply could cause artifically long delays on machines with many processors and it would be better to yield/sleep instead. Based on this change, I have tuned the spin counts based that I found to be appropriate from a basic microbenchmark.
kouvel referenced this issue in kouvel/coreclr Aug 31, 2017
Part of fix for https://github.com/dotnet/coreclr/issues/13388

Normalized equivalent of YieldProcessor
- The delay incurred by YieldProcessor is measured once lazily at run-time
- Added YieldProcessorNormalized that yields for a specific duration (the duration is approximately equal to what was measured for one YieldProcessor on a Skylake processor, about 125 cycles). The measurement calculates how many YieldProcessor calls are necessary to get a delay close to the desired duration.
- Changed Thread.SpinWait to use YieldProcessorNormalized

Thread.SpinWait divide count by 7 experiment
- At this point I experimented with changing Thread.SpinWait to divide the requested number of iterations by 7, to see how it fares on perf. On my Sandy Bridge processor, 7 * YieldProcessor == YieldProcessorNormalized. See numbers in PR below.
- Not too many regressions, and the overall perf is somewhat as expected - not much change on Sandy Bridge processor, significant improvement on Skylake processor.
  - I'm discounting the SemaphoreSlim throughput score because it seems to be heavily dependent on Monitor. It would be more interesting to revisit SemaphoreSlim after retuning Monitor's spin heuristics.
  - ReaderWriterLockSlim seems to perform worse on Skylake, the current spin heuristics are not translating well

Spin tuning
- At this point, I abandoned the experiment above and tried to retune spins that use Thread.SpinWait
- General observations
  - YieldProcessor stage
    - At this stage in many places we're currently doing very long spins on YieldProcessor per iteration of the spin loop. In the last YieldProcessor iteration, it amounts to about 70 K cycles on Sandy Bridge and 512 K cycles on Skylake.
    - Long spins on YieldProcessor don't let other work run efficiently. Especially when many scheduled threads all issue a long YieldProcessor, a significant portion of the processor can go unused for a long time.
    - Long spins on YieldProcessor is in some cases helping to reduce contention in high-contention cases, effectively taking away some threads into a long delay. Sleep(1) works much better but has a much higher delay so it's not always appropriate. In other cases, I found that it's better to do more iterations with a shorter YieldProcessor. It would be even better to reduce the contention in the app or to have a proper wait in the sync object, where appropriate.
    - Updated the YieldProcessor measurement above to calculate the number of YieldProcessorNormalized calls that amount to about 900 cycles (this was tuned based on perf), and modified SpinWait's YieldProcessor stage to cap the number of iterations passed to Thread.SpinWait. Effectively, the first few iterations have a longer delay than before on Sandy Bridge and a shorter delay than before on Skylake, and the later iterations have a much shorter delay than before on both.
  - Yield/Sleep(0) stage
    - Observed a couple of issues:
      - When there are no threads to switch to, Yield and Sleep(0) become no-op and it turns the spin loop into a busy-spin that may quickly reach the max spin count and cause the thread to enter a wait state, or may just busy-spin for longer than desired before a Sleep(1). Completing the spin loop too early can cause excessive context switcing if a wait follows, and entering the Sleep(1) stage too early can cause excessive delays.
      - If there are multiple threads doing Yield and Sleep(0) (typically from the same spin loop due to contention), they may switch between one another, delaying work that can make progress.
    - I found that it works well to interleave a Yield/Sleep(0) with YieldProcessor, it enforces a minimum delay for this stage. Modified SpinWait to do this until it reaches the Sleep(1) threshold.
  - Sleep(1) stage
    - I didn't see any benefit in the tests to interleave Sleep(1) calls with some Yield/Sleep(0) calls, perf seemed to be a bit worse actually. If the Sleep(1) stage is reached, there is probably a lot of contention and the Sleep(1) stage helps to remove some threads from the equation for a while. Adding some Yield/Sleep(0) in-between seems to add back some of that contention.
      - Modified SpinWait to use a Sleep(1) threshold, after which point it only does Sleep(1) on each spin iteration
    - For the Sleep(1) threshold, I couldn't find one constant that works well in all cases
      - For spin loops that are followed by a proper wait (such as a wait on an event that is signaled when the resource becomes available), they benefit from not doing Sleep(1) at all, and spinning in other stages for longer
      - For infinite spin loops, they usually seemed to benefit from a lower Sleep(1) threshold to reduce contention, but the threshold also depends on other factors like how much work is done in each spin iteration, how efficient waiting is, and whether waiting has any negative side-effects.
      - Added an internal overload of SpinWait.SpinOnce to take the Sleep(1) threshold as a parameter
- SpinWait - Tweaked the spin strategy as mentioned above
- ManualResetEventSlim - Changed to use SpinWait, retuned the default number of iterations (total delay is still significantly less than before). Retained the previous behavior of having Sleep(1) if a higher spin count is requested.
- Task - It was using the same heuristics as ManualResetEventSlim, copied the changes here as well
- SemaphoreSlim - Changed to use SpinWait, retuned similarly to ManualResetEventSlim but with double the number of iterations because the wait path is a lot more expensive
- SpinLock - SpinLock was using very long YieldProcessor spins. Changed to use SpinWait, removed process count multiplier, simplified.
- ReaderWriterLockSlim - This one is complicated as there are many issues. The current spin heuristics performed better even after normalizing Thread.SpinWait but without changing the SpinWait iterations (the delay is longer than before), so I left this one as is.
- The perf (see numbers in PR below) seems to be much better than both the baseline and the Thread.SpinWait divide by 7 experiment
  - On Sandy Bridge, I didn't see many significant regressions. ReaderWriterLockSlim is a bit worse in some cases and a bit better in other similar cases, but at least the really low scores in the baseline got much better and not the other way around.
  - On Skylake, some significant regressions are in SemaphoreSlim throughput (which I'm discounting as I mentioned above in the experiment) and CountdownEvent add/signal throughput. The latter can probably be improved later.
@redknightlois
Copy link

@kouvel I have the collection, but I cannot open the interesting stuff with PerfView itself :D .

StackTrace:
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Windows.Controls.DataGridTextColumn.GenerateElement(DataGridCell cell, Object dataItem)
   at System.Windows.Controls.DataGridCell.BuildVisualTree()
   at System.Windows.Controls.DataGridCell.PrepareCell(Object item, DataGridRow ownerRow, Int32 index)
   at System.Windows.Controls.Primitives.DataGridCellsPresenter.PrepareContainerForItemOverride(DependencyObject element, Object item)

Installing the Windows ADK but probably I wont be able to analyze the results today.

@kouvel
Copy link
Member

kouvel commented Aug 31, 2017

Hmm can you try on another machine when you get a chance? Also check the symbol paths under the File menu and click "Add Microsoft Symbol Server" before analyzing. After opening CPU Stacks, click a row and Ctrl+A to select all and Alt+S to load symbols. Hopefully we can see from that where the CPU time is going.

kouvel referenced this issue in dotnet/coreclr Sep 1, 2017
…13670)

* Add normalized equivalent of YieldProcessor, retune some spin loops

Part of fix for https://github.com/dotnet/coreclr/issues/13388

Normalized equivalent of YieldProcessor
- The delay incurred by YieldProcessor is measured once lazily at run-time
- Added YieldProcessorNormalized that yields for a specific duration (the duration is approximately equal to what was measured for one YieldProcessor on a Skylake processor, about 125 cycles). The measurement calculates how many YieldProcessor calls are necessary to get a delay close to the desired duration.
- Changed Thread.SpinWait to use YieldProcessorNormalized

Thread.SpinWait divide count by 7 experiment
- At this point I experimented with changing Thread.SpinWait to divide the requested number of iterations by 7, to see how it fares on perf. On my Sandy Bridge processor, 7 * YieldProcessor == YieldProcessorNormalized. See numbers in PR below.
- Not too many regressions, and the overall perf is somewhat as expected - not much change on Sandy Bridge processor, significant improvement on Skylake processor.
  - I'm discounting the SemaphoreSlim throughput score because it seems to be heavily dependent on Monitor. It would be more interesting to revisit SemaphoreSlim after retuning Monitor's spin heuristics.
  - ReaderWriterLockSlim seems to perform worse on Skylake, the current spin heuristics are not translating well

Spin tuning
- At this point, I abandoned the experiment above and tried to retune spins that use Thread.SpinWait
- General observations
  - YieldProcessor stage
    - At this stage in many places we're currently doing very long spins on YieldProcessor per iteration of the spin loop. In the last YieldProcessor iteration, it amounts to about 70 K cycles on Sandy Bridge and 512 K cycles on Skylake.
    - Long spins on YieldProcessor don't let other work run efficiently. Especially when many scheduled threads all issue a long YieldProcessor, a significant portion of the processor can go unused for a long time.
    - Long spins on YieldProcessor is in some cases helping to reduce contention in high-contention cases, effectively taking away some threads into a long delay. Sleep(1) works much better but has a much higher delay so it's not always appropriate. In other cases, I found that it's better to do more iterations with a shorter YieldProcessor. It would be even better to reduce the contention in the app or to have a proper wait in the sync object, where appropriate.
    - Updated the YieldProcessor measurement above to calculate the number of YieldProcessorNormalized calls that amount to about 900 cycles (this was tuned based on perf), and modified SpinWait's YieldProcessor stage to cap the number of iterations passed to Thread.SpinWait. Effectively, the first few iterations have a longer delay than before on Sandy Bridge and a shorter delay than before on Skylake, and the later iterations have a much shorter delay than before on both.
  - Yield/Sleep(0) stage
    - Observed a couple of issues:
      - When there are no threads to switch to, Yield and Sleep(0) become no-op and it turns the spin loop into a busy-spin that may quickly reach the max spin count and cause the thread to enter a wait state, or may just busy-spin for longer than desired before a Sleep(1). Completing the spin loop too early can cause excessive context switcing if a wait follows, and entering the Sleep(1) stage too early can cause excessive delays.
      - If there are multiple threads doing Yield and Sleep(0) (typically from the same spin loop due to contention), they may switch between one another, delaying work that can make progress.
    - I found that it works well to interleave a Yield/Sleep(0) with YieldProcessor, it enforces a minimum delay for this stage. Modified SpinWait to do this until it reaches the Sleep(1) threshold.
  - Sleep(1) stage
    - I didn't see any benefit in the tests to interleave Sleep(1) calls with some Yield/Sleep(0) calls, perf seemed to be a bit worse actually. If the Sleep(1) stage is reached, there is probably a lot of contention and the Sleep(1) stage helps to remove some threads from the equation for a while. Adding some Yield/Sleep(0) in-between seems to add back some of that contention.
      - Modified SpinWait to use a Sleep(1) threshold, after which point it only does Sleep(1) on each spin iteration
    - For the Sleep(1) threshold, I couldn't find one constant that works well in all cases
      - For spin loops that are followed by a proper wait (such as a wait on an event that is signaled when the resource becomes available), they benefit from not doing Sleep(1) at all, and spinning in other stages for longer
      - For infinite spin loops, they usually seemed to benefit from a lower Sleep(1) threshold to reduce contention, but the threshold also depends on other factors like how much work is done in each spin iteration, how efficient waiting is, and whether waiting has any negative side-effects.
      - Added an internal overload of SpinWait.SpinOnce to take the Sleep(1) threshold as a parameter
- SpinWait - Tweaked the spin strategy as mentioned above
- ManualResetEventSlim - Changed to use SpinWait, retuned the default number of iterations (total delay is still significantly less than before). Retained the previous behavior of having Sleep(1) if a higher spin count is requested.
- Task - It was using the same heuristics as ManualResetEventSlim, copied the changes here as well
- SemaphoreSlim - Changed to use SpinWait, retuned similarly to ManualResetEventSlim but with double the number of iterations because the wait path is a lot more expensive
- SpinLock - SpinLock was using very long YieldProcessor spins. Changed to use SpinWait, removed process count multiplier, simplified.
- ReaderWriterLockSlim - This one is complicated as there are many issues. The current spin heuristics performed better even after normalizing Thread.SpinWait but without changing the SpinWait iterations (the delay is longer than before), so I left this one as is.
- The perf (see numbers in PR below) seems to be much better than both the baseline and the Thread.SpinWait divide by 7 experiment
  - On Sandy Bridge, I didn't see many significant regressions. ReaderWriterLockSlim is a bit worse in some cases and a bit better in other similar cases, but at least the really low scores in the baseline got much better and not the other way around.
  - On Skylake, some significant regressions are in SemaphoreSlim throughput (which I'm discounting as I mentioned above in the experiment) and CountdownEvent add/signal throughput. The latter can probably be improved later.
@redknightlois
Copy link

From interaction with some kernel devs, it was noted that the consumption shown there could be a artifact of the measurement itself. So this is the picture I could gather from the interruptions themselves.

image

With the exception of the half second used by a function which contains the SpinLock name in it, no other looks like it is. However I don't know what conclusion to make based on the information that from 60s, 12s are spent on interrupts and DPCs. Does this add anything to the current issue?

@kouvel
Copy link
Member

kouvel commented Sep 1, 2017

If you're seeing a significant amount of time spent in context switching, I would suggest passing in a higher spin count to ManualResetEventSlim (assuming that's the source of the context switches) to see if that improves the overall performance. Lots of context switching is not necessarily worse than lots of spinning but usually it is worse, it depends on how much CPU time would be taken away from other work (if any) by spinning.

@kouvel
Copy link
Member

kouvel commented Sep 1, 2017

That said, MRES has a threshold after which it goes into a Sleep(1), which can significantly reduce latency. May want to avoid that threshold, but worth a try. You could also try using ManualResetEvent (not slim) with your own custom spinning for the purpose of experimentation.

@kouvel
Copy link
Member

kouvel commented Sep 1, 2017

Could also try ManualResetEvent (not slim) by itself (no spinning), as Yield/Sleep(0) can also contribute to context switching

@redknightlois
Copy link

redknightlois commented Sep 5, 2017

OK. After much analysis work I can conclude and confirm:

  • With the latest commit (03bf95c) on CoreCLR 2.1 branch, spin on ManualResetEventSlim is no longer an issue even though we are particularly limited by Kestrel PInvokes on Libuv & Async machinery.
  • Latest commit have higher throughput than intermediate solution.
  • High certainty that kernel measurements are artifacts (kernel waits are not cheap, but nothing we didn't know about).
  • Windows Performance Recorder makes it hard (unless there is a way I didn't find) to filter out all instances of System\ETW Overhead from the measurements :)

So the fix LGTM. cc @helloguo

dotnet-bot referenced this issue in dotnet/corert Sep 6, 2017
…#13670)

* Add normalized equivalent of YieldProcessor, retune some spin loops

Part of fix for https://github.com/dotnet/coreclr/issues/13388

Normalized equivalent of YieldProcessor
- The delay incurred by YieldProcessor is measured once lazily at run-time
- Added YieldProcessorNormalized that yields for a specific duration (the duration is approximately equal to what was measured for one YieldProcessor on a Skylake processor, about 125 cycles). The measurement calculates how many YieldProcessor calls are necessary to get a delay close to the desired duration.
- Changed Thread.SpinWait to use YieldProcessorNormalized

Thread.SpinWait divide count by 7 experiment
- At this point I experimented with changing Thread.SpinWait to divide the requested number of iterations by 7, to see how it fares on perf. On my Sandy Bridge processor, 7 * YieldProcessor == YieldProcessorNormalized. See numbers in PR below.
- Not too many regressions, and the overall perf is somewhat as expected - not much change on Sandy Bridge processor, significant improvement on Skylake processor.
  - I'm discounting the SemaphoreSlim throughput score because it seems to be heavily dependent on Monitor. It would be more interesting to revisit SemaphoreSlim after retuning Monitor's spin heuristics.
  - ReaderWriterLockSlim seems to perform worse on Skylake, the current spin heuristics are not translating well

Spin tuning
- At this point, I abandoned the experiment above and tried to retune spins that use Thread.SpinWait
- General observations
  - YieldProcessor stage
    - At this stage in many places we're currently doing very long spins on YieldProcessor per iteration of the spin loop. In the last YieldProcessor iteration, it amounts to about 70 K cycles on Sandy Bridge and 512 K cycles on Skylake.
    - Long spins on YieldProcessor don't let other work run efficiently. Especially when many scheduled threads all issue a long YieldProcessor, a significant portion of the processor can go unused for a long time.
    - Long spins on YieldProcessor is in some cases helping to reduce contention in high-contention cases, effectively taking away some threads into a long delay. Sleep(1) works much better but has a much higher delay so it's not always appropriate. In other cases, I found that it's better to do more iterations with a shorter YieldProcessor. It would be even better to reduce the contention in the app or to have a proper wait in the sync object, where appropriate.
    - Updated the YieldProcessor measurement above to calculate the number of YieldProcessorNormalized calls that amount to about 900 cycles (this was tuned based on perf), and modified SpinWait's YieldProcessor stage to cap the number of iterations passed to Thread.SpinWait. Effectively, the first few iterations have a longer delay than before on Sandy Bridge and a shorter delay than before on Skylake, and the later iterations have a much shorter delay than before on both.
  - Yield/Sleep(0) stage
    - Observed a couple of issues:
      - When there are no threads to switch to, Yield and Sleep(0) become no-op and it turns the spin loop into a busy-spin that may quickly reach the max spin count and cause the thread to enter a wait state, or may just busy-spin for longer than desired before a Sleep(1). Completing the spin loop too early can cause excessive context switcing if a wait follows, and entering the Sleep(1) stage too early can cause excessive delays.
      - If there are multiple threads doing Yield and Sleep(0) (typically from the same spin loop due to contention), they may switch between one another, delaying work that can make progress.
    - I found that it works well to interleave a Yield/Sleep(0) with YieldProcessor, it enforces a minimum delay for this stage. Modified SpinWait to do this until it reaches the Sleep(1) threshold.
  - Sleep(1) stage
    - I didn't see any benefit in the tests to interleave Sleep(1) calls with some Yield/Sleep(0) calls, perf seemed to be a bit worse actually. If the Sleep(1) stage is reached, there is probably a lot of contention and the Sleep(1) stage helps to remove some threads from the equation for a while. Adding some Yield/Sleep(0) in-between seems to add back some of that contention.
      - Modified SpinWait to use a Sleep(1) threshold, after which point it only does Sleep(1) on each spin iteration
    - For the Sleep(1) threshold, I couldn't find one constant that works well in all cases
      - For spin loops that are followed by a proper wait (such as a wait on an event that is signaled when the resource becomes available), they benefit from not doing Sleep(1) at all, and spinning in other stages for longer
      - For infinite spin loops, they usually seemed to benefit from a lower Sleep(1) threshold to reduce contention, but the threshold also depends on other factors like how much work is done in each spin iteration, how efficient waiting is, and whether waiting has any negative side-effects.
      - Added an internal overload of SpinWait.SpinOnce to take the Sleep(1) threshold as a parameter
- SpinWait - Tweaked the spin strategy as mentioned above
- ManualResetEventSlim - Changed to use SpinWait, retuned the default number of iterations (total delay is still significantly less than before). Retained the previous behavior of having Sleep(1) if a higher spin count is requested.
- Task - It was using the same heuristics as ManualResetEventSlim, copied the changes here as well
- SemaphoreSlim - Changed to use SpinWait, retuned similarly to ManualResetEventSlim but with double the number of iterations because the wait path is a lot more expensive
- SpinLock - SpinLock was using very long YieldProcessor spins. Changed to use SpinWait, removed process count multiplier, simplified.
- ReaderWriterLockSlim - This one is complicated as there are many issues. The current spin heuristics performed better even after normalizing Thread.SpinWait but without changing the SpinWait iterations (the delay is longer than before), so I left this one as is.
- The perf (see numbers in PR below) seems to be much better than both the baseline and the Thread.SpinWait divide by 7 experiment
  - On Sandy Bridge, I didn't see many significant regressions. ReaderWriterLockSlim is a bit worse in some cases and a bit better in other similar cases, but at least the really low scores in the baseline got much better and not the other way around.
  - On Skylake, some significant regressions are in SemaphoreSlim throughput (which I'm discounting as I mentioned above in the experiment) and CountdownEvent add/signal throughput. The latter can probably be improved later.

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
jkotas referenced this issue in dotnet/corert Sep 6, 2017
…#13670)

* Add normalized equivalent of YieldProcessor, retune some spin loops

Part of fix for https://github.com/dotnet/coreclr/issues/13388

Normalized equivalent of YieldProcessor
- The delay incurred by YieldProcessor is measured once lazily at run-time
- Added YieldProcessorNormalized that yields for a specific duration (the duration is approximately equal to what was measured for one YieldProcessor on a Skylake processor, about 125 cycles). The measurement calculates how many YieldProcessor calls are necessary to get a delay close to the desired duration.
- Changed Thread.SpinWait to use YieldProcessorNormalized

Thread.SpinWait divide count by 7 experiment
- At this point I experimented with changing Thread.SpinWait to divide the requested number of iterations by 7, to see how it fares on perf. On my Sandy Bridge processor, 7 * YieldProcessor == YieldProcessorNormalized. See numbers in PR below.
- Not too many regressions, and the overall perf is somewhat as expected - not much change on Sandy Bridge processor, significant improvement on Skylake processor.
  - I'm discounting the SemaphoreSlim throughput score because it seems to be heavily dependent on Monitor. It would be more interesting to revisit SemaphoreSlim after retuning Monitor's spin heuristics.
  - ReaderWriterLockSlim seems to perform worse on Skylake, the current spin heuristics are not translating well

Spin tuning
- At this point, I abandoned the experiment above and tried to retune spins that use Thread.SpinWait
- General observations
  - YieldProcessor stage
    - At this stage in many places we're currently doing very long spins on YieldProcessor per iteration of the spin loop. In the last YieldProcessor iteration, it amounts to about 70 K cycles on Sandy Bridge and 512 K cycles on Skylake.
    - Long spins on YieldProcessor don't let other work run efficiently. Especially when many scheduled threads all issue a long YieldProcessor, a significant portion of the processor can go unused for a long time.
    - Long spins on YieldProcessor is in some cases helping to reduce contention in high-contention cases, effectively taking away some threads into a long delay. Sleep(1) works much better but has a much higher delay so it's not always appropriate. In other cases, I found that it's better to do more iterations with a shorter YieldProcessor. It would be even better to reduce the contention in the app or to have a proper wait in the sync object, where appropriate.
    - Updated the YieldProcessor measurement above to calculate the number of YieldProcessorNormalized calls that amount to about 900 cycles (this was tuned based on perf), and modified SpinWait's YieldProcessor stage to cap the number of iterations passed to Thread.SpinWait. Effectively, the first few iterations have a longer delay than before on Sandy Bridge and a shorter delay than before on Skylake, and the later iterations have a much shorter delay than before on both.
  - Yield/Sleep(0) stage
    - Observed a couple of issues:
      - When there are no threads to switch to, Yield and Sleep(0) become no-op and it turns the spin loop into a busy-spin that may quickly reach the max spin count and cause the thread to enter a wait state, or may just busy-spin for longer than desired before a Sleep(1). Completing the spin loop too early can cause excessive context switcing if a wait follows, and entering the Sleep(1) stage too early can cause excessive delays.
      - If there are multiple threads doing Yield and Sleep(0) (typically from the same spin loop due to contention), they may switch between one another, delaying work that can make progress.
    - I found that it works well to interleave a Yield/Sleep(0) with YieldProcessor, it enforces a minimum delay for this stage. Modified SpinWait to do this until it reaches the Sleep(1) threshold.
  - Sleep(1) stage
    - I didn't see any benefit in the tests to interleave Sleep(1) calls with some Yield/Sleep(0) calls, perf seemed to be a bit worse actually. If the Sleep(1) stage is reached, there is probably a lot of contention and the Sleep(1) stage helps to remove some threads from the equation for a while. Adding some Yield/Sleep(0) in-between seems to add back some of that contention.
      - Modified SpinWait to use a Sleep(1) threshold, after which point it only does Sleep(1) on each spin iteration
    - For the Sleep(1) threshold, I couldn't find one constant that works well in all cases
      - For spin loops that are followed by a proper wait (such as a wait on an event that is signaled when the resource becomes available), they benefit from not doing Sleep(1) at all, and spinning in other stages for longer
      - For infinite spin loops, they usually seemed to benefit from a lower Sleep(1) threshold to reduce contention, but the threshold also depends on other factors like how much work is done in each spin iteration, how efficient waiting is, and whether waiting has any negative side-effects.
      - Added an internal overload of SpinWait.SpinOnce to take the Sleep(1) threshold as a parameter
- SpinWait - Tweaked the spin strategy as mentioned above
- ManualResetEventSlim - Changed to use SpinWait, retuned the default number of iterations (total delay is still significantly less than before). Retained the previous behavior of having Sleep(1) if a higher spin count is requested.
- Task - It was using the same heuristics as ManualResetEventSlim, copied the changes here as well
- SemaphoreSlim - Changed to use SpinWait, retuned similarly to ManualResetEventSlim but with double the number of iterations because the wait path is a lot more expensive
- SpinLock - SpinLock was using very long YieldProcessor spins. Changed to use SpinWait, removed process count multiplier, simplified.
- ReaderWriterLockSlim - This one is complicated as there are many issues. The current spin heuristics performed better even after normalizing Thread.SpinWait but without changing the SpinWait iterations (the delay is longer than before), so I left this one as is.
- The perf (see numbers in PR below) seems to be much better than both the baseline and the Thread.SpinWait divide by 7 experiment
  - On Sandy Bridge, I didn't see many significant regressions. ReaderWriterLockSlim is a bit worse in some cases and a bit better in other similar cases, but at least the really low scores in the baseline got much better and not the other way around.
  - On Skylake, some significant regressions are in SemaphoreSlim throughput (which I'm discounting as I mentioned above in the experiment) and CountdownEvent add/signal throughput. The latter can probably be improved later.

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
@kouvel
Copy link
Member

kouvel commented Sep 6, 2017

Thanks @redknightlois. How much improvement are you seeing from dotnet/coreclr@03bf95c? That change has more risk of regressions so I wasn't considering porting it back to 2.0 but if there is a significant improvement in some real scenarios, wondering if it would be worth a try.

@redknightlois
Copy link

@kouvel we jumped from 27K request/sec to 30K request per sec, so that's roughly 10% improvement.

@redknightlois
Copy link

Any ETA on the service release now that the 2.0.1 is scheduled to be (if not already) out in the open?

@kouvel
Copy link
Member

kouvel commented Sep 28, 2017

I'm not sure about the cadence of .NET Core servicing releases, @karelz, is there a rough timeframe for 2.0.2 now?

@karelz
Copy link
Member

karelz commented Sep 28, 2017

@leecow do we have rough ETA for 2.0.1 release?
Is it something we could put on roadmap? (with caveat that ETAs can change)

@kouvel
Copy link
Member

kouvel commented Sep 28, 2017

I think @redknightlois is looking for 2.0.2, as the fix was available when 2.0.1 was almost finalized

@helloguo helloguo changed the title Spin count tuning Spin wait tuning Sep 28, 2017
@redknightlois
Copy link

redknightlois commented Sep 29, 2017

@karelz @kouvel Yes, looking for the 2.0.2 release :)

@kouvel kouvel removed their assignment Oct 31, 2017
@redknightlois
Copy link

Shouldnt this be closed?

@kouvel
Copy link
Member

kouvel commented May 16, 2018

Yes some spinning cases remain but they are covered by other issues

@kouvel kouvel closed this as completed May 16, 2018
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@danmoseley
Copy link
Member

@StephenBonikowsky this issue is closed but it went on the partner board. If there's a partner asking for work here, can you please help them create a new issue.

@StephenBonikowsky
Copy link
Member

I followed-up with the customer, what they were reporting was this issue but in their .NET Framework service.
I've removed the project link.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 21, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests