Spinning in GC: Consider switching to using YieldProcessorNormalized() #8936

kouvel · 2017-09-14T18:50:40Z

Consider switching to using Thread::YieldProcessorNormalized() instead of YieldProcessor() (after calling Thread::EnsureYieldProcessorNormalizedInitialized() before the spin loop)

Issue https://github.com/dotnet/coreclr/issues/13388 mentions that from Intel Skylake architecture, the pause delay has been significantly increased, so any spin loops that do a large number of YieldProcessor() calls per spin iteration, or spin loops that do a large number of spin iterations over YieldProcessor() (such as enter_spin_lock), may be significantly less efficient on Skylake and newer processors
PR Add normalized equivalent of YieldProcessor, retune some spin loops coreclr#13670 added Thread::YieldProcessorNormalized() to measure and normalize the delay per call across different processors. Since the change in the processor would require retuning spin loops such as above based on the processor, the idea is that switching to YieldProcessorNormalized() will also require retuning but only once and from then hopefully the spin loop would work similarly on different processors.
YieldProcessorNormalizedWithBackOff() increases the delay up to a maximum (also scaled for the processor) that translates to a delay of about 900 cycles. The observation based on other testing was that significantly beyond that, Sleep(0) typically produces more efficient spinning by letting other threads run. This may or may not be useful, but it's there if it's found to be useful.
For instance, in enter_spin_lock with server GC, the max spin count on a 4-core (8-thread) Skylake processor would be equivalent to a total delay of 1 M cycles, which is about 300 us. On a pre-Skylake processor with the same core/thread count, the total delay is about 140 K cycles, which is about 40 us.
See Add normalized equivalent of YieldProcessor, retune some spin loops coreclr#13670 for more details on investigations and issues discovered with SpinWait's and other spin-waiting schemes in the libraries. Some ideas from there may also be employed to improve spinning.

swgillespie · 2017-09-14T19:10:08Z

cc @Maoni0

Maoni0 · 2017-09-14T22:57:08Z

I would like to use this in GC except it can't be a method off of the Thread class as we'd want to use it on our native threads too (like the Server GC threads). There's nothing specific about this that requires it to be on the Thread class. Could we move it off of the Thread and make it into a generic utility function?

Also cc @helloguo (PR#13527)

kouvel · 2017-09-14T22:59:51Z

Sure I can move it out to a global function as part of https://github.com/dotnet/coreclr/issues/13984. Is there any specific place it needs to be for the GC to be able to use it?

Maoni0 · 2017-09-14T23:25:00Z

If you are going to move this to the finalizer thread, I'd imagine you just have a default value of 1 for s_yieldsPerNormalizedYield (and change the check in IsInitialized obviously), that way YieldProcessorNormalized can always be called, just wouldn't be called with an optimal value till init is done on newer chips. Then GC (or anything else) can always call YieldProcessNormalized, before you do the actual calucation for s_yieldsPerNormalizedYield and doesn't need to worry about when it's initialized.

kouvel · 2017-09-15T00:19:17Z

I was planning on using the current values on a Skylake processor for the defaults before it's initialized (1 yield == 1 normalized yield, 7 for the max per spin iteration iirc). And yes I'm also planning on removing EnsureYieldProcessorNormalizedInitialized and having YieldProcessorNormalized itself request initialization if not already done so, while using the defaults meanwhile.

Maoni0 · 2017-09-15T00:29:59Z

Maybe I'm missing something but I don't know why you would want YieldProcessorNormalized itself to request initialization at all - why incur a 10ms pause in the middle of something when you can have the 10ms done on the finalizer thread (most of the time) on a different CPU? I thought that's what dotnet/coreclr#13984 is about. The finalizer thread would just run the init and whenever it's finished the YieldProcessorNormalized calls will use the value set by the init. So the calls to YieldProcessorNormalized in first 10ms or so may not use the more optimal value but that's ok. After that they will.

kouvel · 2017-09-15T01:53:25Z

By requesting initialization I just meant to trigger the background initialization without waiting for it to complete, what you described matches what I was thinking

Maoni0 · 2017-09-15T19:33:11Z

Just to clarify - what I meant was to have YieldProcessorNormalized itself not request initialization at all; instead you just queue the work to the finalizer thread as part of the EE startup and whenever that's done YieldProcessorNormalized will start using the more optimized value computed. If this is the same as what you were thinking too that's great.

kouvel · 2017-09-15T19:36:13Z

Oh ok, no that's not what I was thinking. I guess that makes sense, with the GC using it, the initialization will pretty much have to happen at some point in early stages of the app, so there probably won't be much benefit from lazy-initializing based on use.

Maoni0 · 2017-09-15T19:38:19Z

Yep :-)

davmason · 2018-03-27T22:54:41Z

@Maoni0 @kouvel I'm looking in to this now, it seems like the easiest way to do this would be to add a YieldProcessor method to the GCToEEInterface, then call YieldProcessorNormalized from within the EE. There would be some overhead, but it seems ok when the whole point is waiting.

To switch in standalone GC without adding an interface method, we would have to duplicate all of the YieldProcessorNormalized code on the GC side, and I would prefer to avoid code duplication.

Does that line up with what you two are thinking?

Maoni0 · 2018-03-27T22:57:34Z

yes please add a method on the interface for this.

kouvel · 2018-03-27T22:58:14Z

The initialization is done in the finalizer thread at the beginning. If you want to avoid the overhead it may be possible to have an interface call to get the resulting values from initialization somehow and just clone the Yield functions.

kouvel · 2018-03-27T23:08:08Z

Also FYI:

If longer fixed-iteration spin loops were previously tuned for pre-Skylake processors, they would currently be spinning for longer on Skylake processors
YieldProcessorNormalized() normalizes the delay per pause iteration to something close to that on Skylake processors, so switching to that would increase the spinning duration on pre-Skylake processors.
It may be necessary to retune the longer spin loops once based on perf comparisons

Maoni0 · 2018-03-27T23:10:20Z

this Yield function?

FORCEINLINE void YieldProcessorNormalized(const YieldProcessorNormalizationInfo &normalizationInfo)
{
    LIMITED_METHOD_CONTRACT;

    unsigned int n = normalizationInfo.yieldsPerNormalizedYield;
    _ASSERTE(n != 0);
    do
    {
        YieldProcessor();
    } while (--n != 0);
}

and you are suggesting the interface just returns g_yieldsPerNormalizedYield (and GC's yield function would just loop g_yieldsPerNormalizedYield times)?

kouvel · 2018-03-27T23:13:37Z

Yea that's what I was thinking. If there are spin loops with exponential back-off and if you want to use it there's another variant of YieldProcessorNormalized() for that which takes another piece of initialization data.

Maoni0 · 2018-03-29T22:07:45Z

I am pushing this to post 2.1 as we need to concentrate on getting LocalGC to work and this is not essential to have for 2.1 ZBB.

Maoni0 · 2019-01-30T23:07:17Z

this was fixed by dotnet/coreclr#21523

davmason self-assigned this Mar 23, 2018

Maoni0 closed this as completed Jan 30, 2019

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

ghost locked as resolved and limited conversation to collaborators Dec 20, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spinning in GC: Consider switching to using YieldProcessorNormalized() #8936

Spinning in GC: Consider switching to using YieldProcessorNormalized() #8936

kouvel commented Sep 14, 2017

swgillespie commented Sep 14, 2017

Maoni0 commented Sep 14, 2017

kouvel commented Sep 14, 2017

Maoni0 commented Sep 14, 2017 •

edited

Loading

kouvel commented Sep 15, 2017

Maoni0 commented Sep 15, 2017

kouvel commented Sep 15, 2017

Maoni0 commented Sep 15, 2017

kouvel commented Sep 15, 2017

Maoni0 commented Sep 15, 2017

davmason commented Mar 27, 2018

Maoni0 commented Mar 27, 2018

kouvel commented Mar 27, 2018

kouvel commented Mar 27, 2018

Maoni0 commented Mar 27, 2018

kouvel commented Mar 27, 2018

Maoni0 commented Mar 29, 2018

Maoni0 commented Jan 30, 2019

Spinning in GC: Consider switching to using YieldProcessorNormalized() #8936

Spinning in GC: Consider switching to using YieldProcessorNormalized() #8936

Comments

kouvel commented Sep 14, 2017

swgillespie commented Sep 14, 2017

Maoni0 commented Sep 14, 2017

kouvel commented Sep 14, 2017

Maoni0 commented Sep 14, 2017 • edited Loading

kouvel commented Sep 15, 2017

Maoni0 commented Sep 15, 2017

kouvel commented Sep 15, 2017

Maoni0 commented Sep 15, 2017

kouvel commented Sep 15, 2017

Maoni0 commented Sep 15, 2017

davmason commented Mar 27, 2018

Maoni0 commented Mar 27, 2018

kouvel commented Mar 27, 2018

kouvel commented Mar 27, 2018

Maoni0 commented Mar 27, 2018

kouvel commented Mar 27, 2018

Maoni0 commented Mar 29, 2018

Maoni0 commented Jan 30, 2019

Maoni0 commented Sep 14, 2017 •

edited

Loading