Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spinning in GC: Consider switching to using YieldProcessorNormalized() #8936

Closed
kouvel opened this issue Sep 14, 2017 · 18 comments
Closed

Spinning in GC: Consider switching to using YieldProcessorNormalized() #8936

kouvel opened this issue Sep 14, 2017 · 18 comments
Assignees
Labels
area-GC-coreclr tenet-performance Performance related issue
Milestone

Comments

@kouvel
Copy link
Member

kouvel commented Sep 14, 2017

Consider switching to using Thread::YieldProcessorNormalized() instead of YieldProcessor() (after calling Thread::EnsureYieldProcessorNormalizedInitialized() before the spin loop)

  • Issue https://github.com/dotnet/coreclr/issues/13388 mentions that from Intel Skylake architecture, the pause delay has been significantly increased, so any spin loops that do a large number of YieldProcessor() calls per spin iteration, or spin loops that do a large number of spin iterations over YieldProcessor() (such as enter_spin_lock), may be significantly less efficient on Skylake and newer processors
  • PR Add normalized equivalent of YieldProcessor, retune some spin loops coreclr#13670 added Thread::YieldProcessorNormalized() to measure and normalize the delay per call across different processors. Since the change in the processor would require retuning spin loops such as above based on the processor, the idea is that switching to YieldProcessorNormalized() will also require retuning but only once and from then hopefully the spin loop would work similarly on different processors.
  • YieldProcessorNormalizedWithBackOff() increases the delay up to a maximum (also scaled for the processor) that translates to a delay of about 900 cycles. The observation based on other testing was that significantly beyond that, Sleep(0) typically produces more efficient spinning by letting other threads run. This may or may not be useful, but it's there if it's found to be useful.
  • For instance, in enter_spin_lock with server GC, the max spin count on a 4-core (8-thread) Skylake processor would be equivalent to a total delay of 1 M cycles, which is about 300 us. On a pre-Skylake processor with the same core/thread count, the total delay is about 140 K cycles, which is about 40 us.
  • See Add normalized equivalent of YieldProcessor, retune some spin loops coreclr#13670 for more details on investigations and issues discovered with SpinWait's and other spin-waiting schemes in the libraries. Some ideas from there may also be employed to improve spinning.
@swgillespie
Copy link
Contributor

cc @Maoni0

@Maoni0
Copy link
Member

Maoni0 commented Sep 14, 2017

I would like to use this in GC except it can't be a method off of the Thread class as we'd want to use it on our native threads too (like the Server GC threads). There's nothing specific about this that requires it to be on the Thread class. Could we move it off of the Thread and make it into a generic utility function?

Also cc @helloguo (PR#13527)

@kouvel
Copy link
Member Author

kouvel commented Sep 14, 2017

Sure I can move it out to a global function as part of https://github.com/dotnet/coreclr/issues/13984. Is there any specific place it needs to be for the GC to be able to use it?

@Maoni0
Copy link
Member

Maoni0 commented Sep 14, 2017

If you are going to move this to the finalizer thread, I'd imagine you just have a default value of 1 for s_yieldsPerNormalizedYield (and change the check in IsInitialized obviously), that way YieldProcessorNormalized can always be called, just wouldn't be called with an optimal value till init is done on newer chips. Then GC (or anything else) can always call YieldProcessNormalized, before you do the actual calucation for s_yieldsPerNormalizedYield and doesn't need to worry about when it's initialized.

@kouvel
Copy link
Member Author

kouvel commented Sep 15, 2017

I was planning on using the current values on a Skylake processor for the defaults before it's initialized (1 yield == 1 normalized yield, 7 for the max per spin iteration iirc). And yes I'm also planning on removing EnsureYieldProcessorNormalizedInitialized and having YieldProcessorNormalized itself request initialization if not already done so, while using the defaults meanwhile.

@Maoni0
Copy link
Member

Maoni0 commented Sep 15, 2017

Maybe I'm missing something but I don't know why you would want YieldProcessorNormalized itself to request initialization at all - why incur a 10ms pause in the middle of something when you can have the 10ms done on the finalizer thread (most of the time) on a different CPU? I thought that's what dotnet/coreclr#13984 is about. The finalizer thread would just run the init and whenever it's finished the YieldProcessorNormalized calls will use the value set by the init. So the calls to YieldProcessorNormalized in first 10ms or so may not use the more optimal value but that's ok. After that they will.

@kouvel
Copy link
Member Author

kouvel commented Sep 15, 2017

By requesting initialization I just meant to trigger the background initialization without waiting for it to complete, what you described matches what I was thinking

@Maoni0
Copy link
Member

Maoni0 commented Sep 15, 2017

Just to clarify - what I meant was to have YieldProcessorNormalized itself not request initialization at all; instead you just queue the work to the finalizer thread as part of the EE startup and whenever that's done YieldProcessorNormalized will start using the more optimized value computed. If this is the same as what you were thinking too that's great.

@kouvel
Copy link
Member Author

kouvel commented Sep 15, 2017

Oh ok, no that's not what I was thinking. I guess that makes sense, with the GC using it, the initialization will pretty much have to happen at some point in early stages of the app, so there probably won't be much benefit from lazy-initializing based on use.

@Maoni0
Copy link
Member

Maoni0 commented Sep 15, 2017

Yep :-)

@davmason davmason self-assigned this Mar 23, 2018
@davmason
Copy link
Member

@Maoni0 @kouvel I'm looking in to this now, it seems like the easiest way to do this would be to add a YieldProcessor method to the GCToEEInterface, then call YieldProcessorNormalized from within the EE. There would be some overhead, but it seems ok when the whole point is waiting.

To switch in standalone GC without adding an interface method, we would have to duplicate all of the YieldProcessorNormalized code on the GC side, and I would prefer to avoid code duplication.

Does that line up with what you two are thinking?

@Maoni0
Copy link
Member

Maoni0 commented Mar 27, 2018

yes please add a method on the interface for this.

@kouvel
Copy link
Member Author

kouvel commented Mar 27, 2018

The initialization is done in the finalizer thread at the beginning. If you want to avoid the overhead it may be possible to have an interface call to get the resulting values from initialization somehow and just clone the Yield functions.

@kouvel
Copy link
Member Author

kouvel commented Mar 27, 2018

Also FYI:

  • If longer fixed-iteration spin loops were previously tuned for pre-Skylake processors, they would currently be spinning for longer on Skylake processors
  • YieldProcessorNormalized() normalizes the delay per pause iteration to something close to that on Skylake processors, so switching to that would increase the spinning duration on pre-Skylake processors.
  • It may be necessary to retune the longer spin loops once based on perf comparisons

@Maoni0
Copy link
Member

Maoni0 commented Mar 27, 2018

this Yield function?

FORCEINLINE void YieldProcessorNormalized(const YieldProcessorNormalizationInfo &normalizationInfo)
{
    LIMITED_METHOD_CONTRACT;

    unsigned int n = normalizationInfo.yieldsPerNormalizedYield;
    _ASSERTE(n != 0);
    do
    {
        YieldProcessor();
    } while (--n != 0);
}

and you are suggesting the interface just returns g_yieldsPerNormalizedYield (and GC's yield function would just loop g_yieldsPerNormalizedYield times)?

@kouvel
Copy link
Member Author

kouvel commented Mar 27, 2018

Yea that's what I was thinking. If there are spin loops with exponential back-off and if you want to use it there's another variant of YieldProcessorNormalized() for that which takes another piece of initialization data.

@Maoni0
Copy link
Member

Maoni0 commented Mar 29, 2018

I am pushing this to post 2.1 as we need to concentrate on getting LocalGC to work and this is not essential to have for 2.1 ZBB.

@Maoni0
Copy link
Member

Maoni0 commented Jan 30, 2019

this was fixed by dotnet/coreclr#21523

@Maoni0 Maoni0 closed this as completed Jan 30, 2019
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 20, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-GC-coreclr tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

5 participants