Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process.ProcessorAffinity returns a 32-bit value on ARM64 #28051

Closed
adamsitnik opened this issue Dec 4, 2018 · 19 comments
Closed

Process.ProcessorAffinity returns a 32-bit value on ARM64 #28051

adamsitnik opened this issue Dec 4, 2018 · 19 comments

Comments

@adamsitnik
Copy link
Member

When we run the following code on ARM64 machine with 48 cores (Ubuntu) without setting the CPU affinity in explicit way:

Console.WriteLine(System.Diagnostics.Process.GetCurrentProcess().ProcessorAffinity);

We get 00000000FFFFFFFF which is 2^32 - 1 while it should be 2^48 - 1

I don't know if this is specific to ARM64 or Ubuntu or 64 bit in general. I just don't have an access to a machine with more than 32 core to test.

@AndyAyersMS have hit this issue when he was benchmarking .NET Core 3.0 on ARM

@danmoseley
Copy link
Member

Was it possible that 16 of the cores were asleep? As mentioned here
https://github.com/mono/mono/blob/17aa73d0ae216529b59d5aa3d0ee68007bb035c4/mono/utils/mono-proclib.c#L717-L720

@danmoseley
Copy link
Member

Looks like the CLR already fixed this (when getting core count for GC)

dotnet/coreclr#18053
dotnet/coreclr#18289

As for Environment.GetProcessorCount it eventually gets to
https://github.com/dotnet/coreclr/blob/e54dffef08c22a94962aacd93d4793b377cac632/src/classlibnative/bcltype/system.cpp#L237
and then it gets complicated. I am not sure whether it is susceptible to this problem on ARM and needs a fix also.

@danmoseley
Copy link
Member

@janvorli

@AndyAyersMS
Copy link
Member

One place this shows up is in the spectralnorm-3 perf test, which uses Environment.ProcessorCount to figure out how to partition work. It currently runs 700x slower on the arm64 HW than on an 8 core Skylake... I think this is a combination of restrictive affinity, higher synchronization costs (and maybe excessive spinning?), and more frequent synchronization.

Console.WriteLine($"ProcessArchitecture={System.Runtime.InteropServices.RuntimeInformation.ProcessArchitecture}");
Console.WriteLine($"IntPtrSize = {IntPtr.Size}");
ulong a = (ulong) System.Diagnostics.Process.GetCurrentProcess().ProcessorAffinity;
Console.WriteLine("ProcessorAffinity = " + a.ToString("X16"));
Console.WriteLine($"ProcessorCount = {Environment.ProcessorCount}");

gives

ProcessArchitecture = Arm64
IntPtrSize = 8
ProcessorAffinity = 00000000FFFFFFFF
ProcessorCount = 46

@janvorli
Copy link
Member

janvorli commented Dec 4, 2018

I can see what's going on. The System.Diagnostics.Process.ProcessorAffinity is implemented on Linux in CoreFX in SystemNative_SchedGetAffinity function. And there is a bug in there. See the 1u in bits |= (1u << cpu); below. The u suffix means 32 bit and so the value returned will always have max 32 bits set.

        intptr_t bits = 0;
        for (int cpu = 0; cpu < maxCpu; cpu++)
        {
            if (CPU_ISSET(cpu, &set))
            {
                bits |= (1u << cpu);
            }
        }

It should be e.g. bits |= ((intptr_t)1) << cpu; instead.

@janvorli
Copy link
Member

janvorli commented Dec 5, 2018

I've created a PR with a fix: dotnet/corefx#33825

@adamsitnik
Copy link
Member Author

One place this shows up is in the spectralnorm-3 perf test, which uses Environment.ProcessorCount to figure out how to partition work. It currently runs 700x slower on the arm64 HW than on an 8 core Skylake... I think this is a combination of restrictive affinity, higher synchronization costs (and maybe excessive spinning?), and more frequent synchronization.

With the fix from @janvorli we can remove "restrictive affinity" from the list.

I just run spectralnorm-3 on my 12 core x64 Windows machine (6 real cores, HTx2) with Concurrency Visualizer Profiler enabled ( --profiler CV in the perf repo).

The CV tells me that on Windows we use up to 5 cores out of 12:

image

And that 90% of the time is spent for synchronization:

image

So @AndyAyersMS you are most probably right about higher synchronization costs on ARM.

@adamsitnik
Copy link
Member Author

I just read the code of the benchmark. As of today, it creates an array of just 100 doubles and divides it into Environment.ProcessorCount-many parallel tasks.

100 is a very small input here. When I set it to 10000 the CPU consumption is totally different:

image

and it spends less time for synchronizaiton:

image

So it looks like that as of today this benchmark is measuring the perf of synchronization?

@adamsitnik
Copy link
Member Author

Was it possible that 16 of the cores were asleep?

@danmosemsft I did not know that this is possible. I guess that this is why ARM is having lower energy consumption.

@brianrob if we ever run the perf tests for ARM we need to make sure this setting is off

@danmoseley
Copy link
Member

I opened dotnet/corefx#33838 for the other issue

@danmoseley
Copy link
Member

So @AndyAyersMS you are most probably right about higher synchronization costs on ARM.

@kouvel are you/have you looked at synchronization costs on ARM? I am not famliar with this benchmark but 90% of the time is spent for synchronization sounds not ideal.

@janvorli
Copy link
Member

janvorli commented Dec 5, 2018

@danmosemsft this is arm64, not arm. I guess you meant that, but wanted to make it clear for others.

@danmoseley
Copy link
Member

@janvorli thanks, it's confusing that we often call 32 bit ARM just ARM. Is it common for 32 and 64 bit ARM to have quite different performance issues?

@janvorli
Copy link
Member

janvorli commented Dec 5, 2018

@danmosemsft they have different instruction set with different performance characteristics, so I would expect them to be quite different. Btw, the official naming is ARM and ARM64.

@kouvel
Copy link
Member

kouvel commented Dec 5, 2018

I haven't looked at synchronization costs on arm64 and I'm not aware of issues related to synchronization there. I suspect that spectralnorm-3 would be mostly measuring barrier.SignalAndWait() and that would scale worse with number of processors but 700x seems odd. Would have to try a repro to see.

@kouvel
Copy link
Member

kouvel commented Dec 5, 2018

If anyone has ready access to an Ubuntu arm64 machine, could you please run the following code? Just want to make sure the thread pool is also getting the correct number of processors, otherwise it would explain why it would be so slow.

        Console.WriteLine($"ProcessorCount: {Environment.ProcessorCount}");
        int w, c;
        ThreadPool.GetMinThreads(out w, out c);
        Console.WriteLine($"ThreadPool min thread counts: {w}, {c}");

@AndyAyersMS
Copy link
Member

Last line is w & c

ProcessArchitecture = Arm64
IntPtrSize = 8
ProcessorAffinity = 00000000FFFFFFFF
ProcessorCount = 46
46 46

@kouvel
Copy link
Member

kouvel commented Dec 5, 2018

Thanks, that looks correct, I'll take a closer look.

@danmoseley
Copy link
Member

@adamsitnik it might make sense to move this to a new issue

jlennox referenced this issue in jlennox/corefx Dec 16, 2018
The function was incorrectly using unsigned int constant 1 as a value
that is shifted as a mask for each processor present or-ed to the final
mask. So on machines with more cores than 32, it was returning max 32
set bits.
@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants