hwloc 1.11.7 issue on win2012 after enabling processor group #260
Hello. It looks like your coreinfo output didn't get uploaded correctly, so I have to guess. Dividing 6-core processors into groups of 4 might be a bad idea if Windows puts cores of different processors in the same group. That might be what happens here. Try with 3 instead of 4 first to see what happens. If Windows does strange things like the above, we may have to just ignore processor groups in such cases. Not sure how to detect this...
Hello, sorry, here is the coreinfo -g output. I have tested many group values: 2, 3, 4 and 6. If I set the group value to 2 or 4, hwloc has the above issue. For group values 3 and 6, setting them and rebooting has no effect on the Windows system, and I cannot find any error in the events. Our customer also tested this in their environment, with processor group value 8 on Windows 10, and it has the same issue. It seems processor groups are used to support more than 64 processors on the Windows platform, and I think this is important for MPI environments, although on a host which has fewer than 64 processors I agree there is no need to use processor groups.

C:\hwloc-win64-build-1.11.7\bin>coreinfo -g
Coreinfo v3.31 - Dump information on system CPU and memory topology
Logical Processor to Group Map:
To be clear, processor groups are already supported by hwloc. Microsoft verified this on a 160-core machine. If I remember correctly, we had 2 groups with 60 PUs each (3 sockets with 10 HT cores each), and a third group with 40 PUs (the last 2 sockets). Although I use bcdedit to test our processor group support on my small Windows machine, it should be used carefully, because Windows creates virtual groups that cross existing resource boundaries (group 1 spans 2 packages in your example). Not sure what happens with value 2, that one should work... Anyway, have you seen an issue on a real machine with processor groups that were not created by bcdedit?
Thanks for the information. I think my answer is no. We don't have such an environment. This issue was found by a customer, and their environment was configured with bcdedit too. To my understanding, we always need to use bcdedit to set the group size, even on a host which has more than 64 processors, unless we want to use the default group size.
OK. If that customer used bcdedit on a large machine where processor groups are required anyway, that becomes interesting. Can you get information about what groupsize was used on what kind of machine (processor model + how many of them)?
This is the best update we have from the customer describing the initial environment that shows the problem. If there is specific information that you need, please post it here and I will ask for it.

================
We have a Windows (10) workstation configured with 2 processor groups:

bcdedit.exe /set groupsize 8

They can be displayed using the coreinfo command:

Logical Processor to Group Map:
Group 1: All cores on the first socket are in group-0 and all cores on the

The application is then launched:

"C:\Program Files (x86)\IBM\Platform-MPI\bin\mpirun.exe" -prot

- which results in
Host 0 -- ip 192.168.119.125 -- ranks 0 - 15
host | 0
Prot - All Intra-node communication is: SHM
Host 0 -- ip 192.168.119.125 -- [0 1 2 3 4 5 6 7][8 9 10 11 12 13 14 15]
It seems like:
"Package (cpuset 0x0000000f,,0x0000000f)" seems to say that this socket was split across two processor groups. "L3 (cpuset 0x000000ff)" says that this L3 is inside a single processor group. They should have the same cpuset. To be sure, it would be good to set HWLOC_COMPONENTS=-x86 in the environment, in case our x86 backend doesn't like these Windows processor groups. As said above, "coreinfo -cgns" might help. Put the output between triple backquotes so that GitHub doesn't break the formatting. Also please post the output of lstopo with HWLOC_COMPONENTS=-x86 set. Regarding process binding, I need to know which hwloc API was used for binding. The output of "hwloc-info --support" would help too. Basically, Windows supports thread binding, but process binding is messy. You'll need hwloc 1.11.3 for "hwloc-info --support", but you should upgrade to 1.11.7 anyway; 1.11.1 is very old.
Hi Brice, currently we only have a two-socket system with 6 cores on each socket for testing. Only when I configure the processor group size to 4 can I reproduce the customer's issue. The result is the same with hwloc 1.11.1 and 1.11.7. Following your suggestion, I did the tests below using hwloc 1.11.7. Here are the results.

Before configuring processor groups, the coreinfo output:

C:\>coreinfo -cgns
Coreinfo v3.31 - Dump information on system CPU and memory topology
Logical to Physical Processor Map:
Logical Processor to Socket Map:
Logical Processor to NUMA Node Map:
Logical Processor to Group Map:

After configuring the processor group size to 4, here is the hwloc-info --support output (with the environment variable HWLOC_COMPONENTS=-x86 set):

C:\hwloc-win64-build-1.11.7\bin>hwloc-info --support

Here is the lstopo output and screenshot:

C:\hwloc-win64-build-1.11.7\bin>lstopo
Keyboard shortcuts:

The coreinfo output after setting the processor group size to 4:

C:\hwloc-win64-build-1.11.7\bin>coreinfo -cgns
Coreinfo v3.31 - Dump information on system CPU and memory topology
Logical to Physical Processor Map:
Logical Processor to Socket Map:
Logical Processor to NUMA Node Map:
Logical Processor to Group Map:
As already explained, groupsize 4 on a machine with 6-core sockets doesn't make sense, because 6 isn't divisible by 4. Add to that the fact that Windows creates the groups in a totally crazy way: see the "Logical Processor to Socket Map" section of your coreinfo output, which now shows 8 sockets, with sockets 0 and 6 spanning 2 different groups. Things shouldn't be that crazy on your customer's machine, since it divides 8-core processors into groups of 8. coreinfo -cgns would confirm that. By the way, the customer should also explain why they want to create such useless groups :/
More precisely, what does not make sense about groupsize 4 is that it ends up creating a group which spans the two sockets without including all of either. This is a very odd thing to do, since it artificially brings together 4 cores which are not actually related, as they are in different sockets.
I have asked for the "coreinfo -cgns" output, and will post it when I receive the information. The specific group sizes we are testing with are deliberately artificial, to take advantage of available machines with small(er) core counts while still testing multiple processor group configurations. If that is not generally supported, that is OK, but it would be nice to know, so that we can write an appropriate restriction for our product. Is testing with arbitrary processor group assignments a likely cause of the issues we are seeing?
Arbitrary processor group assignments are the cause of some of the issues above, but I can't say for sure that nothing else is broken. Let's take an example of a good assignment. If you have a machine with 4 sockets, with 6 cores each, and 2 hyperthreads per core, the obviously good group sizes are 1 (1 group per HT), 2 (1 group per core), 12 (1 group per socket), and 48 (1 group for the entire machine). Then you have intermediate sizes. All this assumes that groups contain consecutive hyperthreads and cores. As Samuel said, you don't want to "create a group which spans over the two sockets without including them all".
To put it in a mathematical formulation: you can choose a group size which, for every structural element of the architecture (NUMA node, socket, cache, core, etc.), divides or is a multiple of the number of logical processors of that element. So the simple case is to just take the number of logical processors of a given element (e.g. a socket). The less simple case is to take a group size which is a divisor of a given element (e.g. a socket) and a multiple of the element just below (e.g. an L3 cache).
Can we get the customer's "coreinfo -cgns" output for the group size 8 environment?

The coreinfo output is attached in the coreinfo.txt file. I also added the output from hwloc-ls.exe, which I hope will provide all the information necessary to determine why Platform MPI cannot bind the ranks when we have multiple processor groups. The system in question has 80 processing units (2 processor groups, 40 processing units per group).

Or why do they want to configure such small processor groups in a Windows HPC environment?

Please note we do not want to use small, non-standard processor group sizes. I only used the bcdedit method of changing the processor group size on my workstation because I thought it might be easier to reproduce the underlying problem of Platform MPI not being able to bind the ranks when we have more than one processor group.

What are the processor group sizes that are intended to be used with the application?

We intend to use the default size for a processor group, namely 64. The attached data was gathered on a test system (80 processing units) where we can reproduce the failure of Platform MPI to bind the ranks.
Everything looks good in coreinfo and hwloc-ls. You forgot "hwloc-info --support" for debugging the binding issue. But in the end, we'll need to know what Platform MPI uses for binding. Does it use hwloc_set_cpubind()? With which flags? Given that process binding is hard on Windows, they may have to fall back to thread binding if they want a good compromise. Thread binding may be enough if done early (before any other thread is started). Also, they'll have to check whether they bind before starting the application process (which requires binding to be inherited) or during MPI_Init in the application. Also, maybe check the binding with another tool; maybe Platform doesn't report binding correctly :) Have you seen Platform perform and report binding correctly on Windows in the past?
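Since the question is which hwloc call Platform MPI uses, here is a minimal sketch of the support check and thread binding being discussed, using the hwloc 1.x API (hwloc_topology_get_support() and hwloc_set_cpubind() with HWLOC_CPUBIND_THREAD). This is an illustration of the API, not Platform MPI's actual code, and it assumes the hwloc headers and library are installed.

```c
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* Report which kinds of CPU binding this OS actually supports;
     * this is what "hwloc-info --support" prints. */
    const struct hwloc_topology_support *support =
        hwloc_topology_get_support(topology);
    printf("set_thisproc_cpubind: %d\n",
           support->cpubind->set_thisproc_cpubind);
    printf("set_thisthread_cpubind: %d\n",
           support->cpubind->set_thisthread_cpubind);

    /* Bind the current thread to the first core. Thread binding is
     * the option that works on Windows with multiple processor groups,
     * since whole-process binding cannot cross groups. */
    hwloc_obj_t core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, 0);
    if (core && support->cpubind->set_thisthread_cpubind) {
        if (hwloc_set_cpubind(topology, core->cpuset, HWLOC_CPUBIND_THREAD))
            fprintf(stderr, "hwloc_set_cpubind failed\n");
    }

    hwloc_topology_destroy(topology);
    return 0;
}
```

Binding before the process starts (relying on inheritance) versus binding inside MPI_Init corresponds to calling this with HWLOC_CPUBIND_PROCESS on the launcher side versus HWLOC_CPUBIND_THREAD early in the application, as discussed above.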
Platform MPI can report binding on Windows correctly. We have asked for the PMPI binding output on the customer's production environment; they only gave us the output for the group size 8 environment before. We also asked them to try hwloc-bind, to find out whether it is a PMPI output issue. Before we get feedback, I want to add more information. On our current environment, I tried the group size 2 case (which Brice thinks should work). I know an arbitrary group size is not recommended; I did this just to check whether Windows accepts the setting, whether hwloc can detect the topology without error, and, under this condition, whether hwloc can bind correctly. I ran the following command, and it seems lstopo is bound to a random group. Checking with the Windows Task Manager gives me the same result.
"hwloc-info --support" says that process binding isn't supported here. Also, lstopo should show a single PU in green in node:0 according to your hwloc-bind command line. Instead, it shows 2 PUs in another group, because the process wasn't bound (the default Windows behavior is to assign a process to a random group and bind it to all cores of that group). Otherwise, the lstopo output looks good. hwloc cannot bind entire processes when there are multiple groups (issue #78), so missing process binding support is expected here. Also, we have issue #151 about hwloc-bind not working well on Windows, because it's not clear whether process and/or thread binding is inherited during execvp().
I am closing this old issue because processor group support has improved significantly in recent years, and we have ways to test it here but couldn't see any issue recently. If the bug still occurs, please open a new issue.
Here is the detailed information for this issue.
Test environment
Testing machine - Windows 2012 Standard version
hwloc - 1.11.7 win64
Steps
C:\hwloc-win64-build-1.11.7\bin>hwloc-ls
Machine (19GB total)
NUMANode L#0 (P#0 9848MB) + Package L#0 + L3 L#0 (12MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
NUMANode L#1 (P#1 9861MB) + Package L#1 + L3 L#1 (12MB)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
bcdedit.exe /set groupsize 4
bcdedit.exe /set groupaware on
C:\hwloc-win64-build-1.11.7\bin>coreinfo -g
Coreinfo v3.31 - Dump information on system CPU and memory topology
Copyright (C) 2008-2014 Mark Russinovich
Sysinternals - www.sysinternals.com
Logical Processor to Group Map:
Group 0:
Group 1:
Group 2:
C:\hwloc-win64-build-1.11.7\bin>hwloc-ls
Machine (19GB total)
NUMANode L#0 (P#0 9730MB) + L3 L#0 (12MB)
Package L#0
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
Package L#1 + L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
Group0 L#0
NUMANode L#1 (P#2) + L3 L#1 (12MB)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#64)
Package L#2 + L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#65)
NUMANode L#2 (P#3) + L3 L#2 (12MB)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#66)
Package L#3 + L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#67)
NUMANode L#3 (P#1 10175MB) + L3 L#3 (12MB)
Package L#4
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#128)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#129)
Package L#5 + L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#130)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#131)
C:\hwloc-win64-build-1.11.7\bin>hwloc-info
depth 0: 1 Machine (type #1)
depth 1: 1 Group0 (type #7)
depth 2: 4 NUMANode (type #2)
depth 3: 4 L3Cache (type #4)
depth 4: 6 Package (type #3)
depth 5: 12 L2Cache (type #4)
depth 6: 12 L1dCache (type #4)
depth 7: 12 L1iCache (type #4)
depth 8: 12 Core (type #5)
depth 9: 12 PU (type #6)
C:\hwloc-win64-build-1.11.7\bin>hwloc-calc.exe node:1.core:0
0x00000001,,0x0
C:\hwloc-win64-build-1.11.7\bin>hwloc-calc.exe node:1.core:1
0x00000002,,0x0