Print gpus used on simulator startup #5611
Conversation
Force-pushed from d33b381 to b754c7f
Jenkins build this hipify please
Something about this PR causes the build system to not hipify anything, not sure why.
Looks good. I just have a question on the general assumptions here.
int deviceCount = -1;
OPM_GPU_SAFE_CALL(cudaGetDeviceCount(&deviceCount));

const auto deviceId = mpiRank % deviceCount;
This looks like there is an association of MPI ranks to GPU devices.
Out of curiosity: how are we enforcing this association?
Basically, we are assuming that e.g. ranks 0, 4, 8, 12, ... are on the same node if we have four GPUs in the calculation. There is of course some control over where MPI ranks are started, but it is actually really hard to get right. So the default would be that we need to detect which ranks are near the GPU (read: on the same node) and make them use that GPU to reduce latency. Probably there is some mechanism for that...
Taking the rank modulo the local device count would mean mapping, for instance, ranks {0, 1, 2, 3} to the same node if we have four GPUs that we want to use, not {0, 4, 8, 12}? In the latter case the result of the modulo operation would always be 0 and those four processes would just use device 0.
My bad. I should give back my degrees.
My point was that I doubt one can assume anything about the placement of the MPI ranks at all (unless it is enforced from the outside). Hence a rank might end up associated with a GPU on a distant node, even if the rank number suggests it is close. Depending on the interconnect that might make a difference. But I guess you are aware of this.
Seems like you can actually determine local ranks by creating communicators with MPI_COMM_TYPE_SHARED. Then you can get an "on-node" communicator and do the association in a multi-node-safe manner. See https://medium.com/@jeffrey_91423/binding-to-the-right-gpu-in-mpi-cuda-programs-263ac753d232
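As a hedged illustration of that approach (not code from this PR; the function and variable names are made up, and it assumes MPI plus the CUDA runtime are available):

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

// Choose a GPU based on the node-local rank obtained from an
// MPI_COMM_TYPE_SHARED ("on-node") communicator, so the mapping works
// regardless of how global ranks are distributed across nodes.
int selectDeviceForThisRank()
{
    MPI_Comm nodeComm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                        /*key=*/0, MPI_INFO_NULL, &nodeComm);

    int localRank = 0;
    MPI_Comm_rank(nodeComm, &localRank);
    MPI_Comm_free(&nodeComm);

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount <= 0) {
        return -1; // no usable device on this node
    }

    const int deviceId = localRank % deviceCount;
    cudaSetDevice(deviceId);
    return deviceId;
}
```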
For GPU runs, I think the only thing that makes sense is to use N processes, where N is the number of GPUs (nodes * gpusPerNode), and then filling up the processes in a consecutive way (node1 has {p0, p1, p2, p3}, node2 has {p4, p5, ...}, and so on) will work just fine for now. This is probably not that hard to do with Slurm, which I hope we will use in the multi-GPU case in the foreseeable future? Otherwise I can create an issue requesting more functionality for selecting GPUs, so we keep this in mind and handle it later.
Just to add to this: the current way we are assigning the GPUs is very rudimentary and might not yield the best performance (especially for two-socket/CPU NUMA server nodes with multiple GPUs), nor is it guaranteed that the MPI ranks will map that nicely to different nodes/GPUs unless you control the submission. This is also why I wrote:
// Now do a round robin kind of assignment
// TODO: We need to be more sophistacted here. We have no guarantee this will pick the correct device.
(typo is highly intentional!)
The simple reason this hasn't been developed or investigated further is that we are not really at a point where running over multiple nodes with multiple GPUs is relevant. However, we are applying for access to LUMI-G, where this will be looked into. The current solution "works well enough" for the testing we are currently doing.
Some of the headers are not found. Maybe they are missing from the PR?
No, that is not the problem; I have explained this to Tobias in other channels. The problem is that we try to pull HIP headers into simulator objects, and the simulator objects do not depend on the library, hence the headers have not been hipified.
Force-pushed from b754c7f to a3ab5ae
Jenkins build this hip please
Jenkins build this please
Mostly a stylistic comment, but the catch-all is probably something that should be avoided.
I am not a fan of this name. First, this file includes two functions, one of which has nothing to do with setting the device; secondly, what "indirection" refers to is a bit unclear (and for a user of said file, maybe not needed). May I suggest renaming it to device_management.hpp and then including this file in the main file?

The file now named set_device.hpp should probably be moved to detail (and renamed to something like device_management), as you don't want this file to be included from outside gpuistl (and the detail namespace signals this).
done
}
catch(...){}
I don't think it's a good idea to have a catch-all here. At the very least, catch just std::runtime_error, which is what OPM_GPU_SAFE_CALL will throw, but it is probably a better idea to just call OPM_GPU_WARN_IF_ERROR and remove the whole try/catch.
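For illustration, a fragment-level sketch of that suggestion (the exact call being wrapped, cudaSetDevice here, is an assumption; OPM_GPU_WARN_IF_ERROR is the macro named above, with its include omitted):

```cpp
// Before: a catch-all that silently swallows any failure.
// try {
//     OPM_GPU_SAFE_CALL(cudaSetDevice(deviceId));
// } catch (...) {
// }

// After: report a failure as a warning and continue; no try/catch needed.
OPM_GPU_WARN_IF_ERROR(cudaSetDevice(deviceId));
```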
done
else {
    out << "GPU: " << props.name << ", Compute Capability: " << props.major << "." << props.minor << std::endl;
}
OpmLog::info(out.str());
This is probably a lot more readable with fmt::format, to avoid the stringstream.
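For illustration, the quoted lines could read roughly like this with fmt (a sketch assuming <fmt/format.h> is available; props and OpmLog are taken from the diff above):

```cpp
#include <fmt/format.h>

// One formatted string instead of building the message in a stringstream.
OpmLog::info(fmt::format("GPU: {}, Compute Capability: {}.{}",
                         props.name, props.major, props.minor));
```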
updated
Force-pushed from d13d485 to 8474b41
The PR is now updated and I have tried to use the deferred logger, which I think is the right choice when we want output from all MPI ranks. This has not been tested on a machine with multiple GPUs or on NVIDIA cards, which should be done before merging.
Force-pushed from 125a664 to 344c787
Force-pushed from 344c787 to 964844a
Jenkins build this please
This looks good now.
Heads up: Merging this may or may not have broken the build. At least we're getting the following regression test failure following this PR: https://ci.opm-project.org/job/opm-simulators/2425/testReport/junit/(root)/mpi-hipify/outputdir/
I have discussed this with @akva2; the issue is caused by compiling with HIP when you do not have an AMD GPU. He will look into it, as I have some other commitments today.
Maybe I'm missing something, but the error message from HIP is just printed as a warning?
The problem is that it somehow leads to a div by zero. I've just set up ROCm on my box, so I'll find it shortly.
I'm a bit stumped, since Jenkins passed on this PR before I had accepted it, and the callstack looks like it's coming from somewhere completely different? But I'll wait for your investigation.
Jenkins doesn't run the hipify build in the normal trigger, so we only saw it in the post-builder. It's fixed by #5705.
In particular, cudaGetDeviceCount doesn't set the result variable to 0 if it fails to find any device, so we keep the initial -1 value and hence did not get the div-by-zero. hipGetDeviceCount(..) however does set it to 0, hence we got the div-by-zero.
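A small, hedged sketch of the kind of guard that avoids the div-by-zero regardless of which runtime is used (illustrative only; this is not the actual fix in #5705, and the function name is made up):

```cpp
#include <cuda_runtime.h>

// Return the device id for this rank, or -1 if no device is usable.
// cudaGetDeviceCount may leave 'deviceCount' untouched on failure, while
// hipGetDeviceCount sets it to 0, so guard against both before the modulo.
int chooseDeviceId(int mpiRank)
{
    int deviceCount = -1;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount <= 0) {
        return -1;
    }
    return mpiRank % deviceCount;
}
```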
Slight improvement in use of gpuISTL to show what device is used in the simulation
TODO: make sure we get the output from all the ranks