Print gpus used on simulator startup #5611
Conversation
Force-pushed from d33b381 to b754c7f
Jenkins build this hipify please
Something about this PR causes the build system to not hipify anything, not sure why.
Looks good. I just have a question on the general assumptions here.
int deviceCount = -1;
OPM_GPU_SAFE_CALL(cudaGetDeviceCount(&deviceCount));

const auto deviceId = mpiRank % deviceCount;
This looks like there is an association of MPI ranks to GPU devices.
Out of curiosity: how are we enforcing this association?
Basically, we are assuming that e.g. ranks 0, 4, 8, 12, ... are on the same node if we have four GPUs in the calculation. There is of course some control over where MPI ranks are started, but it is actually really hard to get right. So the default would be that we need to detect which ranks are near the GPU (read: on the same node) and make them use that GPU to reduce latency. Probably there is some mechanism for that...
Taking the rank modulo the local device count would mean mapping, for instance, ranks {0, 1, 2, 3} to the same node if we have four GPUs that we want to use, not {0, 4, 8, 12}? In the latter case the result of the modulo operation would always be 0 and those four processes would just use device 0.
My bad. I should give back my degrees.
My point was that I doubt one can assume anything about the placement of the MPI ranks at all (unless it is enforced from the outside). Hence a rank might end up associated with a GPU on a distant node, even if the rank number suggests it is close. Depending on the interconnect that might make a difference. But I guess you are aware of this.
Seems like you can actually determine local ranks by creating communicators with MPI_COMM_TYPE_SHARED. Then you can get an "on-node" communicator and do the association in a multi-node-safe manner. See https://medium.com/@jeffrey_91423/binding-to-the-right-gpu-in-mpi-cuda-programs-263ac753d232
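As a hedged illustration of that approach (not code from this PR; the function and variable names are made up, and it assumes MPI plus the CUDA runtime are available):

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

// Choose a GPU based on the node-local rank obtained from an
// MPI_COMM_TYPE_SHARED ("on-node") communicator, so the mapping works
// regardless of how global ranks are distributed across nodes.
int selectDeviceForThisRank()
{
    MPI_Comm nodeComm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                        /*key=*/0, MPI_INFO_NULL, &nodeComm);

    int localRank = 0;
    MPI_Comm_rank(nodeComm, &localRank);
    MPI_Comm_free(&nodeComm);

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount <= 0) {
        return -1; // no usable device on this node
    }

    const int deviceId = localRank % deviceCount;
    cudaSetDevice(deviceId);
    return deviceId;
}
```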
For GPU runs, I think the only thing that makes sense is to use N processes, where N is the number of GPUs (nodes * gpusPerNode), and then filling up the processes in a consecutive way (node1 has {p0, p1, p2, p3}, node2 has {p4, p5, ...}, and so on) will work just fine for now. This is probably not that hard to do with Slurm, which I hope we will use in the multi-GPU case in the foreseeable future? Otherwise I can create an issue requesting more functionality for selecting GPUs, so we keep this in mind and handle it later.
Just to add to this: the current way we are assigning the GPUs is very rudimentary and might not yield the best performance (especially for two-socket/CPU NUMA server nodes with multiple GPUs), nor is it guaranteed that the MPI ranks will map that nicely to different nodes/GPUs unless you control the submission. This is also why I wrote:
// Now do a round robin kind of assignment
// TODO: We need to be more sophistacted here. We have no guarantee this will pick the correct device.
(typo is highly intentional!)
The simple reason this hasn't been developed or investigated further is that we are not really at a point where running over multiple nodes with multiple GPUs is relevant. However, we are applying for access to LUMI-G, where this will be looked into. The current solution "works well enough" for the testing we are currently doing.
Some of the headers are not found. Maybe they are missing from the PR?
No, that is not the problem; I have explained this to Tobias in other channels. The problem is that we try to pull HIP headers into simulator objects, and the simulator objects do not depend on the library, hence the headers have not been hipified.
Force-pushed from b754c7f to a3ab5ae
Jenkins build this hip please
Jenkins build this please
Mostly a stylistic comment, but the catch-all is probably something that should be avoided.
I am not a fan of this name. First, this file includes two functions, one of which has nothing to do with setting the device; secondly, what "indirection" refers to is a bit unclear (and for a user of said file, maybe not needed). May I suggest renaming it to device_management.hpp and then including this file in the main file?

The file now named set_device.hpp should probably be moved to detail (and renamed to something like device_management), as you don't want this file to be included from outside gpuistl (and the detail namespace signals this).
done
}
catch(...){}
I don't think it's a good idea to have a catch-all here. At the very least, catch just std::runtime_error, which is what OPM_GPU_SAFE_CALL will throw, but it is probably a better idea to just call OPM_GPU_WARN_IF_ERROR and remove the whole try/catch.
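For illustration, a fragment-level sketch of that suggestion (the exact call being wrapped, cudaSetDevice here, is an assumption; OPM_GPU_WARN_IF_ERROR is the macro named above, with its include omitted):

```cpp
// Before: a catch-all that silently swallows any failure.
// try {
//     OPM_GPU_SAFE_CALL(cudaSetDevice(deviceId));
// } catch (...) {
// }

// After: report a failure as a warning and continue; no try/catch needed.
OPM_GPU_WARN_IF_ERROR(cudaSetDevice(deviceId));
```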
done
else {
    out << "GPU: " << props.name << ", Compute Capability: " << props.major << "." << props.minor << std::endl;
}
OpmLog::info(out.str());
This is probably a lot more readable with fmt::format, to avoid the stringstream.
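For illustration, the quoted lines could read roughly like this with fmt (a sketch assuming <fmt/format.h> is available; props and OpmLog are taken from the diff above):

```cpp
#include <fmt/format.h>

// One formatted string instead of building the message in a stringstream.
OpmLog::info(fmt::format("GPU: {}, Compute Capability: {}.{}",
                         props.name, props.major, props.minor));
```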
updated
Force-pushed from d13d485 to 8474b41
The PR is now updated and I have tried to use the deferred logger, which I think is the right choice when we want output from all MPI ranks. This has not been tested on a machine with multiple GPUs or on NVIDIA cards, which should be done before merging.
Force-pushed from 125a664 to 344c787
Force-pushed from 344c787 to 964844a
Jenkins build this please
This looks good now.
Heads up: Merging this may or may not have broken the build. At least we're getting the following regression test failure following this PR: https://ci.opm-project.org/job/opm-simulators/2425/testReport/junit/(root)/mpi-hipify/outputdir/
I have discussed this with @akva2; the issue is caused by compiling with HIP when you do not have an AMD GPU. He will look into it, as I have some other commitments today.
Maybe I'm missing something, but the error message from HIP is just printed as a warning?
The problem is that it somehow leads to a div by zero. I've just set up ROCm on my box, so I'll find it shortly.
I'm a bit stumped, since Jenkins passed on this PR before I had accepted it, and the callstack looks like it's coming from somewhere completely different? But I'll wait for your investigation.
Jenkins doesn't run the hipify build in the normal trigger, so we only saw it in the post-builder. It's fixed by #5705.
In particular, cudaGetDeviceCount doesn't set the result variable to 0 if it fails to find any device, so we keep the initial -1 value and hence did not get the div-by-zero. hipGetDeviceCount(..) however does set it to 0, hence we got the div-by-zero.
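A small, hedged sketch of the kind of guard that avoids the div-by-zero regardless of which runtime is used (illustrative only; this is not the actual fix in #5705, and the function name is made up):

```cpp
#include <cuda_runtime.h>

// Return the device id for this rank, or -1 if no device is usable.
// cudaGetDeviceCount may leave 'deviceCount' untouched on failure, while
// hipGetDeviceCount sets it to 0, so guard against both before the modulo.
int chooseDeviceId(int mpiRank)
{
    int deviceCount = -1;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount <= 0) {
        return -1;
    }
    return mpiRank % deviceCount;
}
```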
Slight improvement in use of gpuISTL to show what device is used in the simulation
TODO: make sure we get the output from all the ranks