-
Notifications
You must be signed in to change notification settings - Fork 871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issues with CUDA accelerator component initialization #11831
Comments
I've been trying to use 5.0rc's on a cluster with CUDA devices and reported a bug previously about broken sm+ob1 support. How does CUDA support keep breaking in 5.0rc's? Is there CI for this? |
@BenWibking Is this the issue you are referring to? #10871 |
I was referring to #11399, which is now fixed. |
@devreal I thought this was addressed with #11297 I don't think this is the intended behavior. The delayed_init was introduced specifically to avoid this scenario. @BenWibking No CI atm, at least in external MTT. |
We did discuss this ticket in the last meeting and potential resolutions. The main reason for pr ae98e04 was the change to the coll/cuda component (i.e. coll/cuda being always compiled and wanting to make sure that it disqualifies itself if no GPUs are present). That change to coll/cuda has however not been ported to 5.0. So in theory if we want we can simply revert this pr on 5.0 without any consequences. |
Possibly the change I made to cuda_hmem_verify_devices() in this libfabric PR would be a helpful direction to explore: ofiwg/libfabric#9170 |
As long as any component requires the device component to be up during its initialization (like ob1) the delayed init of the CUDA component is useless (not sure how that ever worked). One fix would be to delay the comm component's device setup (i.e., streams) until we see the first communication. I guess this was the intent of #11253 but it missed the stream creation. |
Discussed on developers call today. @bosilca reminded us that this issue should really target milestone 5.0.1, as the delayed init code is in main but not v5.0.x branch. I'm changing the target tags accordingly. |
We're working with the CUDA accelerator component and tried to rebase my somewhat outdated branch to current main. I believe I found an issue with the way the CUDA component is initialized: Since ae98e04 we call
cuInit
inaccelerator_cuda_init
but do not set a context. Then in every call toopal_accelerator_cuda_delayed_init
henceforth (until the first call to a CUDA function by the application) we receive a NULL context fromcuCtxGetCurrent
and return an error (https://github.com/open-mpi/ompi/blob/main/opal/mca/accelerator/cuda/accelerator_cuda_component.c#L146). That prevents all other accelerator-related state in OMPI from properly initializing. On this particular system, at leastsmcuda
(mca_btl_smcuda_accelerator_init
) and ob1 (mca_pml_ob1_accelerator_init
) do not enable accelerator support because they cannot create a stream, unless the application does call into CUDA before callingMPI_Init
(because there will be a CUDA context in that case). Is this what we want?Interestingly, before ae98e04 we would not return an error from
opal_accelerator_cuda_delayed_init
(becausecuCtxGetCurrent
returned an error code) and so the accelerator support would work properly.I believe the same behavior exists in the 5.x release branch.
The text was updated successfully, but these errors were encountered: