[FEA] Relax restrictions for device_uvector destructor in multi-gpu code #1342
Comments
For context, this clarification in the readme was introduced as part of #1333. See in particular the discussion here: #1333 (comment). The example of a constructor throwing such that the destructor runs with a different device active is one we didn't explicitly discuss there.
I've been looking into adding this feature (to [...]). I would love suggestions, @fkallen @jrhemstad @wence-.
How about a resource adapter which checks the device id on allocation and deallocation?
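A minimal sketch of what such a checking adapter could look like. It is not RMM code: `fake_cuda`, `host_resource`, and `device_check_adapter` are hypothetical names, the "current device" is simulated with a plain integer so the example runs without a GPU, and a real adapter would query the CUDA runtime and follow RMM's actual resource interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Stand-in for the CUDA runtime's "current device" state (hypothetical;
// real code would call cudaGetDevice / cudaSetDevice instead).
namespace fake_cuda {
inline int& current_device() { static int id = 0; return id; }
inline int get_device() { return current_device(); }
inline void set_device(int id) {
  assert(id >= 0);  // device ordinals are non-negative
  current_device() = id;
}
}  // namespace fake_cuda

// Hypothetical upstream resource backed by host malloc/free.
struct host_resource {
  void* allocate(std::size_t bytes) { return std::malloc(bytes); }
  void deallocate(void* ptr, std::size_t /*bytes*/) { std::free(ptr); }
};

// Hypothetical adapter: remembers which device was active when it was
// constructed and refuses to allocate or deallocate under any other device.
template <typename Upstream>
class device_check_adapter {
 public:
  explicit device_check_adapter(Upstream& upstream)
    : upstream_{upstream}, device_id_{fake_cuda::get_device()} {}

  void* allocate(std::size_t bytes) {
    check_device("allocate");
    return upstream_.allocate(bytes);
  }

  void deallocate(void* ptr, std::size_t bytes) {
    check_device("deallocate");
    upstream_.deallocate(ptr, bytes);
  }

 private:
  // The check only reads the current device, so it does nothing undefined.
  void check_device(char const* operation) const {
    if (fake_cuda::get_device() != device_id_) {
      throw std::logic_error(std::string{operation} + ": wrong device is active");
    }
  }

  Upstream& upstream_;
  int device_id_;
};
```

One design question this sketch leaves open: deallocation is usually reached from destructors, where throwing is problematic, so a production version might prefer to log or abort on the deallocation path.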
I think @fkallen's suggestion is a good one. The UB occurs when we try to use a memory resource while the wrong device is active. So we can avoid carrying out UB by first checking whether what we're about to do would be undefined, and raising an error if so. As long as the check itself doesn't do anything undefined, this is fine. Since the behaviour is undefined in the library rather than in the language, the compiler cannot decide that the check is unnecessary (on the grounds that the subsequent line would invoke UB if the check did not pass).
@fkallen great idea!
This changes `device_buffer` to store the active CUDA device ID on creation, and (possibly temporarily) set the active device to that ID before allocating or freeing memory. It also adds tests for containers built on `device_buffer` (`device_buffer`, `device_uvector` and `device_scalar`) that ensure correct operation when the device is changed before doing things that alloc/dealloc memory for those containers. This fixes #1342.

HOWEVER, there is an important question yet to answer: `rmm::device_vector` is just an alias for `thrust::device_vector`, which does not use `rmm::device_buffer` for storage. However users may be surprised after this PR because the multidevice semantics of RMM containers will be different from `thrust::device_vector` (and therefore `rmm::device_vector`).

Update: opinion is that it's probably OK to diverge from `device_vector`, and some think we should remove `rmm::device_vector`. ~While we discuss this I have set the DO NOT MERGE label.~

Authors:
- Mark Harris (https://github.com/harrism)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)
- Jake Hemstad (https://github.com/jrhemstad)

URL: #1370
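The store-and-restore idea described in this PR can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in #1370: `fake_cuda`, `device_guard`, and `toy_buffer` are hypothetical names, the current device is simulated with an integer, and host `malloc`/`free` stand in for device allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Stand-in for the CUDA runtime's "current device" state (hypothetical;
// real code would call cudaGetDevice / cudaSetDevice instead).
namespace fake_cuda {
inline int& current_device() { static int id = 0; return id; }
inline int get_device() { return current_device(); }
inline void set_device(int id) {
  assert(id >= 0);  // device ordinals are non-negative
  current_device() = id;
}
}  // namespace fake_cuda

// Hypothetical RAII guard: makes `target` the active device for the guard's
// lifetime and restores the previously active device on scope exit.
class device_guard {
 public:
  explicit device_guard(int target)
    : previous_{fake_cuda::get_device()}, switched_{previous_ != target} {
    if (switched_) { fake_cuda::set_device(target); }
  }
  ~device_guard() {
    if (switched_) { fake_cuda::set_device(previous_); }
  }
  device_guard(device_guard const&)            = delete;
  device_guard& operator=(device_guard const&) = delete;

 private:
  int previous_;
  bool switched_;
};

// Hypothetical buffer: records the device active at construction and
// temporarily switches back to it before freeing its memory.
class toy_buffer {
 public:
  explicit toy_buffer(std::size_t bytes)
    : device_id_{fake_cuda::get_device()}, data_{std::malloc(bytes)} {}

  ~toy_buffer() {
    device_guard guard{device_id_};                // switch only if needed
    last_free_device() = fake_cuda::get_device();  // record for inspection
    std::free(data_);
  }  // guard restores the caller's device here

  toy_buffer(toy_buffer const&)            = delete;
  toy_buffer& operator=(toy_buffer const&) = delete;

  int device_id() const { return device_id_; }

  // Device that was active during the most recent free (-1 if none yet).
  static int& last_free_device() { static int id = -1; return id; }

 private:
  int device_id_;  // declared before data_ so it is initialized first
  void* data_;
};
```

With this shape, a caller who switches devices between construction and destruction still gets the free performed on the original device, and the caller's active device is unchanged after the destructor returns.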
Is your feature request related to a problem? Please describe.
According to the RMM readme, the following code contains undefined behavior because the device that is current when the `device_uvector` destructor runs may be different from the device that was active when the `device_uvector` was constructed.
The workaround is to destroy the uvector explicitly.
However, this may not be sufficient because the construction may throw (out-of-memory, for example). So one would need something like:
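A toy model of this failure mode, with no real CUDA or RMM calls, can make the exception path concrete. Everything in it is hypothetical: `fake_cuda` simulates the current device with an integer, and `naive_buffer` only counts destructions that happen under a device other than the one it was created on. It is a sketch of the pattern, not the snippet from the RMM readme.

```cpp
#include <cassert>
#include <new>

// Stand-in for the CUDA runtime's "current device" state (hypothetical;
// real code would call cudaGetDevice / cudaSetDevice instead).
namespace fake_cuda {
inline int& current_device() { static int id = 0; return id; }
inline int get_device() { return current_device(); }
inline void set_device(int id) {
  assert(id >= 0);  // device ordinals are non-negative
  current_device() = id;
}
}  // namespace fake_cuda

// Hypothetical container that does NOT restore its device on destruction.
struct naive_buffer {
  explicit naive_buffer(bool fail = false) : created_on{fake_cuda::get_device()} {
    if (fail) { throw std::bad_alloc{}; }  // simulated out-of-memory
  }
  ~naive_buffer() {
    if (fake_cuda::get_device() != created_on) { ++wrong_device_frees(); }
  }
  // Number of destructions that ran with the wrong device active.
  static int& wrong_device_frees() { static int n = 0; return n; }
  int created_on;
};

// Restores device 0 only on the normal path. If constructing `b` throws,
// `a` is unwound while device 1 is still active.
inline void without_workaround(bool fail_second) {
  fake_cuda::set_device(0);
  naive_buffer a;
  {
    fake_cuda::set_device(1);
    naive_buffer b{fail_second};
  }
  fake_cuda::set_device(0);
}

// Also restores device 0 on the exception path before rethrowing, so `a`
// is always destroyed under the device it was created on.
inline void with_workaround(bool fail_second) {
  fake_cuda::set_device(0);
  naive_buffer a;
  try {
    fake_cuda::set_device(1);
    naive_buffer b{fail_second};
  } catch (...) {
    fake_cuda::set_device(0);
    throw;
  }
  fake_cuda::set_device(0);
}
```

The second function is correct but easy to get wrong, since every code path that can throw after a device switch needs the same handling.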
Describe the solution you'd like
rmm::device_uvector should automatically set the correct device for deallocation.
Describe alternatives you've considered
Additional context