-
Notifications
You must be signed in to change notification settings - Fork 200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix stream_ordered_memory_resource attempt to record event in stream from another device #1333
Fix stream_ordered_memory_resource attempt to record event in stream from another device #1333
Conversation
/ok to test |
/ok to test |
1 similar comment
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @harrism. These changes look correct to me, I mostly have an additional query about what guarantees we want RMM to be providing in a multi-device setting wrt user management of device context switching.
Rationale: I still think there are circumstances which one might expect to work but which do not with these changes, and I suppose we should decide if we want to make those work, or document that they do not, and why.
rmm::device_buffer buf_b(16, rmm::cuda_stream_per_thread, mrs[1].get()); | ||
} | ||
|
||
RMM_CUDA_TRY(cudaSetDevice(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the user forgets to call cudaSetDevice(0)
here then ~buf_a()
runs with device one's context active. In this case both with rmm::cuda_stream_default
and rmm::cuda_stream_per_thread
, the context attached to the stream will not match the context active when the event was created.
Do we want to handle that case transparently in this PR? I think no is fine, but if not we should add some documentation to the pool memory resource noting that it is the user's responsibility to ensure that in a multi-device setting, destructors are run with the current device set to the device of the allocating mr
.
I think the same comment also applies to the allocation. Edit: not necessary, since the docs already make use of an MR with a different device active UB.
Do we want to support this pattern below? Edit: No, this usage of a memory resource created with device-zero active and then used with device-one active is explicitly called out as undefined behaviour in the documentation.
{
cudaSetDevice(0);
{
cudaSetDevice(1);
rmm::device_uvector<int> a(10, rmm::cuda_stream_default, mrs[0].get());
rmm::device_uvector<int> b(20, rmm::cuda_stream_default, mrs[1].get());
}
cudaSetDevice(0);
rmm::device_uvector<int> c{0, rmm::cuda_stream_default, mrs[0].get()};
}
Note use of cuda_stream_default
(rather than stream_per_thread
). Here, the user has first used the mr
created with device-zero active to allocate a buffer with device-one active. So first time through the stream-event map gets populated with an event that was created with the device-one context active. This is then picked up when c
is destructed, but now the device-zero context is active and we get an error again.
Again, I think it's fine to punt on fixing this in this PR, but we should at least document the restrictions on the Edit: this is mentioned as UB in the docs.mr
object (which in this case would be that allocation must run with the device that the mr was created on active).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to spend some time thinking about this. We may want our RAII data structures to ensure they activate the device they were allocated on in their destructors...
Or we may not. That may forcing performance costs on the user under the covers when they might not need the help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Given your comment below about UB in the docs, I've updated my comment to answer some of the queries: we don't need to handle device switching in allocation, and my usage example is invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we may not. That may forcing performance costs on the user under the covers when they might not need the help.
Yeah, I am not sure which side of then fence I am on here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about this, and discussing with @jrhemstad , I am inclined to leave this as UB for now. I don't think we want to be switching devices back and forth every time we mr->deallocate()
. I'm not confident that this can be done without a large impact on performance, and there will be situations where the user has more information and can optimize device switching.
If it becomes a problem for containers, perhaps we can add it at the allocator
level, rather than the memory_resource
level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this makes sense.
Suggestion (non-blocking): Shall we add a sentence explicitly mentioning destructors to the "Multiple devices" section of the documentation.
Perhaps after the existing first paragraph:
A device_memory_resource should only be used when the active CUDA device is the same device that was active when the device_memory_resource was created. Otherwise behavior is undefined.
"Note that this means that destructors must run with the same CUDA device active as used when creating their associated memory resource."
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Let me add something along these lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added doc and example.
include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp
Outdated
Show resolved
Hide resolved
The cpp builds failed due to (I think) conda-forge/rhash-feedstock#25. It looks like an updated cmake conda package is now built so I have rerun the tests 🤞 |
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @harrism, I've dismissed my requested changes, and propose one non-blocking suggestion to add an explicit sentence about destructors in the multiple-device setting to the docs. (My review has no power here though, since not a cpp-codeowner...)
@wence- Thank you for your attention to detail! Even if you are not a cpp-codeowner, since you reviewed so carefully, I count your approval toward the 2 C++ approvals we require. |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
/ok to test |
/merge |
include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The extra shared_ptr
copy isn't the end of the world, so you could always address that later.
…Device in RAII class.
/ok to test |
/merge |
Description
As described in #1306 , stream_ordered_memory_resource maintained thread-local storage of an event to synchronize in the per-thread default stream. However, this caused the event to be recorded on the wrong stream if allocations were made on memory resources associated with different devices, because allocation on the first device on the PTDS would initialize the TLS for that stream, and subsequent device pools would try to use the already initialized TLS.
This PR adds a new test that only runs on multidevice systems (more correctly, does nothing on single device systems). The test creates two per-device pools, and creates and destroys a device buffer on each.
It also fixes
stream_ordered_memory_resource
to store the ID of the device that is current when it is constructed, and then to store a vector of one event per device in TLS rather than a single event. When the PTDS is passed toget_event
, the event for the MR's stored device ID is used. This should correctly support PTDS with multiple threads and multiple devices (still one MR per device).The PR also includes some changes to the device ID utilities in
cuda_device.hpp
. There is a new RAII device helper class, and aget_num_cuda_devices()
function.Finally, there is a small addition to the .clang-tidy to disable warnings about
do...while
loops inside of RMM error checking macros.Checklist