
Fix stream_ordered_memory_resource attempt to record event in stream from another device #1333

Merged: 14 commits, Sep 13, 2023

Conversation

@harrism (Member) commented Aug 31, 2023

Description

As described in #1306, stream_ordered_memory_resource maintained thread-local storage of an event used to synchronize on the per-thread default stream (PTDS). This caused the event to be recorded on the wrong stream when allocations were made on memory resources associated with different devices: the first allocation on the PTDS would initialize the TLS with an event for that device, and pools on other devices would then reuse that already-initialized TLS event.

This PR adds a new test that only runs on multi-device systems (more precisely, it does nothing on single-device systems). The test creates two per-device pools, and creates and destroys a device buffer on each.

It also fixes stream_ordered_memory_resource to store the ID of the device that is current when it is constructed, and then to store a vector of one event per device in TLS rather than a single event. When the PTDS is passed to get_event, the event for the MR's stored device ID is used. This should correctly support PTDS with multiple threads and multiple devices (still one MR per device).
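
To illustrate that scheme, here is a minimal sketch under simplified assumptions (the names stream_ordered_mr_sketch and ptds_event are made up; this is not RMM's exact implementation):

  // The MR remembers the device that was current at construction, and TLS
  // holds one event per device rather than a single event for all devices.
  #include <cuda_runtime_api.h>
  #include <vector>

  class stream_ordered_mr_sketch {
    int device_id_{};

   public:
    stream_ordered_mr_sketch() { cudaGetDevice(&device_id_); }

    // Called when work is issued on the per-thread default stream. Per RMM's
    // documented rules, the MR's device must be active when this is called.
    cudaEvent_t ptds_event() const
    {
      thread_local std::vector<cudaEvent_t> events = [] {
        int count{};
        cudaGetDeviceCount(&count);
        return std::vector<cudaEvent_t>(static_cast<std::size_t>(count), nullptr);
      }();
      auto& event = events[device_id_];
      if (event == nullptr) { cudaEventCreateWithFlags(&event, cudaEventDisableTiming); }
      return event;  // the event for this MR's device, not whichever device touched TLS first
    }
  };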

The PR also includes some changes to the device ID utilities in cuda_device.hpp. There is a new RAII device helper class, and a get_num_cuda_devices() function.
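
As a rough illustration of the RAII helper idea (the class name below is hypothetical; see cuda_device.hpp for the actual API):

  #include <cuda_runtime_api.h>

  // Switches the current CUDA device for the lifetime of the object and
  // restores the caller's device on scope exit.
  class scoped_cuda_device {
    int old_device_{};

   public:
    explicit scoped_cuda_device(int new_device)
    {
      cudaGetDevice(&old_device_);
      cudaSetDevice(new_device);
    }
    ~scoped_cuda_device() { cudaSetDevice(old_device_); }

    scoped_cuda_device(scoped_cuda_device const&)            = delete;
    scoped_cuda_device& operator=(scoped_cuda_device const&) = delete;
  };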

Finally, there is a small addition to the .clang-tidy to disable warnings about do...while loops inside of RMM error checking macros.
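
For context, error-checking macros conventionally wrap their bodies in do { ... } while (0) so that the macro expands to a single statement (for example inside an un-braced if/else); that is the pattern the do-while check flags. A simplified example of the shape involved (not RMM's actual macro):

  #include <cstdio>
  #include <cuda_runtime_api.h>

  // Report CUDA errors; the do/while(0) wrapper makes the macro usable
  // anywhere a single statement is expected.
  #define CHECK_CUDA(call)                                                  \
    do {                                                                    \
      cudaError_t const check_cuda_status = (call);                         \
      if (check_cuda_status != cudaSuccess) {                               \
        std::fprintf(                                                       \
          stderr, "CUDA error: %s\n", cudaGetErrorString(check_cuda_status)); \
      }                                                                     \
    } while (0)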

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@harrism added labels on Aug 31, 2023: bug (Something isn't working), 3 - Ready for review (Ready for review by team), non-breaking (Non-breaking change)
@harrism harrism self-assigned this Aug 31, 2023
@harrism harrism requested a review from a team as a code owner August 31, 2023 07:23
@harrism harrism requested review from cwharris and jrhemstad August 31, 2023 07:23
@copy-pr-bot (bot) commented Aug 31, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@harrism harrism requested a review from wence- August 31, 2023 07:23
@github-actions github-actions bot added the cpp Pertains to C++ code label Aug 31, 2023
@harrism (Member, Author) commented Aug 31, 2023

/ok to test

@harrism (Member, Author) commented Aug 31, 2023

/ok to test

@harrism (Member, Author) commented Aug 31, 2023

/ok to test

@wence- (Contributor) left a comment

Thanks @harrism. These changes look correct to me, I mostly have an additional query about what guarantees we want RMM to be providing in a multi-device setting wrt user management of device context switching.

Rationale: I still think there are circumstances which one might expect to work but which do not with these changes, and I suppose we should decide if we want to make those work, or document that they do not, and why.

tests/mr/device/pool_mr_tests.cpp (resolved review thread)
rmm::device_buffer buf_b(16, rmm::cuda_stream_per_thread, mrs[1].get());
}

RMM_CUDA_TRY(cudaSetDevice(0));
@wence- (Contributor) commented Aug 31, 2023

If the user forgets to call cudaSetDevice(0) here then ~buf_a() runs with device one's context active. In this case both with rmm::cuda_stream_default and rmm::cuda_stream_per_thread, the context attached to the stream will not match the context active when the event was created.

Do we want to handle that case transparently in this PR? I think no is fine, but if not we should add some documentation to the pool memory resource noting that it is the user's responsibility to ensure that in a multi-device setting, destructors are run with the current device set to the device of the allocating mr.

I think the same comment also applies to the allocation. Edit: not necessary, since the docs already state that using an MR while a different device is active is UB.

Do we want to support this pattern below? Edit: No, this usage of a memory resource created with device-zero active and then used with device-one active is explicitly called out as undefined behaviour in the documentation.

  {
    cudaSetDevice(0);
    {
      cudaSetDevice(1);
      // device 1 is active, but a uses the MR that was created with device 0 active
      rmm::device_uvector<int> a(10, rmm::cuda_stream_default, mrs[0].get());
      rmm::device_uvector<int> b(20, rmm::cuda_stream_default, mrs[1].get());
    }  // a and b are destroyed here, still with device 1 active
    cudaSetDevice(0);
    // reuses mrs[0]'s stream-event entry, whose event was created with device 1 active
    rmm::device_uvector<int> c{0, rmm::cuda_stream_default, mrs[0].get()};
  }  // c is destroyed with device 0 active, so the recorded event belongs to the wrong context

Note use of cuda_stream_default (rather than stream_per_thread). Here, the user has first used the mr created with device-zero active to allocate a buffer with device-one active. So first time through the stream-event map gets populated with an event that was created with the device-one context active. This is then picked up when c is destructed, but now the device-zero context is active and we get an error again.

Again, I think it's fine to punt on fixing this in this PR, but we should at least document the restrictions on the mr object (which in this case would be that allocation must run with the device that the mr was created on active). Edit: this is mentioned as UB in the docs.

@harrism (Member, Author):

I'd like to spend some time thinking about this. We may want our RAII data structures to ensure they activate the device they were allocated on in their destructors...

Or we may not. That may force performance costs on the user under the covers when they might not need the help.

@wence- (Contributor):

Thanks. Given your comment below about UB in the docs, I've updated my comment to answer some of the queries: we don't need to handle device switching in allocation, and my usage example is invalid.

@wence- (Contributor):

> Or we may not. That may force performance costs on the user under the covers when they might not need the help.

Yeah, I am not sure which side of the fence I am on here.

@harrism (Member, Author):

After thinking about this, and discussing with @jrhemstad , I am inclined to leave this as UB for now. I don't think we want to be switching devices back and forth every time we mr->deallocate(). I'm not confident that this can be done without a large impact on performance, and there will be situations where the user has more information and can optimize device switching.

If it becomes a problem for containers, perhaps we can add it at the allocator level, rather than the memory_resource level.
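
For illustration only, the kind of transparent handling being deferred here would amount to something like the following wrapper around deallocation (a hypothetical sketch, not something this PR adds):

  #include <cstddef>
  #include <cuda_runtime_api.h>
  #include <rmm/cuda_stream_view.hpp>
  #include <rmm/mr/device/device_memory_resource.hpp>

  // Restore the MR's device around every deallocation. Deferred because the
  // extra device queries/switches on every free may cost more than callers
  // who already manage the active device want to pay.
  void deallocate_on_device(rmm::mr::device_memory_resource* mr,
                            int mr_device,
                            void* ptr,
                            std::size_t bytes,
                            rmm::cuda_stream_view stream)
  {
    int current{};
    cudaGetDevice(&current);
    if (current != mr_device) { cudaSetDevice(mr_device); }
    mr->deallocate(ptr, bytes, stream);
    if (current != mr_device) { cudaSetDevice(current); }
  }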

@wence- (Contributor):

Thanks, this makes sense.

Suggestion (non-blocking): Shall we add a sentence explicitly mentioning destructors to the "Multiple devices" section of the documentation?

Perhaps after the existing first paragraph:

> A device_memory_resource should only be used when the active CUDA device is the same device that was active when the device_memory_resource was created. Otherwise behavior is undefined.

add:

"Note that this means that destructors must run with the same CUDA device active as used when creating their associated memory resource."

WDYT?

@harrism (Member, Author):

Yes. Let me add something along these lines.

@harrism (Member, Author):

I added docs and an example.

tests/mr/device/pool_mr_tests.cpp (resolved review thread)
@wence- (Contributor) commented Aug 31, 2023

The cpp builds failed due to (I think) conda-forge/rhash-feedstock#25. It looks like an updated cmake conda package has now been built, so I have rerun the tests 🤞

@wence- (Contributor) commented Sep 5, 2023

/ok to test

@harrism (Member, Author) commented Sep 6, 2023

/ok to test

@wence- (Contributor) left a comment

Thanks @harrism, I've dismissed my requested changes, and propose one non-blocking suggestion to add an explicit sentence about destructors in the multiple-device setting to the docs. (My review has no power here though, since I am not a cpp-codeowner...)

@harrism (Member, Author) commented Sep 6, 2023

@wence- Thank you for your attention to detail! Even if you are not a cpp-codeowner, since you reviewed so carefully, I count your approval toward the 2 C++ approvals we require.

@harrism (Member, Author) commented Sep 7, 2023

/ok to test

@harrism added the 4 - Needs Reviewer (Waiting for reviewer to review or respond) label and removed the 3 - Ready for review (Ready for review by team) label on Sep 7, 2023
@bdice (Contributor) left a comment

I reviewed the issue, code, and prior PR discussions here. To the best of my knowledge, this looks like a good fix. I learned a fair bit while reviewing this PR, so thanks for the lessons @harrism @wence- !

README.md (outdated review thread, resolved)
harrism and others added 2 commits September 13, 2023 09:39
@harrism (Member, Author) commented Sep 12, 2023

/ok to test

@harrism (Member, Author) commented Sep 12, 2023

/merge

@harrism added the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label on Sep 12, 2023
@jrhemstad (Contributor) left a comment

The extra shared_ptr copy isn't the end of the world, so you could always address that later.

@harrism removed the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label on Sep 13, 2023
@harrism (Member, Author) commented Sep 13, 2023

/ok to test

@harrism (Member, Author) commented Sep 13, 2023

/merge

Labels: 4 - Needs Reviewer, bug, cpp, non-breaking
Projects: Done

Successfully merging this pull request may close this issue:

[QST] Is get_event()'s thread_local problematic?