
[BUG] multibyte_split implicitly synchronizes all streams #11929

Closed
vyasr opened this issue Oct 14, 2022 · 5 comments · Fixed by #11875
Assignees
Labels
bug Something isn't working

Comments

@vyasr
Contributor

vyasr commented Oct 14, 2022

Describe the bug
Currently multibyte_split uses a stream pool to divide up work between multiple streams. However, output_builder.inplace_resize still resizes an rmm::device_uvector on CUDA's default stream. This function is being called from output_builder.advance_output, which does not currently accept a stream. This is problematic since the default stream implicitly synchronizes with all streams.

Expected behavior
From a quick analysis it is not clear to me whether advance_output or inplace_resize should be running on the scan_stream or the read_stream inside multibyte_split. Based on where advance_output is called, I suspect that it actually requires an explicit synchronization between these two streams, and perhaps the existing events are already intended to solve that need. Fixing this would require modifying this function to run on the appropriate stream, and perhaps introducing an explicit synchronization or some appropriate event waiting.

@vyasr vyasr added bug Something isn't working Needs Triage Need team to review and classify labels Oct 14, 2022
@vyasr
Contributor Author

vyasr commented Oct 14, 2022

Perhaps @upsj or @cwharris have some insight into the appropriate fix here.

@upsj
Contributor

upsj commented Oct 15, 2022

Does this actually synchronize? device_uvector doesn't and device_buffer doesn't either in the shrinking case, as far as I can tell. It only modifies the size member appropriately.
I have a choice between two approaches: use device_uvector as a fixed-size buffer and store its actual size separately, or use size and capacity to track how many elements inside the vector are already in use. What do you think: should this behavior be marked/documented more explicitly, or should the size be stored separately?
Without the sync-less resize, having a separate capacity and size seems unnecessary to me, and the reserve method no longer has a valid use case.

@upsj upsj self-assigned this Oct 15, 2022
@vyasr
Contributor Author

vyasr commented Oct 15, 2022

Does this actually synchronize? device_uvector doesn't and device_buffer doesn't either in the shrinking case, as far as I can tell. It only modifies the size member appropriately.

You're right that neither of those synchronizes. However, device_buffer::resize sets the stream used for deallocating the buffer, which ends up triggering a sync on that stream when the destructor of the uvector is called.

Both char_storage and row_offset_storage are function-scoped, so they are only destroyed at the end of the function when its frame is popped off the stack. join_streams is called well before that, and it looks like the input stream is set to wait on both streams in the pool. So I think all that needs to happen is for the input stream to be passed through advance_output to inplace_resize, so that the vectors in the two output_builders (char_storage and row_offset_storage) are destroyed on the user-provided input stream, i.e. after all work in multibyte_split is done but without using the default stream.

Avoiding the default stream saves us triggering a sync at the end of multibyte_split, which will be undesirable once we expose a truly stream-ordered API that allows users to run multiple libcudf APIs (or other external functions) simultaneously on different streams.

@upsj
Contributor

upsj commented Oct 16, 2022

Thanks for the details; I'll adopt your suggestions.

@vyasr
Contributor Author

vyasr commented Oct 17, 2022

@upsj in the interest of making sure everything works together, I went ahead and implemented my proposed solution in #11875 to save you the time. Tests still pass for me locally, so I think it should be OK, but I'll keep you apprised if something goes wrong!

rapids-bot bot pushed a commit that referenced this issue Oct 21, 2022
This PR ensures that cudf's default stream is properly passed to all kernel launches so that nothing implicitly runs on the CUDA default stream. It adds a small library that is built during the tests and overloads CUDA functions to throw an exception when usage of the default stream is detected. It also fixes all remaining usage of anything other than cudf's default stream (I fixed most of the issues in previous PRs, but I found a few others when finalizing this one).

Resolves #11929
Resolves #11942 

### Important notes for reviewers:
- **The changeset is deceptively large.** The vast majority of the changes are just a global find-and-replace of `cudf::get_default_stream()` for `cudf::default_stream_value`, as well as a few smaller fixes such as missing `CUDF_TEST_PROGRAM_MAIN` in a couple of tests and usage of `rmm::cuda_stream_default`. The meaningful changes are:
    - The new default stream getter/setter in `default_stream.[hpp|cpp]`
    - The addition of `cpp/tests/utilities/identify_stream_usage`
    - The changes to the base testing fixture in `cpp/include/cudf_test/base_fixture.hpp` to inject the custom stream.
    - The changes to CI in `ci/gpu/build.sh` to build and use the new library.
- This PR is a breaking change because it moves the default stream into the detail namespace. Going forward the default stream may only be accessed using the public accessor `cudf::get_default_stream()`. I have added a corresponding setter, but it is also in the detail namespace since I do not want to publicly support changing the default stream yet; for now it exists only for testing. Reviewers, please leave comments if you disagree with those choices.
- I have made getting and setting the default stream thread-safe, but there is still only a single stream. In multi-threaded applications we may want to support a stream per thread so that users could manually achieve PTDS with more fine-tuned control. Is this worthwhile? Even if it is, I'm inclined to wait for a subsequent PR to implement this unless someone feels strongly otherwise.
- I'm currently only overloading `cudaLaunchKernel`. I can add overloads for other functions as well, but I didn't want to go through the effort of overloading every possible API. If reviewers have a minimal set that they'd like to see overloaded, let me know. [I've included links to all the relevant pages of the CUDA runtime API in the identify_stream_usage.cu file](https://github.com/rapidsai/cudf/pull/11875/files#diff-0b2762207c27c080acd2114475c7a1c06377a7c18c4e9c3de60ecbdc82a4dc61R99) if someone wants to look through them.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Bradley Dice (https://github.com/bdice)
  - Sevag H (https://github.com/sevagh)
  - https://github.com/brandon-b-miller
  - Jake Hemstad (https://github.com/jrhemstad)
  - David Wendt (https://github.com/davidwendt)

URL: #11875
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024