
Use same GPU stream for all kernels #296

Merged 34 commits into develop on Jul 24, 2023
Conversation

MrBurmark (Member)

Use the same GPU stream for all kernels

Use a specific GPU stream for all CUDA/HIP kernels. This is done by using the same resource for all kernels. By default this is the RAJA default stream, but it can be changed to stream 0 with the --gpu_stream_0 command line argument.
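The approach described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: `getGpuResource` and `runKernel` are hypothetical names, and the camp calls (`Cuda::CudaFromStream`, `Cuda::get_default`) are assumed to be available in the camp version in use.

```cpp
// Illustrative sketch (not the PR's code): pick one camp resource for the
// whole run, so every kernel launches on the same GPU stream.
#include "RAJA/RAJA.hpp"
#include "camp/resource.hpp"

// Hypothetical helper: --gpu_stream_0 selects the legacy stream 0,
// otherwise the RAJA (camp) default stream is used.
camp::resources::Cuda getGpuResource(bool use_stream_0)
{
  if (use_stream_0) {
    return camp::resources::Cuda::CudaFromStream(0);
  }
  return camp::resources::Cuda::get_default();
}

void runKernel(camp::resources::Cuda& res, double* x, int N)
{
  // Passing the resource routes the launch onto res's stream,
  // rather than whatever stream the policy would pick by default.
  RAJA::forall<RAJA::cuda_exec_async<256>>(res,
      RAJA::RangeSegment(0, N),
      [=] __device__ (int i) { x[i] += 1.0; });
}
```

Because every kernel receives the same resource, all launches serialize onto one stream regardless of which backend policy each kernel uses.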

MrBurmark (Member, Author)

One thing I didn't think too deeply about while making this PR is what to do with calls to (non-async) cudaMemcpy. I believe those API calls run on stream 0. That isn't a correctness problem, since the RAJA default stream synchronizes with stream 0, but it isn't really in the spirit of this PR either. I'm thinking about adding a resource argument to some of our memory helper functions to work around this.
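One possible shape for such a resource-aware helper (a sketch under the assumption that the suite would add a hypothetical `copyData` taking a camp resource; not the suite's actual API):

```cpp
// Hypothetical helper: do the copy on the resource's stream rather than
// letting a blocking cudaMemcpy run on stream 0.
#include <cuda_runtime.h>
#include "camp/resource.hpp"

void copyData(camp::resources::Cuda& res,
              void* dst, const void* src, size_t nbytes)
{
  cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDefault, res.get_stream());
  res.wait();  // block only on this stream, preserving blocking-copy semantics
}
```

Synchronizing only the resource's stream keeps the helper's blocking behavior while avoiding the implicit cross-stream ordering that plain cudaMemcpy imposes.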

rhornung67 (Member)

@MrBurmark the memcpy calls are outside the kernel timing regions, so does it matter?

MrBurmark force-pushed the feature/burmark1/gpu_stream branch from 4f84161 to 120c5f2 on January 20, 2023 17:44
MrBurmark (Member, Author) commented Jan 20, 2023

Most of them don't matter, but some are used in the timed loop of reduction kernels. I've been trying it out, and it looks like there is a performance penalty for using implicitly synchronized streams compared to a single stream. I'm going to rewrite those memory calls to explicitly use cuda/hipMemcpyAsync and streamSynchronize.

MrBurmark (Member, Author)

I made the change for the kernels that use memcpy and synchronize calls in the timed portion. They now use memcpyAsync + streamSynchronize, or streamSynchronize alone, so that they run on their stream.
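For a reduction kernel, the timed portion now looks roughly like this (a sketch of the pattern described above; `res`, `hsum`, and `dsum` are hypothetical names for the kernel's resource and its host/device result storage):

```cpp
// Copy the reduction result back asynchronously on the kernel's own stream,
// then synchronize that stream only. This avoids the implicit cross-stream
// synchronization that a plain (blocking) cudaMemcpy on stream 0 would cause.
cudaStream_t stream = res.get_stream();
cudaMemcpyAsync(&hsum, dsum, sizeof(hsum),
                cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);
```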

rhornung67 (Member) left a comment

A lot of changes for something we should have thought about earlier, huh? 😄 Thank you for working through this.

MrBurmark requested review from artv3 and rhornung67 on July 7, 2023 20:10
MrBurmark force-pushed the feature/burmark1/gpu_stream branch from 46fd529 to 5517fc0 on July 10, 2023 18:12
rhornung67 mentioned this pull request on Jul 10, 2023 (24 tasks)
MrBurmark force-pushed the feature/burmark1/gpu_stream branch from 89fb9f5 to 39b748f on July 10, 2023 22:00
MrBurmark force-pushed the feature/burmark1/gpu_stream branch from fc2ce08 to 962bac4 on July 21, 2023 16:59
artv3 (Member) left a comment

Wonderful!

rhornung67 (Member) commented Jul 24, 2023

@MrBurmark let's merge this one next when it gets through tioga CI.

Also, we need to make sure that all new kernels are following this pattern.

MrBurmark enabled auto-merge July 24, 2023 20:19
MrBurmark merged commit f713c27 into develop Jul 24, 2023
MrBurmark deleted the feature/burmark1/gpu_stream branch July 24, 2023 23:18