Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL][CUDA][HIP][PI] Fix barrier #6490

Merged
merged 9 commits into from
Oct 6, 2022
Merged

Conversation

t4c1
Copy link
Contributor

@t4c1 t4c1 commented Jul 29, 2022

Fixes a bug in barrier implementation in CUDA and HIP plugins that often caused barrier not to work. The new implementation is also faster.

Tests in: intel/llvm-test-suite#1122

@t4c1
Copy link
Contributor Author

t4c1 commented Aug 17, 2022

I think none of these failures are actually related to the changes in this PR.

@@ -381,8 +381,25 @@ pi_result cuda_piEventRetain(pi_event event);

/// \endcond

void _pi_queue::compute_stream_wait_for_barrier_if_needed(CUstream stream,
pi_uint32 stream_i) {
if (barrier_event_ && !compute_applied_barrier_[stream_i]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to check if the barrier event has finished prior to this so we can potentially clear it and skip the cuStreamWaitEvent call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are proposing an additional CUDA API call for potentially every operation that might let us eliminate additional calls in the future. Whether this makes sense to do or not depends on whether we expect most streams will have work enqueued to them before or after all the work before the barrier is finished. If the answer is before, the current code will be more performant. If the answer is after, we want to do what you suggest.

My gut feeling says current implementation will be better for most use cases. However the reasoning from the previous paragraph assumes that cuStreamWaitEvent and cuEventQuery take roughly the same time. Looking into that could give us more information that could inform this decision. However, for now I would leave this as it is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this is potentially more work up-front, though I would expect cuEventQuery to be somewhat lightweight, but doing some benchmarking for it may make sense before a final decision is made on this. I am okay to keep as-is. 😄

sycl/plugins/cuda/pi_cuda.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@AerialMantis
Copy link
Contributor

@steffenlarsen @smaslov-intel is this okay to merge now or are there further reviews we should request? I also see there is a failure for the ESIMD job, I assume this is unrelated.

@steffenlarsen
Copy link
Contributor

ESIMD failures seem to happen on other PRs as well. Lets merge this.

@steffenlarsen steffenlarsen merged commit 1c3d598 into intel:sycl Oct 6, 2022
steffenlarsen pushed a commit to intel/llvm-test-suite that referenced this pull request Nov 7, 2022
Improves the test for barrier to make it actually fail if the barrier implementation does not work.

Tests intel/llvm#6490
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Mar 27, 2023
…m-test-suite#1122)

Improves the test for barrier to make it actually fail if the barrier implementation does not work.

Tests intel#6490
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants