Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perform synchronization on a worker thread #2025

Merged
merged 8 commits into from
Aug 12, 2023
Merged

Conversation

maleadt
Copy link
Member

@maleadt maleadt commented Aug 11, 2023

As recommended by NVIDIA, instead of polling the context/stream/event, use a dedicated thread to perform the synchronization on. This is supported on 1.9+, where we have support for foreign threads. It's not particularly fast, 5us per call, but it's significantly better than the previous slow path (which was at least 25us, and could sometimes stall for much longer when the event loop was busy).

TODO: try to improve performance of the core mechanism.

cc @vchuravy

Alternative to #2014; @lcw could you test whether this is acceptable? Note that it requires 1.9.2 or 1.10.

@maleadt maleadt added enhancement New feature or request performance How fast can we go? labels Aug 11, 2023
lib/cudadrv/events.jl Show resolved Hide resolved
# any user will just submit work that makes it block

# we don't know what the size of uv_thread_t is, so reserve enough space
tid = Ref{NTuple{32, UInt8}}(ntuple(i -> 0, 32))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should export an accessor from Julia to get this sizeof

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... 32 bytes ought to be enough for anybody? 😅

@codecov
Copy link

codecov bot commented Aug 11, 2023

Codecov Report

Patch coverage: 95.89% and project coverage change: +0.24% 🎉

Comparison is base (38fb707) 62.31% compared to head (e56c6a8) 62.55%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2025      +/-   ##
==========================================
+ Coverage   62.31%   62.55%   +0.24%     
==========================================
  Files         151      152       +1     
  Lines       12842    12920      +78     
==========================================
+ Hits         8002     8082      +80     
+ Misses       4840     4838       -2     
Files Changed Coverage Δ
lib/cudadrv/events.jl 94.11% <ø> (-1.81%) ⬇️
lib/cudadrv/state.jl 80.87% <ø> (+2.18%) ⬆️
lib/cudadrv/stream.jl 95.12% <ø> (+0.18%) ⬆️
src/pool.jl 69.03% <ø> (ø)
lib/utils/memoization.jl 90.56% <80.00%> (+0.36%) ⬆️
lib/cudadrv/synchronization.jl 96.06% <96.06%> (ø)
lib/cudadrv/context.jl 72.07% <100.00%> (-1.21%) ⬇️
lib/utils/threading.jl 92.30% <100.00%> (+8.09%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@lcw
Copy link
Contributor

lcw commented Aug 12, 2023

Thanks for working on this!

I had @aaustin141 rerun the benchmark we were using at JuliaCon. Here is
what he found.

This is pr is a significant improvement: the amount of dead time spent in
epoll_wait() is down from 100+ μs (and often a lot more than that) to
about 15-20 μs. See the before profile

before

and after profile

after

But it's still not as good as switching to blocking synchronization, in
which we have a scant 5 μs or so of dead time in cuStreamSynchronize()
before MPI starts. See the blocking profile below.

blocking

Here are some times required to run 20 V-cycles of @aaustin141's multigrid code
(1,048,576 degree-5 elements for 37,748,736 DG and 26,224,641 CG DoFs on
the finest level; times averaged over 3 runs):

Scheme                       Time (s)    Speedup (vs. 1 Rank)
-------------------------------------------------------------
1 Rank                       2.71        1.00
2 Ranks, Old Non-Blocking    1.85        1.46
2 Ranks, New Non-Blocking    1.71        1.58
2 Ranks, Blocking            1.58        1.72

So, we recommend adding the new non-blocking code but would still like a
blocking option.

@maleadt
Copy link
Member Author

maleadt commented Aug 12, 2023

So, we recommend adding the new non-blocking code but would still like a
blocking option.

OK, that's too bad. I added a preference to control the synchronization kind, which feels like a more idiomatic way than an environment variable (despite what I said earlier). Does that work too?

@lcw
Copy link
Contributor

lcw commented Aug 12, 2023

Yes, the preference is fine. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance How fast can we go?
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants