-
Notifications
You must be signed in to change notification settings - Fork 498
UCT/GDA: Use wqe_idx for rc_gda progress #10928
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
aadb4d2 to
fa5d4d4
Compare
| if (req != nullptr) { | ||
| comp = &req->comp; | ||
| req->device_ep = device_ep; | ||
| uct_device_completion_init(comp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can also remove the TODO comment
src/uct/api/device/uct_device_impl.h
Outdated
| #include <uct/ib/mlx5/gdaki/gdaki.cuh> | ||
|
|
||
| union uct_device_completion { | ||
| uct_rc_gda_completion_t rc; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rc -> rc_gda
src/uct/ib/mlx5/gdaki/gdaki.cuh
Outdated
| 0xffffff); | ||
| uct_rc_mlx5_gda_qedump("WQE", wqe_ptr, 64); | ||
| uct_rc_mlx5_gda_qedump("CQE", cqe64, 64); | ||
| ep->wqe_error = ep->sq_wqe_pi; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need some "atomic write" so it will be visible to all threads?
src/uct/ib/mlx5/gdaki/gdaki.cuh
Outdated
| if (ep->wqe_error && comp->wqe_idx < ep->wqe_error) { | ||
| return UCS_ERR_IO_ERROR; | ||
| } | ||
|
|
||
| status = (ucs_status_t)__shfl_sync(0xffffffff, status, 0); | ||
| __syncwarp(); | ||
| return status; | ||
| } else if (level == UCS_DEVICE_LEVEL_THREAD) { | ||
| return uct_rc_mlx5_gda_progress_thread(ep); | ||
| } else { | ||
| return UCS_ERR_UNSUPPORTED; | ||
| if (ep->sq_wqe_pi <= comp->wqe_idx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- why need to check ep->wqe_error separately?
IMOcomp->wqe_idx < ep->wqe_erroris enough - do we need some kind of atomic read/write for ep->wqe_error?
- maybe rename ep->wqe_error to ep->wqe_error_pi?
- if another thread updated ep->sq_wqe_pi but didn't yet update ep->wqe_error, we might think some tl_comp is success, while in fact it is failure.
src/uct/ib/mlx5/gdaki/gdaki_dev.h
Outdated
| uint64_t cqe_ci; | ||
| int sq_lock; | ||
|
|
||
| uint64_t wqe_error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe move it after sq_wqe_pi?
iyastreb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested with many threads, no hangs and perf is better than before:
-O 32 -T 1 7.7us
-O 32 -T 32 159us
-O 32 -T 64 153us
-O 16 -T 128 372us
-O 8 -T 256 1169us
-O 4 -T 512 1768us
|
This PR was hanging in cuda kernel perftest. So may be there is a real issue. |
No description provided.