ch4/ofi: ssend reply is not safe with multiple messages #5574
Comments
I guess it depends on completion order on the recv side? The way we ensure acks go to the right place in AM code is to use the send request handle. Could we do that here by setting the match bits to the request handle value? Or is there an issue with fitting all bits in the available tag space?
There are 64 bits in the match bits: the lower 32 bits for the tag, up to 20 bits for the context id, then at least 4 bits for misc flags. That leaves only 8 bits, which isn't enough to hold a request handle. I am thinking that when the cq_data size is 8 (which seems typical), we could use cq_data for both the source rank and the request handle. I am experimenting with this in #5575.
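The bit budget described in this comment can be checked with a few lines of arithmetic. This is only a sketch: the constant names are made up for illustration and are not MPICH identifiers, and the 32-bit handle width used in the usage line is an assumption, not something stated in the thread.

```python
# Field widths as quoted in the comment above. The names are hypothetical,
# chosen for this sketch, and do not come from the MPICH source.
MATCH_BITS = 64       # total width of the match bits
TAG_BITS = 32         # lower 32 bits: tag
CONTEXT_ID_BITS = 20  # up to 20 bits: context id
FLAG_BITS = 4         # at least 4 bits: misc flags


def spare_match_bits():
    """Bits left in the match bits once tag, context id and flags are placed."""
    return MATCH_BITS - TAG_BITS - CONTEXT_ID_BITS - FLAG_BITS


def fits_in_spare_bits(handle_bits):
    """True if a handle of the given width fits in the leftover bits."""
    return handle_bits <= spare_match_bits()
```

With these widths, `spare_match_bits()` returns 8, and `fits_in_spare_bits(32)` is false under the assumed 32-bit handle, which is the reason the comment looks to cq_data instead.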
You don't need context id or tag in this case. Just sync bit and request handle, right?
Oh, it's about sending the
Oh, right. I get it now.
There are also some cases where you don't get all 64 bits. I don't remember how many bits we use for a request handle.
Yes. For those cases, we have to resort to direct tag matching. But if we have the extra cq_data bits, we can use them for more robustness. In PR #5575, we choose the mechanism based on the runtime cq_data_size.
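A rough sketch of the runtime choice described here, for illustration only. It does not reproduce the code in PR #5575: the function names, the 8-byte threshold check, and the even 32/32 split of cq_data between source rank and request handle are all assumptions made for this example.

```python
# Hypothetical layout: high 32 bits = source rank, low 32 bits = request handle.
# The split is an assumption for this sketch, not taken from the PR.
HANDLE_BITS = 32
HANDLE_MASK = (1 << HANDLE_BITS) - 1


def choose_ack_mechanism(cq_data_size):
    """Pick how the ssend ack is routed, given the provider's cq_data size in bytes."""
    if cq_data_size >= 8:
        return "cq_data"    # enough room to carry rank and request handle
    return "tag_match"      # fall back to direct tag matching


def pack_cq_data(source_rank, request_handle):
    """Pack rank and handle into one 64-bit value."""
    return ((source_rank & HANDLE_MASK) << HANDLE_BITS) | (request_handle & HANDLE_MASK)


def unpack_cq_data(cq_data):
    """Recover (source_rank, request_handle) from a packed 64-bit value."""
    return cq_data >> HANDLE_BITS, cq_data & HANDLE_MASK
```

For example, `unpack_cq_data(pack_cq_data(3, 0x2C000001))` gives back `(3, 0x2C000001)`, so an ack carrying the packed value can name the exact send request it belongs to.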
When multiple
In the native path, the ssend receiver sends an ack to the sender via a direct tag match. When the sender issues multiple ssends to the same receiver with the same tag, the ack is not guaranteed to match the send it was meant for.
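The failure mode can be illustrated with a small model. This is a toy simulation, not MPICH code: the function names and the `(handle, tag)` tuples are invented for the sketch, and it assumes the receiver acknowledges the sends in a different order than the sender posted its ack receives (for example because of completion order on the receive side).

```python
def deliver_acks_by_tag(posted, acks):
    """Route each ack to the oldest pending ack-receive with the same tag.

    posted: (send_handle, tag) pairs in the order the sender posted them.
    acks:   (send_handle, tag) pairs in arrival order, where send_handle is
            the send that the receiver actually matched.
    Returns {completed send handle: send handle that was really matched}.
    """
    pending = list(posted)
    completed = {}
    for acked_handle, tag in acks:
        for i, (handle, posted_tag) in enumerate(pending):
            if posted_tag == tag:
                completed[handle] = acked_handle
                del pending[i]
                break
    return completed


def deliver_acks_by_handle(posted, acks):
    """Route each ack to the send whose handle it carries."""
    pending = {handle for handle, _ in posted}
    completed = {}
    for acked_handle, _ in acks:
        if acked_handle in pending:
            pending.remove(acked_handle)
            completed[acked_handle] = acked_handle
    return completed
```

With `posted = [(1, 7), (2, 7)]` and a single ack `[(2, 7)]`, tag-based routing returns `{1: 2}`: send 1 is reported complete although only send 2 was matched. Handle-based routing returns `{2: 2}`, which is the behavior synchronous send semantics require.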
The current multi-thread ssend tests in
test/mpi/threads/pt2pt
are insufficient to catch this issue because they don't really check the synchronous-send semantics, which are difficult to check. A test passes as long as every ssend gets ack'ed, even if the acks are mismatched.
TODO
* maybe we can send multiple messages, both large and small, all with the same tag, and time each ssend with MPI_Wtime. If sending a small message takes longer than sending a large message, then we know it is wrong.
❌ different threads contend and the timing isn't reliable.