ch4/ofi: ssend reply is not safe with multiple messages #5574

Closed
1 task
hzhou opened this issue Sep 28, 2021 · 8 comments

@hzhou
Contributor

hzhou commented Sep 28, 2021

In the native path, the ssend receiver sends an ack to the sender via a direct tag match. When the sender sends multiple ssends to the receiver with the same tag, the reply is not guaranteed to match the original send.
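
A minimal sketch of the mismatch, written as a plain-Python simulation rather than MPICH code. It assumes the sender posts one ack receive per ssend in send order, that tag-only matching hands each arriving ack to the oldest pending receive with that tag, and that the receiver acks in completion order. The function names and handle values are hypothetical.

```python
from collections import deque


def match_acks_by_tag(posted, acks):
    """Pair each arriving ack with the oldest pending ack receive of the same tag.

    posted: (handle, tag) pairs in the order the sender posted its ack receives.
    acks:   (intended_handle, tag) pairs in the order the acks arrive.
    Returns (completed_handle, intended_handle) pairs.
    """
    pending = {}
    for handle, tag in posted:
        pending.setdefault(tag, deque()).append(handle)
    return [(pending[tag].popleft(), intended) for intended, tag in acks]


def match_acks_by_handle(posted, acks):
    """Pair each ack with the send it names (requires the handle in the ack)."""
    pending = {handle for handle, _tag in posted}
    result = []
    for intended, _tag in acks:
        pending.remove(intended)
        result.append((intended, intended))
    return result


TAG = 7
LARGE, SMALL = 0x100, 0x101             # hypothetical send request handles
posted = [(LARGE, TAG), (SMALL, TAG)]   # the large message is sent first
acks = [(SMALL, TAG), (LARGE, TAG)]     # the small message completes first

by_tag = match_acks_by_tag(posted, acks)
by_handle = match_acks_by_handle(posted, acks)
print(by_tag)     # each pair: (send that was completed, send the ack was for)
print(by_handle)
```

With tag-only matching, the ack meant for the small send completes the large one and vice versa. Keying on the handle pairs them correctly.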

The current multi-threaded ssend tests in test/mpi/threads/pt2pt are insufficient to catch this issue because they don't really check the semantics of the send, which is difficult to check. The tests pass as long as all the ssends get ack'ed, regardless of mismatches.

TODO

  • make a test to confirm this issue
    * maybe we can send multiple messages, both large and small, all with the same tag, and measure the Wtime of each ssend. If sending a small message takes longer than sending a large message, then we know it is wrong.
    ❌ different threads contend and the timing isn't reliable.
@raffenet
Contributor

I guess it depends on completion order on the recv side? The way we ensure acks go to the right place in AM code is to use the send request handle. Could we do that here by setting the match bits to the request handle value? Or is there an issue with fitting all bits in the available tag space?

@hzhou
Contributor Author

hzhou commented Sep 28, 2021

There are 64 bits in the match bits: the lower 32 bits for the tag, up to 20 bits for the context id, then at least 4 bits for misc flags. So we only have 8 bits left, which isn't enough to hold a request handle.
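
The bit budget above can be checked with a short sketch. Only the field widths and the tag's position in the lower 32 bits come from the comment. The placement of the other fields, the 32-bit handle width, and the name `pack_match_bits` are assumptions for illustration, not the actual ch4/ofi layout.

```python
MATCH_BITS = 64
TAG_BITS = 32          # lower 32 bits hold the tag
CONTEXT_ID_BITS = 20   # "up to 20 bits"
FLAG_BITS = 4          # "at least 4 bits"
HANDLE_BITS = 32       # assumption: a request handle is a 32-bit integer

spare_bits = MATCH_BITS - TAG_BITS - CONTEXT_ID_BITS - FLAG_BITS
handle_fits = HANDLE_BITS <= spare_bits


def pack_match_bits(flags, context_id, tag):
    """Pack the three fields into one 64-bit value.

    Illustrative layout only: tag in bits 0-31, context id in bits 32-51,
    spare bits 52-59 left at zero, flags in bits 60-63.
    """
    for value, width in ((flags, FLAG_BITS),
                         (context_id, CONTEXT_ID_BITS),
                         (tag, TAG_BITS)):
        if not 0 <= value < (1 << width):
            raise ValueError("field does not fit in its allotted bits")
    flag_shift = MATCH_BITS - FLAG_BITS
    return (flags << flag_shift) | (context_id << TAG_BITS) | tag


print(spare_bits, handle_fits)
print(hex(pack_match_bits(0xF, 0xFFFFF, 0xFFFFFFFF)))
```

Even with every field at its maximum, only the 8 spare bits stay free, which is too few for a handle under the 32-bit assumption.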

I am thinking that when the cq_data size is 8 (which seems typical), we could use cq_data for both the source rank and the request handle. I am experimenting with this in #5575.
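
A rough sketch of the idea, assuming the size of 8 means 8 bytes (64 bits) and that the rank and the handle each fit in 32 bits. The layout and the names `pack_cq_data` and `unpack_cq_data` are hypothetical. The actual encoding in #5575 may differ.

```python
RANK_BITS = 32
MASK32 = 0xFFFFFFFF


def pack_cq_data(source_rank, sreq_handle):
    """Combine source rank (low 32 bits) and send request handle (high 32 bits)."""
    if not 0 <= source_rank <= MASK32:
        raise ValueError("source rank does not fit in 32 bits")
    if not 0 <= sreq_handle <= MASK32:
        raise ValueError("request handle does not fit in 32 bits")
    return (sreq_handle << RANK_BITS) | source_rank


def unpack_cq_data(cq_data):
    """Recover (source_rank, sreq_handle) on the receiver side."""
    return cq_data & MASK32, (cq_data >> RANK_BITS) & MASK32


cq_data = pack_cq_data(3, 0xAC000001)   # made-up rank and handle values
print(hex(cq_data))
print(unpack_cq_data(cq_data))
```

Once the receiver has the handle, it can echo it back in the ack so the sender can complete exactly the request that the ack belongs to.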

@raffenet
Contributor

You don't need context id or tag in this case. Just sync bit and request handle, right?

@hzhou
Contributor Author

hzhou commented Sep 28, 2021

> You don't need context id or tag in this case. Just sync bit and request handle, right?

Oh, it's about sending the sreq handle to the receiver in the first place. If receiver doesn't have the sreq handle, it can't use it in the ack.

@raffenet
Contributor

> > You don't need context id or tag in this case. Just sync bit and request handle, right?
>
> Oh, it's about sending the sreq handle to the receiver in the first place. If receiver doesn't have the sreq handle, it can't use it in the ack.

Oh, right. I get it now.

@wesbland
Contributor

There are also some cases where you don't get all 64 bits. I don't remember how many bits we use for a request handle.

@hzhou
Contributor Author

hzhou commented Sep 28, 2021

> There are also some cases where you don't get all 64 bits. I don't remember how many bits we use for a request handle.

Yes. For those cases, we have to resort to direct tag matching. But if we have the extra cq data bits, we can use them for more robustness. In PR #5575, we choose the mechanism based on the runtime cq_data_size.
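
The runtime choice might look roughly like the sketch below. The 8-byte threshold assumes a 32-bit source rank plus a 32-bit request handle. `choose_ack_mechanism` and its return values are hypothetical names, not identifiers from the PR.

```python
RANK_BYTES = 4     # assumption: 32-bit source rank
HANDLE_BYTES = 4   # assumption: 32-bit request handle


def choose_ack_mechanism(cq_data_size):
    """Pick how the ssend ack is routed, given the provider's cq_data size in bytes."""
    if cq_data_size >= RANK_BYTES + HANDLE_BYTES:
        return "cq_data"     # carry the sreq handle so the ack names its send
    return "tag_match"       # fall back to direct tag matching


for size in (0, 4, 8):
    print(size, choose_ack_mechanism(size))
```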

@hzhou
Contributor Author

hzhou commented Sep 28, 2021

When multiple ssends with the same tag are in flight, all the SYNC packets coming back are identical, so it doesn't really matter which one goes to which, as long as all the SYNC packets are accounted for in the end. The timing will be off, but in the end all will get completed. I don't think there is anything the user can detect.
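
A small simulation of this argument, assuming each arriving SYNC packet simply completes the oldest pending send with that tag. The handle names are made up. It tries every arrival order for three sends and checks that nothing is left pending.

```python
from itertools import permutations


def complete_with_identical_acks(handles, ack_order):
    """Complete the oldest pending send for each ack, whichever send it was for.

    Returns (sends in the order they completed, sends still pending).
    """
    pending = list(handles)
    completed = [pending.pop(0) for _ack in ack_order]
    return completed, pending


handles = ["sreq_a", "sreq_b", "sreq_c"]   # hypothetical pending ssends, same tag

all_done = True
mispaired = 0
for order in permutations(handles):
    completed, left = complete_with_identical_acks(handles, order)
    # Every send must end up completed exactly once.
    all_done = all_done and sorted(completed) == sorted(handles) and not left
    # Count the orders in which some ack completed a send it was not meant for.
    if any(done != intended for done, intended in zip(completed, order)):
        mispaired += 1

print(all_done, mispaired)
```

Most arrival orders pair at least one ack with the wrong send, yet the final state is the same in every case. Only the completion timing differs.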
