osc rdma/tcp: Wrong answer with c_post_start in MTT #8983
Hi @awlauria, I tried to reproduce this error, but I was not able to. I noticed that the failure is on IBM powerpc64le, which I do not have access to. On a system with an Intel CPU, I did not encounter any error. On a system with an AWS Graviton 2 CPU, I encountered a different error (about an error in "MPI_Win_Create"). I will continue debugging this issue, but I doubt it is the same issue reported here. Were you able to see this error outside powerpc64le? Is there a way for me to reproduce it?
Hi, I made a mistake when testing on Graviton 2: I was not using the Open MPI master branch but Open MPI 4.1.2. Once I started using ompi master on Graviton 2, I was able to reproduce the data validation error, though to a lesser degree than in the IBM CI. In the IBM CI, the test fails consistently, and each time there are ~16,000 errors. On Graviton 2, the test fails about 10% of the time, and when it fails there are usually 1 or 2 data errors. Anyway, I think the reproducer I have now is good enough for me to continue working on this issue.
With help from @bwbarrett, I think I know the root cause of this problem. It is related to the completion semantics that osc/rdma requires from the btl: the btl's completion must be delivery complete, which means that when the btl reports a put operation as completed, the data must have arrived at the target buffer.

btl/tcp's RDMA does not satisfy this requirement. btl/tcp uses send to emulate put, so when btl/tcp reports a put operation as finished, the data has been sent to the target rank, but it is not guaranteed that the message has been processed by the target rank.

One thing worth noting is that this is a new issue (NOT a regression), because ompi 5.0.x is the first series in which osc/rdma can use btl/tcp. Prior to 5.0.x, btl/tcp did not support put, so osc/rdma could not use it.

IMO, the solution to this issue is to stop using the optimization. Lacking some context, I do not quite understand the motivation for the optimization.
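To make the failure mode concrete, here is a minimal sketch of the post/start/complete/wait (PSCW) pattern that c_post_start exercises. This is an illustrative two-rank example, not the actual test source; buffer sizes and values are made up.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    int buf = -1;
    MPI_Win win;
    MPI_Group world_group, peer_group;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    /* Each rank exposes one int through the window. */
    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    if (0 == rank) {
        /* Origin: access epoch toward rank 1. */
        int target = 1, value = 42;
        MPI_Group_incl(world_group, 1, &target, &peer_group);
        MPI_Win_start(peer_group, 0, win);
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        /* osc/rdma must not tell the target the epoch is complete until the
         * put has actually been delivered; if the btl reports completion when
         * the data has merely been sent, the completion notification can
         * overtake the data. */
        MPI_Win_complete(win);
        MPI_Group_free(&peer_group);
    } else if (1 == rank) {
        /* Target: exposure epoch for rank 0. */
        int origin = 0;
        MPI_Group_incl(world_group, 1, &origin, &peer_group);
        MPI_Win_post(peer_group, 0, win);
        MPI_Win_wait(win);
        /* After MPI_Win_wait returns, the put from the completed epoch must
         * be visible; if the completion message raced ahead of the data,
         * buf can still hold -1: the wrong answer. */
        printf("rank 1 sees buf = %d (expected 42)\n", buf);
        MPI_Group_free(&peer_group);
    }

    MPI_Group_free(&world_group);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```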
https://github.com/open-mpi/ompi/blob/master/opal/mca/btl/base/btl_base_am_rdma.c#L1117 I can probably do that later today, but if someone beats me to it I can review the PR.
I can look at making AM-RDMA support that.
@hjelmn If the two processes are already on the same node, why would they even use tcp (instead of vader) to communicate?
@wzamazon in the test case you're running, TCP is being used because you set the BTL list to only include tcp and self.
I see. So
I don't understand how such an optimization would work with the vader BTL on a machine without SMSC support. Can someone clarify this for me, please?
@bosilca It wouldn't. The optimization was focused on two scenarios: 1) all on-node with MPI_Win_allocate (if for some reason osc/sm is not wanted), and 2) btl/openib with globally visible atomics (now btl/ofi). It was not intended for any of the AM-RDMA scenarios. Not sure why I added that flag.
Oh, and the second scenario is not well tested either. I did not get much of a chance to test btl/ofi with that optimization. Maybe someone with omnipath should test that to make sure it works.
I can confirm that btl/ofi does request FI_DELIVERY_COMPLETE from libfabric, so it is up to each libfabric provider to decide whether to support delivery complete. The EFA provider does support FI_DELIVERY_COMPLETE.
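For reference, a minimal sketch of how a libfabric consumer can ask for delivery-complete semantics through fi_getinfo hints. This is illustrative only, not the actual btl/ofi initialization code; the capability bits, endpoint type, and API version chosen here are assumptions.

```c
/* Illustrative sketch of requesting FI_DELIVERY_COMPLETE from libfabric;
 * not the actual btl/ofi code. Providers that cannot honor the hints will
 * simply not be returned by fi_getinfo(). */
#include <rdma/fabric.h>

int request_delivery_complete(struct fi_info **info_out)
{
    struct fi_info *hints = fi_allocinfo();
    if (NULL == hints) {
        return -1;
    }

    hints->caps = FI_RMA | FI_ATOMIC;       /* assumed capability set */
    hints->ep_attr->type = FI_EP_RDM;
    /* Ask that transmit completions mean the data has reached the target
     * buffer, not merely that it has left the local NIC. */
    hints->tx_attr->op_flags = FI_DELIVERY_COMPLETE;

    int ret = fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, info_out);
    fi_freeinfo(hints);
    return ret;
}
```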
Ok, so that should be good. The only fix needed is in am-rdma then.
Ah - thanks @wzamazon
I tried turning off the flag MCA_BTL_ATOMIC_SUPPORTS_GLOB in https://github.com/open-mpi/ompi/blob/master/opal/mca/btl/base/btl_base_am_rdma.c#L1117 as Nathan suggested. However, it resulted in a different failure.
I will continue investigating.
I did some more code reading/testing and found something interesting/confusing.

As we know, when processes are on the same instance and CPU atomics are not used, each process uses the selected btl to update counters. What I found is that the current implementation always routes these updates through the local leader's shared-memory region.

To give an example: rank 0 and rank 1 are on the same instance. rank 0 is the local leader, so it sets up the shm region, and rank 1 maps its memory region into the shm region set up by rank 0. When rank 0 wants to update one of rank 1's counters, instead of sending a message to rank 1, rank 0 sends a message to itself to update the part of the shm region mapped to rank 1; rank 1's counter is thus updated through shm.

This does not work for btl/tcp, because btl/tcp cannot do self communication (I mean the bml cannot find such an endpoint). Even if btl/tcp could do self communication, I doubt it is the right approach. I think we should update the counter over the same communication channel used to send the message, e.g. if rank 0 wants to update rank 1's counter, it should send the message directly to rank 1.

I will work toward using the same communication channel to update the counter, but I want to know whether there is a particular reason for the current approach (always using the local leader).
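To make the two code paths concrete, here is a hypothetical sketch of the current local-leader/shm approach versus the proposed direct-send approach. All names below (peer_state_t, btl_send_counter_update, etc.) are invented for illustration and are not the actual osc/rdma symbols.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Stub standing in for "send a counter-update message to the target rank
 * over the btl that carries the data"; invented for this sketch. */
static void btl_send_counter_update(int target_rank, long delta)
{
    printf("send counter update (+%ld) directly to rank %d\n", delta, target_rank);
}

/* Hypothetical per-peer state; not the actual osc/rdma structure. */
typedef struct {
    _Atomic long *shm_counter;  /* peer's counter mapped into the local leader's shm */
    int rank;                   /* peer's rank */
} peer_state_t;

/* Current approach: on the same node the origin updates the peer's counter
 * through the shared-memory region set up by the local leader (CPU atomics
 * shown here; the non-atomics variant sends a message "to itself", which
 * btl/tcp cannot express). */
static void update_counter_via_local_leader(peer_state_t *peer, long delta)
{
    atomic_fetch_add_explicit(peer->shm_counter, delta, memory_order_relaxed);
}

/* Proposed approach: update the counter over the same channel used for the
 * data, i.e. send the update directly to the target rank. */
static void update_counter_via_btl(peer_state_t *peer, long delta)
{
    btl_send_counter_update(peer->rank, delta);
}

int main(void)
{
    _Atomic long counter = 0;
    peer_state_t peer = { .shm_counter = &counter, .rank = 1 };

    update_counter_via_local_leader(&peer, 1);
    update_counter_via_btl(&peer, 1);
    printf("shm counter is now %ld\n", (long) atomic_load(&counter));
    return 0;
}
```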
@hjelmn can chime in with the right answer, but my guess is that he was trying to save a bunch of small messages when ranks were all on the same node; you could then aggregate update messages and avoid sending N messages to the same remote host. But that does mean that the BTL either needs to provide completions indicating that the remote side has received the message, or we need to use ordered channels for the completion messages.
Opened #9400, which implements the proposed solution.
Should be limited to the multi btl case only. btl/ugni will break if the local leader logic is not used there. Not to mention a ppn factor increase in memory registrations.
@hjelmn Thanks for chiming in! I have a few questions regarding using the local leader.

The original data corruption happened with a single BTL, and that btl is tcp. In that case, we cannot use CPU atomics (as discussed), nor can we use the local leader (correctness issue).

Can you elaborate on the btl/ugni case? Why would btl/ugni require using the local leader? It seems we might need another flag to control the use of the local leader.
@wzamazon Hello. As you know, this issue is a blocker for Open MPI v5.0.0. Thanks.
This is now fixed in the master/v5.0.x MTT based on last night's (3/6/2022) results. Closing - thanks @bwbarrett and @wzamazon
MTT failure example
Test source
Run command (single node):
./exports/bin/mpirun --np 2 --mca osc rdma --mca btl tcp,self ./c_post_start
I already triaged a similar failure. This is new with the support for osc/rdma over btl/tcp added to master and v5.0.x.