
osc rdma/tcp: Wrong answer with c_post_start in MTT #8983

Closed
awlauria opened this issue May 19, 2021 · 26 comments

@awlauria
Contributor

awlauria commented May 19, 2021

MTT failure example

Test source

Run command (single node):

./exports/bin/mpirun --np 2 --mca osc rdma --mca btl tcp,self ./c_post_start

I already triaged similar failures. This is new with the addition of osc/rdma support over btl/tcp on master and v5.0.x.
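
For context, c_post_start exercises MPI's general active-target synchronization (MPI_Win_post / MPI_Win_start / MPI_Win_complete / MPI_Win_wait). A minimal two-rank sketch of that pattern (not the actual MTT test source) looks like this:

/* minimal post/start/complete/wait sketch; run with 2 ranks */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, buf = 0;
    MPI_Win win;
    MPI_Group world_group, peer_group;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs != 2) MPI_Abort(MPI_COMM_WORLD, 1);

    /* group containing only the other rank */
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    int peer = (rank + 1) % 2;
    MPI_Group_incl(world_group, 1, &peer, &peer_group);

    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    if (rank == 0) {
        int value = 42;
        MPI_Win_start(peer_group, 0, win);   /* access epoch on rank 1 */
        MPI_Put(&value, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
        MPI_Win_complete(win);               /* origin-side completion */
    } else {
        MPI_Win_post(peer_group, 0, win);    /* expose the local window */
        MPI_Win_wait(win);                   /* wait for rank 0 to finish */
        printf("rank %d got %d\n", rank, buf);
    }

    MPI_Win_free(&win);
    MPI_Group_free(&peer_group);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}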

@wzamazon
Contributor

wzamazon commented Sep 8, 2021

Hi, @awlauria

I tried to reproduce this error, but I was not able to.

I noticed that the failure is on IBM powerpc64le, which I do not have access to.

On a system with an Intel CPU, I did not encounter any error.

On a system with an AWS Graviton 2 CPU, I encountered a different error (about an error in MPI_Win_create). I will continue debugging that issue, but I doubt it is the same issue reported here.

Were you able to see this error outside powerpc64le? Is there a way for me to reproduce it?

@wzamazon
Contributor

wzamazon commented Sep 8, 2021

Hi, I made a mistake when testing on Graviton 2: I was not using the Open MPI master branch, but Open MPI 4.1.2.

Once I started using ompi master on Graviton 2, I was able to reproduce the data validation error, though to a lesser degree than on the IBM CI. On the IBM CI the test fails consistently, with roughly 16,000 errors each run. On Graviton 2 the test fails about 10% of the time, and when it fails there are usually only 1 or 2 data errors.

Anyway, I think the reproducer I have now is good enough for me to continue working on this issue.

@wzamazon
Contributor

With help from @bwbarrett, I think I know the root cause of this problem, which is related to the completion semantics required by use_cpu_atomics.

use_cpu_atomics is an optimization that mixes the NIC's completion with CPU atomics. It is used when all processes are on a single node. In that case, the actual messages are delivered by the selected btl, but the counters that indicate completion (such as num_complete_msgs) are updated using CPU atomics (via the shared memory of osc/rdma).

Such an implementation requires the btl's completion to be delivery complete: when the btl reports that a put operation has completed, the data must already have arrived in the target buffer.

btl/tcp's RDMA does not satisfy this requirement. btl/tcp uses send to emulate put, so when it reports a put operation as finished, the data has merely been sent toward the target rank; it is not guaranteed that the message has been processed by the target rank.

When use_cpu_atomics is used with btl/tcp (on a single node), the counter num_complete_msgs is updated using CPU atomics after btl/tcp reports the put as finished. Because btl/tcp does not provide delivery-complete semantics, num_complete_msgs can be updated before the message has actually been delivered, hence the data corruption.
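
To illustrate the hazard outside of Open MPI, here is a small standalone sketch (plain C with threads, all names made up; not osc/rdma code) in which the completion counter is bumped with a CPU atomic before the payload has actually landed:

/* build with: cc -pthread race_sketch.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int window_buf = 0;          /* stands in for the RMA window   */
static atomic_int num_complete_msgs = 0;   /* completion counter ("via shm") */

static void *origin(void *arg)
{
    (void)arg;
    /* btl/tcp-style put: the data is handed to the transport and local
     * completion is reported right away... */
    atomic_fetch_add(&num_complete_msgs, 1);
    /* ...but actual delivery happens later (simulated by the delay) */
    usleep(1000);
    atomic_store(&window_buf, 42);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, origin, NULL);

    /* target side: wait on the counter, then read the window */
    while (atomic_load(&num_complete_msgs) == 0)
        ;   /* spin, like waiting in MPI_Win_wait */
    printf("window_buf = %d (expected 42)\n", atomic_load(&window_buf));

    pthread_join(t, NULL);
    return 0;
}

This usually prints 0: the waiter observes the counter update before the data, which is exactly the ordering a delivery-complete btl would forbid.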

One thing worth noting is that this is a new issue (NOT a regression), because ompi 5.0.x is the first release in which osc/rdma can use btl/tcp. Prior to 5.0.x, btl/tcp did not support put, so osc/rdma could not use it.

IMO, the solution to this issue is to stop using the use_cpu_atomics optimization when the selected btl does not support delivery complete.

Lacking some context, I do not quite understand the motivation for use_cpu_atomics. If we are already on a single node, why not use btl/vader for everything?

@hjelmn
Member

hjelmn commented Sep 13, 2021

use_cpu_atomics is intended for cases where HCA and CPU atomics can be mixed. The only network I know of where this is the case is Infinipath/Omnipath, where the atomic is done with the CPU. The completion semantics there are different from those of btl/tcp. So, the best fix is to remove the MCA_BTL_ATOMIC_SUPPORTS_GLOB flag from here:

https://github.com/open-mpi/ompi/blob/master/opal/mca/btl/base/btl_base_am_rdma.c#L1117

I can probably do that later today but if someone beats me to it I can review the PR.
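
For reference, the consumer side of that flag boils down to a check along these lines (a paraphrased sketch, assuming osc/rdma keys use_cpu_atomics off MCA_BTL_ATOMIC_SUPPORTS_GLOB; the helper name is hypothetical and it only builds inside the Open MPI tree):

#include <stdbool.h>
#include "opal/mca/btl/btl.h"

/* hypothetical helper, not the actual osc/rdma code */
static bool can_use_cpu_atomics(const mca_btl_base_module_t *btl, bool all_on_node)
{
    /* only mix CPU atomics with btl completions when the btl claims its
     * atomic updates are globally visible to both the CPU and the device */
    return all_on_node && (btl->btl_atomic_flags & MCA_BTL_ATOMIC_SUPPORTS_GLOB);
}

Removing the flag from the emulated (am-rdma) path then makes a check like this fail for btl/tcp, which disables the broken fast path.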

@hjelmn
Member

hjelmn commented Sep 13, 2021

I can look at making AM-RDMA support use_cpu_atomics later. It is a useful optimization when the network has high latency as is the case with anything TCP.

@wzamazon
Contributor

@hjelmn If the two processes are already on the same node, why would they even use tcp (instead of vader) to communicate?

@bwbarrett
Member

@wzamazon in the test case you're running, TCP is being used because you set the BTL list to only include tcp,self. So there's no shared memory btl active.

@wzamazon
Contributor

I see. So use_cpu_atomics is meant to accelerate the scenario where, for some reason, the user must use a particular btl to transfer messages.

@bosilca
Member

bosilca commented Sep 13, 2021

I don't understand how such an optimization would work with the vader BTL on a machine without SMSC support. Can someone clarify this for me, please?

@hjelmn
Member

hjelmn commented Sep 13, 2021

@bosilca It wouldn't. The optimization was focused on two scenarios: 1) all on-node with MPI_Win_allocate (if for some reason osc/sm is not wanted), and 2) btl/openib with globally visible atomics (now btl/ofi). It was not intended for any of the AM-RDMA scenarios. I'm not sure why I added that flag.

@hjelmn
Member

hjelmn commented Sep 13, 2021

Oh, and the second scenario is not well tested either. I did not get much of a chance to test btl/ofi with that optimization. Maybe someone with Omnipath should test it to make sure it works.

@wzamazon
Contributor

wzamazon commented Sep 13, 2021

I can confirm that btl/ofi does request FI_DELIVERY_COMPLETE from libfabric, so it is up to each libfabric provider to decide whether to support delivery complete. The EFA provider does support FI_DELIVERY_COMPLETE.
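
For what it's worth, requesting that behavior through libfabric's fi_getinfo hints looks roughly like this (a hedged sketch, not the btl/ofi code itself):

#include <stddef.h>
#include <rdma/fabric.h>

/* ask libfabric only for providers that can honor delivery-complete */
static struct fi_info *get_delivery_complete_info(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    hints->caps = FI_RMA;                              /* we need put/get */
    hints->tx_attr->op_flags = FI_DELIVERY_COMPLETE;   /* completion means
                                                          data landed at target */

    if (0 != fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info)) {
        info = NULL;   /* no provider offers delivery-complete semantics */
    }
    fi_freeinfo(hints);
    return info;
}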

@hjelmn
Member

hjelmn commented Sep 13, 2021

Ok, so that should be good. The only fix needed is in am-rdma then.

@gpaulsen
Member

gpaulsen commented Sep 16, 2021

Fixed in v5.0.x via #9381. Oops, I am incorrect. Please see @wzamazon's reply below.

@wzamazon
Contributor

@gpaulsen

Maybe there is a misunderstanding here.

#9381 did not fix this data corruption issue. It fixed a btl/ofi segmentation fault, for which I did not open an Open MPI issue.

My understanding is that @hjelmn is working on a fix for this (data corruption) issue.

@awlauria
Contributor Author

Ah - thanks @wzamazon

@wzamazon
Contributor

I tried turning off the MCA_BTL_ATOMIC_SUPPORTS_GLOB flag in

https://github.com/open-mpi/ompi/blob/master/opal/mca/btl/base/btl_base_am_rdma.c#L1117

as Nathan suggested. However, it resulted in a failure in MPI_Win_create. I set osc_base_verbose to 100, which printed more information:

[prterun-ip-172-31-48-130-28696@1,0]<stderr>:[ip-172-31-48-130.us-west-2.compute.internal:28700] could not find a suitable btl. falling back on active-message BTLs
[prterun-ip-172-31-48-130-28696@1,0]<stderr>:[ip-172-31-48-130.us-west-2.compute.internal:28700] checking for btl sm
[prterun-ip-172-31-48-130-28696@1,0]<stderr>:[ip-172-31-48-130.us-west-2.compute.internal:28700] skipping btl self
[prterun-ip-172-31-48-130-28696@1,0]<stderr>:[ip-172-31-48-130.us-west-2.compute.internal:28700] skipping btl tcp
[prterun-ip-172-31-48-130-28696@1,0]<stderr>:[ip-172-31-48-130.us-west-2.compute.internal:28700] checking for btl tcp
[prterun-ip-172-31-48-130-28696@1,0]<stderr>:[ip-172-31-48-130.us-west-2.compute.internal:28700] skipping btl self
[prterun-ip-172-31-48-130-28696@1,0]<stderr>:[ip-172-31-48-130.us-west-2.compute.internal:28700] found alternate btl tcp
[prterun-ip-172-31-48-130-28696@1,0]<stderr>:[ip-172-31-48-130.us-west-2.compute.internal:28700] allocating shared internal state
[prterun-ip-172-31-48-130-28696@1,1]<stderr>:[ip-172-31-48-130.us-west-2.compute.internal:28701] allocating shared internal state
[prterun-ip-172-31-48-130-28696@1,0]<stderr>:[ip-172-31-48-130.us-west-2.compute.internal:28700] failed to allocate internal state
[prterun-ip-172-31-48-130-28696@1,0]<stderr>:[ip-172-31-48-130.us-west-2.compute.internal:28700] rdma component destroying window with id 3
[prterun-ip-172-31-48-130-28696@1,1]<stderr>:[ip-172-31-48-130.us-west-2.compute.internal:28701] failed to allocate internal state
[prterun-ip-172-31-48-130-28696@1,1]<stderr>:[ip-172-31-48-130.us-west-2.compute.internal:28701] rdma component destroying window with id 3

I will continue investigating.

@wzamazon
Contributor

wzamazon commented Sep 20, 2021

I did some more code reading/testing and found something interesting/confusing.

As we know, when processes are on the same instance and CPU atomics are not used, each process uses the selected btl to update counters (such as num_complete_msgs) on the other process. However, what surprised me is that every process communicates with the local leader (instead of the peer) to update the peer's counter. When the local leader is not the peer, the peer's counter is updated through shared memory.

To give an example:

rank0 and rank1 are on the same instance.

rank0 is the local leader; it sets up the shared memory segment.

rank1 maps its memory region into the shm segment set up by rank0.

rank0 wants to update rank1's counter (such as num_complete_msgs) using the selected btl (since CPU atomics are not being used).

Instead of sending a message to rank1, rank0 sends a message to itself to update the shm region mapped by rank1.

rank1's counter is thus updated through shared memory.

This does not work for btl/tcp because btl/tcp cannot do self communication (I mean the bml cannot find such an endpoint).

Even if btl/tcp could do self communication, I doubt it is the right approach. I think we should update the counter over the same communication channel used to send the message; e.g. if rank0 wants to update rank1's counter, it should send the message directly to rank1.

I will work toward using the same communication channel to update the counter, but I would like to know whether there is a particular reason for the current approach (always going through the local leader).
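
To make the two paths concrete, here is an illustrative sketch (hypothetical names, compilable but not osc/rdma code) of how rank0 could bump rank1's num_complete_msgs, either through the shm segment set up by the local leader or directly over the btl channel that carried the data:

#include <stdint.h>

/* counters as they appear in the shared-memory segment set up by the leader */
struct peer_counters {
    int64_t num_complete_msgs;
};

/* current approach: the update is applied to the shm view of rank1's
 * counters, so rank1 sees it through shared memory, not through the btl */
static void update_via_local_leader_shm(struct peer_counters *shm_view_of_rank1)
{
    __atomic_fetch_add(&shm_view_of_rank1->num_complete_msgs, 1, __ATOMIC_RELEASE);
}

/* proposed approach: send the counter update over the same btl channel that
 * carried the data, directly to rank1 (btl_send is a placeholder callback) */
static void update_via_btl(void (*btl_send)(int dst_rank, int64_t delta))
{
    btl_send(1 /* rank1 */, 1 /* increment */);
}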

@bwbarrett
Member

@hjelmn can chime in with the right answer, but my guess is that he was trying to save a bunch of small messages when ranks were all on the same node; you could then aggregate update messages and avoid sending N messages to the same remote host. But that does mean that the BTL either needs to provide completions indicating that the remote side received the message, or that we need to use ordered channels for the completion messages.

@wzamazon
Contributor

Opened #9400, which implements the proposed solution.

@hjelmn
Member

hjelmn commented Sep 22, 2021

This should be limited to the multi-btl case only. btl/ugni will break if the local leader logic is not used there, not to mention a ppn-factor increase in memory registrations.

@wzamazon
Contributor

@hjelmn Thanks for chiming in! I have a few questions regarding the use of the local leader.

Should be limited to the multi btl case only.

The original data corruption happened with a single BTL, and that btl is tcp. In that case we cannot use CPU atomics (as discussed), nor can we use the local leader (correctness issue).

btl/ugni will break if the local leader logic is not used there

Can you elaborate on this case? Why would btl/ugni require the local leader?

It seems we might need another flag for using the local leader.

@wzamazon
Contributor

@hjelmn I updated #9400 so that the change applies only when active message RDMA is in use, i.e. only to btl/tcp and btl/vader (see the commit message for more details).

@gpaulsen
Member

gpaulsen commented Dec 2, 2021

@wzamazon Hello. As you know, this issue is a blocker for Open MPI v5.0.0.
In the comment above (from Sep 24) you referenced #9400, which was closed without being merged. You mentioned you were going to rework it. Could you please update the status here with the current pull request references?

Thanks.

@wzamazon
Contributor

wzamazon commented Dec 2, 2021

@gpaulsen #9400 has been replaced by #9696, which will fix the issue.

@awlauria
Contributor Author

awlauria commented Mar 7, 2022

This is now fixed in the master/v5.0.x MTT based on last night's (3/6/2022) results.

Closing - thanks @bwbarrett and @wzamazon

@awlauria awlauria closed this as completed Mar 7, 2022