@jsquyres jsquyres commented Oct 4, 2019

See individual commit messages for more details.

Rename "get_nsec()" to "get_ticks()" to more accurately reflect that
this function has no correlation to wall clock time at all.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit ce2910a)
New MCA parameter: btl_usnic_ack_iteration_delay.  Set this to the
number of passes through the usNIC component progress function to wait
before sending a standalone ACK (vs. piggy-backing the ACK on any other
send going to the target peer).

Use "ticks" language to clarify that we're really counting the number
of times through the usNIC component DATA_CHANNEL completion check (to
check for incoming messages) -- it has no relation to wall clock time
whatsoever.

Also slightly change the channel-checking scheme in usNIC component
progress: only check the PRIORITY channel once (vs. checking it once,
not finding anything, and then falling through to progress_2(), where
we check PRIORITY again and then check the DATA channel).

As before, if our "progress" libevent fires, increment the tick
counter enough to guarantee that all endpoints that need an ACK will
get triggered to send standalone ACKs the next time through progress,
if necessary.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 968b1a5)
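
For readers unfamiliar with the tick scheme described in the commit message above, here is a minimal sketch of how a tick-based standalone-ACK delay could work. It is not the usNIC BTL code: every type and function name below (`sketch_module_t`, `sketch_endpoint_t`, `sketch_*`) is hypothetical, and `ack_iteration_delay` merely stands in for the value of `btl_usnic_ack_iteration_delay`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not the real usNIC BTL structures. */
typedef struct {
    uint64_t ticks;               /* passes through the DATA channel check */
    uint64_t ack_iteration_delay; /* stand-in for btl_usnic_ack_iteration_delay */
} sketch_module_t;

typedef struct {
    bool     ack_needed;          /* we owe this peer an ACK */
    uint64_t ack_deadline_tick;   /* tick at which a standalone ACK is due */
} sketch_endpoint_t;

/* Ticks count progress passes; they have no relation to wall clock time. */
static uint64_t sketch_get_ticks(const sketch_module_t *m)
{
    assert(m != NULL);
    return m->ticks;
}

/* A frame arrived that must be ACKed: start the countdown if not running. */
static void sketch_note_ack_needed(const sketch_module_t *m,
                                   sketch_endpoint_t *ep)
{
    assert(m != NULL && ep != NULL);
    if (!ep->ack_needed) {
        ep->ack_needed = true;
        ep->ack_deadline_tick = m->ticks + m->ack_iteration_delay;
    }
}

/* A regular send to this peer carried the ACK: nothing is owed anymore. */
static void sketch_piggyback_ack(sketch_endpoint_t *ep)
{
    assert(ep != NULL);
    ep->ack_needed = false;
}

/* One pass through the DATA channel completion check. */
static void sketch_progress_tick(sketch_module_t *m)
{
    assert(m != NULL);
    ++m->ticks;
}

/* The timer event fired: jump far enough that every pending ACK is due. */
static void sketch_timer_fired(sketch_module_t *m)
{
    assert(m != NULL);
    m->ticks += m->ack_iteration_delay;
}

/* True if a standalone ACK should be sent on this pass. */
static bool sketch_standalone_ack_due(const sketch_module_t *m,
                                      const sketch_endpoint_t *ep)
{
    assert(m != NULL && ep != NULL);
    return ep->ack_needed && m->ticks >= ep->ack_deadline_tick;
}
```

In this sketch a deadline is never more than `ack_iteration_delay` ticks in the future, so adding that amount to the counter when the timer fires is enough to make every pending ACK due on the next pass.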
Significantly increase the default retrans timeout.  If the
retrans timeout is too short, we can end up in a retransmission storm
where the logic continually re-transmits the same frames during a
single run through the usNIC progress function (because the timer for
a single frame expires before we have finished re-transmitting all
the frames pending re-transmission).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 3cc95d8)
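
The storm condition described in the commit message above comes down to simple arithmetic: if re-sending all pending frames takes longer than one frame's timeout, the first frame is due again before the pass ends. The following sketch checks that condition. The function name, its parameters, and the assumption of a fixed per-frame resend cost are all hypothetical simplifications, not something taken from the usNIC BTL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when one pass over `pending_frames`
 * frames, each costing `cost_per_resend` time units to re-send, takes
 * longer than `retrans_timeout` (same units).  In that case the first
 * frame's timer expires again before the pass finishes.
 *
 * The test pending * cost > timeout is rewritten as
 * pending > timeout / cost (integer division) to avoid overflow.
 */
static bool sketch_resend_pass_outlasts_timeout(uint64_t pending_frames,
                                                uint64_t cost_per_resend,
                                                uint64_t retrans_timeout)
{
    if (0 == cost_per_resend) {
        return false; /* free resends can never outlast the timeout */
    }
    return pending_frames > retrans_timeout / cost_per_resend;
}
```

Under these assumptions, raising the timeout raises the number of pending frames that a single pass can handle before the first frame expires again.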
New MCA param: btl_usnic_max_resends_per_iteration.  This is the max
number of resends we'll do in a single pass through usNIC component
progress.  This prevents progress from getting stuck in an endless
loop of retransmissions (i.e., if more retransmissions are triggered
during the sending of retransmissions).  Specifically: we need to
leave the resend loop to allow receives to happen (which may ACK
messages we have sent previously, and therefore cause pending resends
to be moot).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 27e3040)
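
To illustrate the cap described in the commit message above, here is a minimal sketch of a resend loop bounded by a per-pass limit. The queue, the frame type, and all `sketch_*` names are hypothetical; `max_resends_per_iteration` stands in for the value of `btl_usnic_max_resends_per_iteration`. The loop deliberately models the worst case, where every re-sent frame immediately becomes eligible for another resend.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical frame and queue types; not the real usNIC BTL structures. */
typedef struct sketch_frame {
    struct sketch_frame *next;
    int resend_count;
} sketch_frame_t;

typedef struct {
    sketch_frame_t *head;
    sketch_frame_t *tail;
} sketch_resend_queue_t;

/* Append a frame to the tail of the pending-resend queue. */
static void sketch_queue_push(sketch_resend_queue_t *q, sketch_frame_t *f)
{
    assert(q != NULL && f != NULL);
    f->next = NULL;
    if (NULL == q->tail) {
        q->head = f;
    } else {
        q->tail->next = f;
    }
    q->tail = f;
}

/* Remove and return the frame at the head of the queue, or NULL if empty. */
static sketch_frame_t *sketch_queue_pop(sketch_resend_queue_t *q)
{
    assert(q != NULL);
    sketch_frame_t *f = q->head;
    if (NULL != f) {
        q->head = f->next;
        if (NULL == q->head) {
            q->tail = NULL;
        }
        f->next = NULL;
    }
    return f;
}

/*
 * Re-send at most max_resends_per_iteration frames, then return so the
 * caller can go check for receives (which may ACK frames and make the
 * remaining resends moot).  Returns the number of resends performed.
 */
static size_t sketch_do_resends(sketch_resend_queue_t *q,
                                size_t max_resends_per_iteration)
{
    size_t sent = 0;
    sketch_frame_t *f;

    /* The limit is tested first, so no frame is popped once it is hit. */
    while (sent < max_resends_per_iteration &&
           NULL != (f = sketch_queue_pop(q))) {
        ++f->resend_count;       /* stand-in for the actual transmit */
        sketch_queue_push(q, f); /* worst case: it is due again at once */
        ++sent;
    }
    return sent;
}
```

Without the `sent < max_resends_per_iteration` test, this worst-case loop would never terminate, which is the endless-loop situation the new parameter guards against.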
Move the prefix area from the head to the body in relevant size
computations.  This fixes a problem in high traffic situations where
usNIC may have sent from unregistered memory.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit fe7f772)
@jsquyres jsquyres added this to the v3.0.5 milestone Oct 4, 2019
It was previously accidentally set to 0.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 132e4ca)
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
(cherry picked from commit 3080033)
@nrescobar nrescobar left a comment

The changes look good.

Note: I am a coworker of Jeff's at Cisco.

@jsquyres
Member Author

@bwbarrett Can you approve and merge? I have no one left in the OMPI community who can review usNIC stuff, so I got a co-worker to review it for me (but he doesn't have write perms to the OMPI repo, so his review "doesn't count" in terms of github's requirements).

Thanks.

@bwbarrett bwbarrett merged commit 00f60ba into open-mpi:v3.0.x Oct 16, 2019
@jsquyres jsquyres deleted the pr/v3.0.x/usnic-fixes-and-optimizations branch October 16, 2019 19:36