Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix tree spawn at scale #6715

Merged
merged 1 commit into from
Jun 4, 2019
Merged

Fix tree spawn at scale #6715

merged 1 commit into from
Jun 4, 2019

Conversation

rhc54
Copy link
Contributor

@rhc54 rhc54 commented May 30, 2019

The debrujin component is using an algorithm that doesn't respect the
previously assigned parent ID. This causes the other components to have
their routing trees broken whenever debrujin updates routes. This
happens whenever more than 256 nodes are involved, thus breaking tree
spawn for sizes >= 256

Thanks to @zrss for the report and diagnosis!

Signed-off-by: Ralph Castain rhc@pmix.org

The debrujin component is using an algorithm that doesn't respect the
previously assigned parent ID. This causes the other components to have
their routing trees broken whenever debrujin updates routes. This
happens whenever more than 256 nodes are involved, thus breaking tree
spawn for sizes >= 256

Thanks to @zrss for the report and diagnosis!

Signed-off-by: Ralph Castain <rhc@pmix.org>
@rhc54 rhc54 added the bug label May 30, 2019
@rhc54 rhc54 added this to the v3.1.5 milestone May 30, 2019
@rhc54 rhc54 requested a review from jsquyres May 30, 2019 13:14
@rhc54 rhc54 self-assigned this May 30, 2019
@bwbarrett
Copy link
Member

Just to be clear, the solution is to kill off the component? Is there any downside to that solution?

@rhc54
Copy link
Contributor Author

rhc54 commented May 30, 2019

If someone provides an MCA parameter specifically requesting the debrujin component, then that run will fail. The "better" fix is to bring across the one from master (the commit involves more and so we wouldn't want to cherry-pick it) and revise the routed/base functions to only select a single routed component. I've offered that solution as an alternative to this one, but it involves more lines being changed. I have no personal preference.

@jladd-mlnx
Copy link
Member

See @jsquyres 's comment in #6643 - removing the component would break backwards compatibility guarantee.

@rhc54
Copy link
Contributor Author

rhc54 commented May 30, 2019

I understand that restriction - I'm offering the RMs two approaches to the problem. As I said, I couldn't care less between them.

@bwbarrett
Copy link
Member

@jladd-mlnx I think it's slightly more complex than that. We'll certainly have to have a discussion about the two solutions (neither are perfect, for very different reasons). I do think this is slightly different than the UCT change, in that as far as I know this component ends up not changing customer behavior in a significant way. The problem with the UCT change is that it did. For better or for worse, in 4.0, the UCT btl is being used in the one-sided path (if I understand the one-sided path on UCX properly).

My current gut feeling is that removing this component is probably the wrong answer, but perhaps not building by default or lowering priority would be the right answer. But I need to dig into details a bit more.

@rhc54
Copy link
Contributor Author

rhc54 commented May 30, 2019

@bwbarrett The only solutions are to either remove the debrujin component or to ensure it is the only active component whenever it does get built. It cannot be active when any other routed component is active or else things will break.

The solution on master is the "correct" one - make routed a single-select framework as it used to be. As we no longer support the rml/ofi component (and never did in any release branch), this results in no change to observable behavior. However, this changes more lines of code - hence my offer for this smaller change that has the one negative issue regarding backward compatibility.

I don't know how big an issue that is in this case. The only way that component can be used is if someone specifically requests it to be the only active component (otherwise, its priority is too low). Unfortunately, as currently coded, it will still mess up the routing table even if it isn't the component being used to route messages. I'm unaware of anyone specifically setting an MCA param to force debrujin usage (outside perhaps of Nathan), so I am offering it as an option.

If you'd prefer the "correct" change, I can throw that together later today. Not a big deal to do.

@bwbarrett
Copy link
Member

We talked about this change (and related PRs on the v3.1.x and v4.0.x branches) on the call today and decided that removing the component is the right path forward. While we try very hard to avoid removing components in the middle of a release series, in this case we are going to do so. The component never worked properly in this release series, because of the stateful component selection and the bug in component selection logic for this component, and because of the cost of a proper fix, this is the right choice.

@bwbarrett bwbarrett merged commit 4f4e3ef into open-mpi:v3.1.x Jun 4, 2019
@rhc54 rhc54 deleted the cmr31/routed branch November 27, 2019 19:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants