-
Notifications
You must be signed in to change notification settings - Fork 871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix tree spawn at scale #6715
Fix tree spawn at scale #6715
Conversation
The debrujin component is using an algorithm that doesn't respect the previously assigned parent ID. This causes the other components to have their routing trees broken whenever debrujin updates routes. This happens whenever more than 256 nodes are involved, thus breaking tree spawn for sizes >= 256 Thanks to @zrss for the report and diagnosis! Signed-off-by: Ralph Castain <rhc@pmix.org>
Just to be clear, the solution is to kill off the component? Is there any downside to that solution? |
If someone provides an MCA parameter specifically requesting the debrujin component, then that run will fail. The "better" fix is to bring across the one from master (the commit involves more and so we wouldn't want to cherry-pick it) and revise the routed/base functions to only select a single routed component. I've offered that solution as an alternative to this one, but it involves more lines being changed. I have no personal preference. |
I understand that restriction - I'm offering the RMs two approaches to the problem. As I said, I couldn't care less between them. |
@jladd-mlnx I think it's slightly more complex than that. We'll certainly have to have a discussion about the two solutions (neither are perfect, for very different reasons). I do think this is slightly different than the UCT change, in that as far as I know this component ends up not changing customer behavior in a significant way. The problem with the UCT change is that it did. For better or for worse, in 4.0, the UCT btl is being used in the one-sided path (if I understand the one-sided path on UCX properly). My current gut feeling is that removing this component is probably the wrong answer, but perhaps not building by default or lowering priority would be the right answer. But I need to dig into details a bit more. |
@bwbarrett The only solutions are to either remove the debrujin component or to ensure it is the only active component whenever it does get built. It cannot be active when any other routed component is active or else things will break. The solution on master is the "correct" one - make routed a single-select framework as it used to be. As we no longer support the rml/ofi component (and never did in any release branch), this results in no change to observable behavior. However, this changes more lines of code - hence my offer for this smaller change that has the one negative issue regarding backward compatibility. I don't know how big an issue that is in this case. The only way that component can be used is if someone specifically requests it to be the only active component (otherwise, its priority is too low). Unfortunately, as currently coded, it will still mess up the routing table even if it isn't the component being used to route messages. I'm unaware of anyone specifically setting an MCA param to force debrujin usage (outside perhaps of Nathan), so I am offering it as an option. If you'd prefer the "correct" change, I can throw that together later today. Not a big deal to do. |
We talked about this change (and related PRs on the v3.1.x and v4.0.x branches) on the call today and decided that removing the component is the right path forward. While we try very hard to avoid removing components in the middle of a release series, in this case we are going to do so. The component never worked properly in this release series, because of the stateful component selection and the bug in component selection logic for this component, and because of the cost of a proper fix, this is the right choice. |
The debrujin component is using an algorithm that doesn't respect the
previously assigned parent ID. This causes the other components to have
their routing trees broken whenever debrujin updates routes. This
happens whenever more than 256 nodes are involved, thus breaking tree
spawn for sizes >= 256
Thanks to @zrss for the report and diagnosis!
Signed-off-by: Ralph Castain rhc@pmix.org