v4.1.x: osc/ucx fix osc ucx component priority selection #10448
Addresses: #10433
main commit equivalent: c2e6cd9
Signed-off-by: Tomislav Janjusic <tomislavj@nvidia.com>
Co-authored-by: Mamzi Bayatpour <mbayatpour@nvidia.com>
bot:notacherrypick
fdd11bd to 56307af
As mentioned in #10433, osc/ucx is still selected on our OPA nodes with this patch. I ran a two-process MPI job with
It looks like the patch does what it is supposed to do: setting the priority to 19. Using I'm not sure if this is the right approach though. As shown in #10433, this might be an issue in OpenMPI itself and not UCX, as it seems to be fixed in OpenMPI 5. Is disabling osc/ucx on OPA hardware still the best choice? Also, this patch makes it impossible to override the priority, as it is now hardcoded to 19 on systems with a matching transport but without matching devices.
@michaellass thanks for checking and for the git bisect. I can certainly lower the priority even further, to 9.
That sounds good. Is there a way to print out all considered osc mechanisms along with their priorities? To determine 10 as a threshold I just tried out all values to see if the problem is still reproducible, but it may be interesting to see which mechanism is actually chosen and why.
what is shown on your system if you do an |
Regarding osc it shows:
This is with the patched version, so Full output
…found, don't override user priority setting
main commit equivalent: db68824
Signed-off-by: Tomislav Janjusic <tomislavj@nvidia.com>
bot:notacherrypick
@michaellass Do you mind trying the updated patch? The priority is now set to 9, and it no longer overrides the user setting. Hopefully this will work for you.
Yes, this version avoids the use of osc/ucx on our system and therefore fixes the issue. I also tried to reintroduce the issue by setting |
Excellent! Thanks
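For readers following the thread, the behavior of the updated patch as described above (lower the priority to 9 when a matching transport is found but no matching device, unless the user set the priority explicitly) could be sketched roughly as follows. This is only an illustration of that rule, not the code from the patch: the function name `effective_priority`, its parameters, and the macro `OSC_UCX_FALLBACK_PRIORITY` are hypothetical and do not come from the Open MPI source tree.

```c
#include <stdbool.h>

/* Hypothetical constant: the lowered priority mentioned in the thread. */
#define OSC_UCX_FALLBACK_PRIORITY 9

/*
 * Hypothetical helper, not an Open MPI function.
 *
 * configured_priority: the priority currently configured for the component
 * user_set_priority:   true if the user set the priority explicitly
 * transport_matched:   true if a matching transport was found
 * device_matched:      true if a matching device was found
 *
 * Returns the priority the component would advertise under the rule
 * described in the discussion.
 */
static int effective_priority(int configured_priority,
                              bool user_set_priority,
                              bool transport_matched,
                              bool device_matched)
{
    /* Lower the priority only when the transport matches but no device
     * does, and only if the user has not chosen a priority themselves. */
    if (transport_matched && !device_matched && !user_set_priority) {
        return OSC_UCX_FALLBACK_PRIORITY;
    }

    /* Otherwise keep whatever is configured, including a user override. */
    return configured_priority;
}
```

With an arbitrary configured value of 60, `effective_priority(60, false, true, false)` yields 9, while `effective_priority(60, true, true, false)` keeps 60. That difference is what distinguishes the updated patch from the first version, which hardcoded the lowered value regardless of the user setting.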