Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v4.1.x: osc/ucx fix osc ucx component priority selection #10448

Merged
merged 2 commits into from
Jul 18, 2022

Conversation

janjust
Copy link
Contributor

@janjust janjust commented Jun 3, 2022

main commit equivalent: c2e6cd9

Signed-off-by: Tomislav Janjusic tomislavj@nvidia.com
Co-authored-by: Mamzi Bayatpour mbayatpour@nvidia.com

bot:notacherrypick

@janjust janjust requested a review from yosefe June 3, 2022 12:51
@janjust
Copy link
Contributor Author

janjust commented Jun 3, 2022

Addresses: #10433

@jsquyres
Copy link
Member

jsquyres commented Jun 5, 2022

osc_ucx_component.c: In function 'component_init':
osc_ucx_component.c:192:72: error: 'ompi_osc_ucx_component_t' has no member named 'wpool'
     support_level = opal_common_ucx_support_level(mca_osc_ucx_component.wpool->ucp_ctx);
                                                                        ^
osc_ucx_component.c:194:42: error: 'ompi_osc_ucx_component_t' has no member named 'ucp_ctx'
         ucp_cleanup(mca_osc_ucx_component.ucp_ctx);
                                          ^
osc_ucx_component.c:195:30: error: 'ompi_osc_ucx_component_t' has no member named 'ucp_ctx'
         mca_osc_ucx_component.ucp_ctx = NULL;
                              ^
make[2]: *** [osc_ucx_component.lo] Error 1

@jsquyres jsquyres added this to the v4.1.5 milestone Jun 5, 2022
@janjust
Copy link
Contributor Author

janjust commented Jun 5, 2022

osc_ucx_component.c: In function 'component_init':
osc_ucx_component.c:192:72: error: 'ompi_osc_ucx_component_t' has no member named 'wpool'
support_level = opal_common_ucx_support_level(mca_osc_ucx_component.wpool->ucp_ctx);
^
osc_ucx_component.c:194:42: error: 'ompi_osc_ucx_component_t' has no member named 'ucp_ctx'
ucp_cleanup(mca_osc_ucx_component.ucp_ctx);
^
osc_ucx_component.c:195:30: error: 'ompi_osc_ucx_component_t' has no member named 'ucp_ctx'
mca_osc_ucx_component.ucp_ctx = NULL;
^
make[2]: *** [osc_ucx_component.lo] Error 1
Oh darn, forgot to update my PR, I’ll fix it, thanks!

main commit equivalent: c2e6cd9

Signed-off-by: Tomislav Janjusic <tomislavj@nvidia.com>
Co-authored-by: Mamzi Bayatpour <mbayatpour@nvidia.com>

bot:notacherrypick
@michaellass
Copy link
Contributor

As mentioned in #10433, osc/ucx is still selected on our OPA nodes with this patch. I ran a two-process MPI job with OMPI_MCA_osc_ucx_verbose=999 and it printed the following:

[cn-0256:1409705] common_ucx.c:174 using OPAL memory hooks as external events
[cn-0256:1409704] common_ucx.c:174 using OPAL memory hooks as external events
[cn-0256:1409705] common_ucx.c:332 posix/memory: did not match transport list
[cn-0256:1409705] common_ucx.c:332 sysv/memory: did not match transport list
[cn-0256:1409705] common_ucx.c:332 self/memory0: did not match transport list
[cn-0256:1409705] common_ucx.c:332 tcp/ib0: did not match transport list
[cn-0256:1409705] common_ucx.c:332 tcp/eno1: did not match transport list
[cn-0256:1409705] common_ucx.c:332 tcp/lo: did not match transport list
[cn-0256:1409705] common_ucx.c:327 rc_verbs/hfi1_0:1: matched transport list but not device list
[cn-0256:1409705] common_ucx.c:327 ud_verbs/hfi1_0:1: matched transport list but not device list
[cn-0256:1409705] common_ucx.c:332 cma/memory: did not match transport list
[cn-0256:1409705] common_ucx.c:337 support level is transports only
[cn-0256:1409705] osc_ucx_component.c:205 returning priority 19
[cn-0256:1409704] common_ucx.c:332 posix/memory: did not match transport list
[cn-0256:1409704] common_ucx.c:332 sysv/memory: did not match transport list
[cn-0256:1409704] common_ucx.c:332 self/memory0: did not match transport list
[cn-0256:1409704] common_ucx.c:332 tcp/ib0: did not match transport list
[cn-0256:1409704] common_ucx.c:332 tcp/eno1: did not match transport list
[cn-0256:1409704] common_ucx.c:332 tcp/lo: did not match transport list
[cn-0256:1409704] common_ucx.c:327 rc_verbs/hfi1_0:1: matched transport list but not device list
[cn-0256:1409704] common_ucx.c:327 ud_verbs/hfi1_0:1: matched transport list but not device list
[cn-0256:1409704] common_ucx.c:332 cma/memory: did not match transport list
[cn-0256:1409704] common_ucx.c:337 support level is transports only
[cn-0256:1409704] osc_ucx_component.c:205 returning priority 19

It looks like the patch does what it is supposed to do: setting the priority to 19.

Using OMPI_MCA_osc_ucx_priority, I tried how far the priority needs to be reduced to make OpenMPI choose a different osc mechanism on our systems: It is 10.

I'm not sure if this is the right approach though. As shown in #10433, this might be an issue in OpenMPI itself and not UCX as it iseems to be fixed in OpenMPI 5. Is disabling osc/ucx on OPA hardware still the best choice? Also, this patch makes it impossible to override the priority as it is hardcoded to 19 now on systems with matching transport but without matching devices.

@janjust
Copy link
Contributor Author

janjust commented Jun 13, 2022

@michaellass thanks for checking and the git bisect. I can certainly lower the priority even further to 9.
As far as what caused the "fix" in v5.0, the change to wpool is a substaintial change from v4.1 which specifically address multi-threaded performance. But I'm not sure that fixed it, it could just be a side effect. Moreover, there are several bugs in that specific version of wpool commit.
Also, I'll see about finding a workaround which doesn't hardwire priority so that the user can overwrite it.

@michaellass
Copy link
Contributor

That sounds good. Is there a way to print out all considered osc mechanisms along with their priority? To determine 10 as a threshold I just tried out all values to see if the problem is still reproducable but it may be interesting to see which mechanism is actually chosen and why.

@janjust
Copy link
Contributor Author

janjust commented Jun 13, 2022

what is shown on your system if you do an ompi_info -a | grep priority

@michaellass
Copy link
Contributor

what is shown on your system if you do an ompi_info -a | grep priority

Regarding osc it shows:

             MCA osc ucx: parameter "osc_ucx_priority" (current value: "60", data source: default, level: 3 user/all, type: unsigned_int)
            MCA osc rdma: parameter "osc_rdma_priority" (current value: "101", data source: default, level: 3 user/all, type: unsigned_int)

This is with the patched version, so ompi_info seems not to take these runtime adjustments into account.

Full output
                          How often to poll high priority CQ versus low priority CQ
      MCA (-) btl openib: parameter "btl_openib_connect_rdmacm_priority" (current value: "30", data source: default, level: 9 dev/all, type: int)
                          The selection method priority for rdma_cm
      MCA (-) btl openib: parameter "btl_openib_connect_udcm_priority" (current value: "63", data source: default, level: 9 dev/all, type: int)
                          Maximum priority send descriptors to post (-1 = pre-set defaults; depends on number and type of devices available)
                          Number of pre-posted priority receive buffers (-1 = pre-set defaults; depends on number and type of devices available)
           MCA btl usnic: parameter "btl_usnic_priority_limit" (current value: "0", data source: default, level: 5 tuner/detail, type: int)
                          Max size of "priority" messages (0 = use pre-set defaults; depends on number and type of devices available)
       MCA compress gzip: parameter "compress_gzip_priority" (current value: "15", data source: default, level: 9 dev/all, type: int)
       MCA compress bzip: parameter "compress_bzip_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
            MCA crs none: informational "crs_none_priority" (current value: "1", data source: default, level: 3 user/all, type: int)
      MCA memory patcher: parameter "memory_patcher_priority" (current value: "80", data source: default, level: 5 tuner/detail, type: int)
      MCA mpool hugepage: parameter "mpool_hugepage_priority" (current value: "50", data source: default, level: 9 dev/all, type: int)
                          Default priority of the hugepage mpool component (default: 50)
   MCA patcher overwrite: parameter "patcher_overwrite_priority" (current value: "37", data source: default, level: 5 tuner/detail, type: int)
           MCA pmix flux: parameter "pmix_flux_priority" (current value: "20", data source: default, level: 9 dev/all, type: int)
          MCA shmem sysv: parameter "shmem_sysv_priority" (current value: "30", data source: default, level: 3 user/all, type: int)
          MCA shmem mmap: parameter "shmem_mmap_priority" (current value: "50", data source: default, level: 3 user/all, type: int)
         MCA shmem posix: parameter "shmem_posix_priority" (current value: "40", data source: default, level: 3 user/all, type: int)
  MCA errmgr default_app: parameter "errmgr_default_app_priority" (current value: "1000", data source: default, level: 9 dev/all, type: int)
 MCA errmgr default_tool: parameter "errmgr_default_tool_priority" (current value: "1000", data source: default, level: 9 dev/all, type: int)
MCA errmgr default_orted: parameter "errmgr_default_orted_priority" (current value: "1000", data source: default, level: 9 dev/all, type: int)
  MCA errmgr default_hnp: parameter "errmgr_default_hnp_priority" (current value: "1000", data source: default, level: 9 dev/all, type: int)
      MCA grpcomm direct: parameter "grpcomm_direct_priority" (current value: "85", data source: default, level: 9 dev/all, type: int)
             MCA plm rsh: parameter "plm_rsh_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
           MCA rmaps ppr: parameter "rmaps_ppr_priority" (current value: "90", data source: default, level: 9 dev/all, type: int)
       MCA rmaps mindist: parameter "rmaps_mindist_priority" (current value: "20", data source: default, level: 9 dev/all, type: int)
     MCA rmaps resilient: parameter "rmaps_resilient_priority" (current value: "40", data source: default, level: 9 dev/all, type: int)
   MCA rmaps round_robin: parameter "rmaps_round_robin_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
           MCA rmaps seq: parameter "rmaps_seq_priority" (current value: "60", data source: default, level: 9 dev/all, type: int)
     MCA rmaps rank_file: parameter "rmaps_rank_file_priority" (current value: "0", data source: default, level: 9 dev/all, type: int)
           MCA rtc hwloc: parameter "rtc_hwloc_priority" (current value: "70", data source: default, level: 9 dev/all, type: int)
         MCA coll libnbc: parameter "coll_libnbc_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
           MCA coll sync: parameter "coll_sync_priority" (current value: "50", data source: default, level: 9 dev/all, type: int)
             MCA coll sm: parameter "coll_sm_priority" (current value: "0", data source: default, level: 9 dev/all, type: int)
          MCA coll tuned: parameter "coll_tuned_priority" (current value: "30", data source: default, level: 6 tuner/all, type: int)
          MCA coll inter: parameter "coll_inter_priority" (current value: "40", data source: default, level: 9 dev/all, type: int)
          MCA coll basic: parameter "coll_basic_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
          MCA coll adapt: parameter "coll_adapt_priority" (current value: "0", data source: default, level: 9 dev/all, type: int)
           MCA coll self: parameter "coll_self_priority" (current value: "75", data source: default, level: 9 dev/all, type: int)
            MCA coll han: parameter "coll_han_priority" (current value: "0", data source: default, level: 9 dev/all, type: int)
          MCA fbtl posix: parameter "fbtl_posix_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
       MCA fcoll dynamic: parameter "fcoll_dynamic_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
    MCA fcoll individual: parameter "fcoll_individual_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
  MCA fcoll dynamic_gen2: parameter "fcoll_dynamic_gen2_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
        MCA fcoll vulcan: parameter "fcoll_vulcan_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
     MCA fcoll two_phase: parameter "fcoll_two_phase_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
              MCA fs ufs: parameter "fs_ufs_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
           MCA fs lustre: parameter "fs_lustre_priority" (current value: "20", data source: default, level: 9 dev/all, type: int)
         MCA io romio321: parameter "io_romio321_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
         MCA io romio321: parameter "io_romio321_delete_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
                          Delete priority of the io romio component
            MCA io ompio: parameter "io_ompio_priority" (current value: "30", data source: default, level: 9 dev/all, type: int)
            MCA io ompio: parameter "io_ompio_delete_priority" (current value: "30", data source: default, level: 9 dev/all, type: int)
                          Delete priority of the io ompio component
            MCA mtl psm2: parameter "mtl_psm2_priority" (current value: "40", data source: default, level: 9 dev/all, type: int)
             MCA mtl ofi: parameter "mtl_ofi_priority" (current value: "25", data source: default, level: 9 dev/all, type: int)
             MCA osc ucx: parameter "osc_ucx_priority" (current value: "60", data source: default, level: 3 user/all, type: unsigned_int)
                          List of device driver pattern names, which, if supported by UCX, will bump its priority above ob1. Special values: any (any available)
            MCA osc rdma: parameter "osc_rdma_priority" (current value: "101", data source: default, level: 3 user/all, type: unsigned_int)
                          Comma-delimited list of MTL component names to lower the priority of rdma osc component favoring pt2pt osc (default: psm2)
             MCA pml ob1: parameter "pml_ob1_priority" (current value: "20", data source: default, level: 9 dev/all, type: int)
             MCA pml ucx: parameter "pml_ucx_priority" (current value: "51", data source: default, level: 3 user/all, type: int)
                          List of device driver pattern names, which, if supported by UCX, will bump its priority above ob1. Special values: any (any available)
 MCA sharedfp individual: parameter "sharedfp_individual_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
 MCA sharedfp lockedfile: parameter "sharedfp_lockedfile_priority" (current value: "10", data source: default, level: 9 dev/all, type: int)
         MCA sharedfp sm: parameter "sharedfp_sm_priority" (current value: "30", data source: default, level: 9 dev/all, type: int)
 MCA vprotocol pessimist: parameter "vprotocol_pessimist_priority" (current value: "30", data source: default, level: 9 dev/all, type: int)

…found, don't override user priority setting

main commit equivalent: db68824

Signed-off-by: Tomislav Janjusic <tomislavj@nvidia.com>

bot:notacherrypick
@janjust janjust requested review from devreal and removed request for yosefe June 23, 2022 21:53
@janjust
Copy link
Contributor Author

janjust commented Jun 23, 2022

@michaellass Do you mind trying the updated patch, priority is set to 9, and it doesn't override user setting, hopefully this will work for you

@michaellass
Copy link
Contributor

@michaellass Do you mind trying the updated patch, priority is set to 9, and it doesn't override user setting, hopefully this will work for you

Yes, this version avoids the use of osc/ucx on our system and therefore fixes the issue. I also tried to reintroduce the issue by setting OMPI_MCA_osc_ucx_priority=11 and this worked as well, so the manual override of the priority also works again.

@janjust
Copy link
Contributor Author

janjust commented Jun 26, 2022

Excellent! Thanks

@bwbarrett bwbarrett merged commit bbcab84 into open-mpi:v4.1.x Jul 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants