Open MPI MCA Param file not applied. #7737

Closed
abouteiller opened this issue May 14, 2020 · 17 comments

@abouteiller
Member

Changing MCA parameters through openmpi-mca-params.conf does not work (no effect). It shows in ompi_info, though.

Details of the problem

$installdir/etc/openmpi-mca-params.conf file changes are not applied to MPI processes.

File has been modified so that $installdir/bin/ompi_info reports such changes

           MCA pml base: ---------------------------------------------------
            MCA pml base: parameter "pml" (current value: "ob1", data source: file (/home/bouteill/ompi/ulfm/ulfm2/debug.build/etc/openmpi-mca-params.conf:62), level: 2 user/detail, type: string)
                          Default selection set of components for the pml framework (<none> means use all components that can be found)
            MCA pml base: ---------------------------------------------------
            MCA pml base: parameter "pml_base_verbose" (current value: "component", data source: file (/home/bouteill/ompi/ulfm/ulfm2/debug.build/etc/openmpi-mca-params.conf:63), level: 8 dev/detail, type: int)
                          Verbosity level for the pml framework (default: 0)
                          Valid values: -1:"none", 0:"error", 10:"component", 20:"warn", 40:"info", 60:"trace", 80:"debug", 100:"max", 0 - 100

Running $installdir/bin/mpirun -n 3 --omca pml_base_verbose=10 hello reports

[c00.cauchy:00657] select: initializing pml component ob1
[c00.cauchy:00657] select: init returned priority 20
[c00.cauchy:00657] select: initializing pml component ucx
[c00.cauchy:00654] select: init returned priority 51
[c00.cauchy:00654] selected ucx best priority 51
[c00.cauchy:00654] select: component ucx selected
[c00.cauchy:00654] select: component ob1 not selected / finalized

The PML ucx is selected, showing that pml=ob1 has been ignored.
Note that pml_base_verbose is passed on the command line here; set only in the file, it would have had no effect either.
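
To confirm where ompi_info thinks a value comes from, the `data source` field can be pulled out of its output. The following is a minimal sketch, assuming the line format matches the excerpt above; the helper name `mca_data_source` is hypothetical and not part of Open MPI. Note that it only shows what ompi_info read, which, as this report shows, is not proof that the launched MPI processes applied the same file.

```shell
#!/bin/sh
# Hypothetical helper: print the "data source" field for one MCA parameter,
# reading ompi_info-style text on stdin. Assumes the line format shown in
# this report:
#   parameter "<name>" (current value: ..., data source: <src>, level: ...
mca_data_source() {
    param=$1
    sed -n "s/.*parameter \"$param\" (.*data source: \(.*\), level: .*/\1/p"
}

# Example using the "pml" line copied from the report above.
sample='MCA pml base: parameter "pml" (current value: "ob1", data source: file (/home/bouteill/ompi/ulfm/ulfm2/debug.build/etc/openmpi-mca-params.conf:62), level: 2 user/detail, type: string)'
printf '%s\n' "$sample" | mca_data_source pml
```

In practice the function would be fed from a real ompi_info run (for example `$installdir/bin/ompi_info --all | mca_data_source pml`; the exact flags needed to list a given parameter are an assumption here and may vary by version).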


Configuration info

master #0dc23252

git clone

 git submodule status
 4a43c39c89037f52b4e25927e58caf08f3707c33 opal/mca/hwloc/hwloc2/hwloc (hwloc-2.1.0rc2-53-g4a43c39c)
 4996218a7a13a0ec59c94ffd72ead6d95d6acb4e opal/mca/pmix/pmix4x/openpmix (v1.1.3-2366-g4996218a)
 9cf321a38d36900702fd9c0862f633f4981b740d prrte (dev-30542-g9cf321a3)
  • Operating system/version: CentOS7
  • Computer hardware: Xeon Westmere
  • Network type: Infiniband 10G

@rhc54
Contributor

rhc54 commented May 16, 2020

Fixed by #7744

@rhc54 rhc54 closed this as completed May 16, 2020
@panda1100

Hi @rhc54, maybe this is a newbie question, but
I see this problem on the v4.1.x branch. Which branch was fixed by #7744?
https://github.com/open-mpi/ompi/blame/v4.1.x/opal/mca/base/mca_base_var.c

@rhc54
Contributor

rhc54 commented Apr 19, 2023

It went into what is now the main branch - had nothing to do with v4.1

@panda1100

panda1100 commented Apr 19, 2023

Thank you @rhc54. Will it be backported to v4.1?

The default OpenMPI on Rocky Linux 8 and 9 is v4.1, and I used the following workaround. It is not ideal, though, especially when we want to add or modify settings system-wide. How do you apply system-wide settings on RHEL 8/9 equivalent systems? I understand this is not the place for Q&A, though.

For new users:

mkdir -p /etc/skel/.openmpi
echo "btl_vader_single_copy_mechanism=none" >> /etc/skel/.openmpi/mca-params.conf

For an existing user:

mkdir -p ~/.openmpi
echo "btl_vader_single_copy_mechanism=none" >> ~/.openmpi/mca-params.conf
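
Because `echo ... >>` appends a new line on every run, repeating the workaround leaves duplicate entries in the file. A minimal sketch of an idempotent variant follows; the helper name `set_mca_param` is hypothetical, and the demo writes to a scratch directory rather than a real home directory.

```shell
#!/bin/sh
# Hypothetical helper: set "name=value" in an MCA params file exactly once.
# Unlike a bare `echo ... >>`, re-running it does not append duplicates.
set_mca_param() {
    file=$1
    name=$2
    value=$3
    mkdir -p "$(dirname "$file")"
    touch "$file"
    tmp=$(mktemp)
    # Drop any existing assignment of this parameter, then append the new one.
    # grep exits non-zero when it selects no lines, hence the "|| true".
    grep -v "^[[:space:]]*$name[[:space:]]*=" "$file" > "$tmp" || true
    printf '%s=%s\n' "$name" "$value" >> "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demo against a scratch directory standing in for $HOME.
demo_home=$(mktemp -d)
set_mca_param "$demo_home/.openmpi/mca-params.conf" btl_vader_single_copy_mechanism none
set_mca_param "$demo_home/.openmpi/mca-params.conf" btl_vader_single_copy_mechanism none
cat "$demo_home/.openmpi/mca-params.conf"
```

Calling it with `/etc/skel/.openmpi/mca-params.conf` or `~/.openmpi/mca-params.conf` as the first argument would cover the two cases above.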

@wckzhang
Contributor

It doesn't look like it can be cleanly backported, but if it's broken it probably should be fixed. I don't know if it will be, though, as the v4.1.x series has slowed down a lot. @jsquyres and @bwbarrett are the RMs for v4.1.x.

@panda1100

Thank you @wckzhang! It would be ideal if this were fixed upstream rather than us applying a patch ourselves.

@wckzhang
Contributor

@panda1100 Hi I want to clarify the behavior you're seeing before investigating this issue. What problem are you seeing? Are the user/system level mca params not being read from the files on the 4.1.x branch? What commit are you working off of?

@panda1100

Thank you @wckzhang !

We are on https://www.open-mpi.org/software/ompi/v4.1/downloads/openmpi-4.1.1.tar.bz2 , inside https://download.rockylinux.org/vault/rocky/9.0/AppStream/source/tree/Packages/o/openmpi-4.1.1-5.el9.src.rpm .

The problem I faced is that echo "btl_vader_single_copy_mechanism=none" >> /etc/openmpi-x86_64/openmpi-mca-params.conf is always ignored, whereas echo "btl_vader_single_copy_mechanism=none" >> ~/.openmpi/mca-params.conf works.
Related issue (closed): #4948

For background, I ran into this while using OpenMPI with Apptainer (an HPC-focused container solution formerly known as Singularity), like this: mpirun -np 60 openradioss.sif engine_linux64_gf_ompi -i CamryOpenRadioss_0001.rad. More details about how we use OpenMPI and Apptainer are here.

@rhc54
Contributor

rhc54 commented Apr 26, 2023

What was your configure cmd line? Suspect your system default parameter file location isn't where OMPI expects it

@panda1100

Thank you @rhc54, let me check.

@panda1100

panda1100 commented Apr 26, 2023

@rhc54 This is what we used to build OpenMPI.

Configure command options for Rocky Linux 9:
https://git.rockylinux.org/staging/rpms/openmpi/-/blob/r9/SPECS/openmpi.spec#L178-196

%build
%set_build_flags
./configure --prefix=%{_libdir}/%{name} \
	--mandir=%{_mandir}/%{namearch} \
	--includedir=%{_includedir}/%{namearch} \
	--sysconfdir=%{_sysconfdir}/%{namearch} \
	--disable-silent-rules \
	--enable-builtin-atomics \
	--enable-mpi-cxx \
	--enable-mpi-java \
	--enable-mpi1-compatibility \
	--with-sge \
	--with-valgrind \
	--enable-memchecker \
	--with-hwloc=/usr \
%if !0%{?el7}
	--with-libevent=external \
	--with-pmix=external \
%endif

and the actual values of %{_libdir} etc. are as follows:

Macro               Definition               Comment
%{_sysconfdir}      /etc
%{_prefix}          /usr                     can be defined to /app for flatpak builds
%{_exec_prefix}     %{_prefix}               default: /usr
%{_includedir}      %{_prefix}/include       default: /usr/include
%{_bindir}          %{_exec_prefix}/bin      default: /usr/bin
%{_libdir}          %{_exec_prefix}/%{_lib}  default: /usr/%{_lib}
%{_libexecdir}      %{_exec_prefix}/libexec  default: /usr/libexec
%{_sbindir}         %{_exec_prefix}/sbin     default: /usr/sbin
%{_datadir}         %{_datarootdir}          default: /usr/share
%{_infodir}         %{_datarootdir}/info     default: /usr/share/info
%{_mandir}          %{_datarootdir}/man      default: /usr/share/man
%{_docdir}          %{_datadir}/doc          default: /usr/share/doc
%{_rundir}          /run
%{_localstatedir}   /var
%{_sharedstatedir}  /var/lib
%{_lib}             lib64                    lib on 32bit platforms

https://docs.fedoraproject.org/en-US/packaging-guidelines/RPMMacros/

@panda1100

panda1100 commented Apr 26, 2023

@rhc54 https://github.com/open-mpi/ompi/pull/7744/files uses opal_install_dirs.sysconfdir, and we pass --sysconfdir=%{_sysconfdir}/%{namearch} to configure.

%{_sysconfdir} is /etc and

%{namearch} is openmpi-x86_64
https://git.rockylinux.org/staging/rpms/openmpi/-/blob/r9/SPECS/openmpi.spec#L149

%{_sysconfdir}/%{namearch} translates to /etc/openmpi-x86_64, so I believe /etc/openmpi-x86_64/openmpi-mca-params.conf should work.

@rhc54
Contributor

rhc54 commented Apr 26, 2023

I installed the HEAD of the v4.1.x branch, added an MCA param to the default param file, and it worked fine. I therefore expect that the problem lies in your use of those hieroglyphics to set the install directory for the default param file.

One way to check: an example openmpi-mca-params.conf should be installed in it. If that isn't present, then you have the wrong place.
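
The check described above can be scripted. This is a minimal sketch, assuming the default file keeps the name openmpi-mca-params.conf used throughout this thread; the helper name `has_default_params_file` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the default params file exists under a
# candidate sysconfdir, as a quick test that the directory is the right one.
has_default_params_file() {
    [ -f "$1/openmpi-mca-params.conf" ]
}

# Example: check the directory used by the Rocky Linux 9 package.
if has_default_params_file /etc/openmpi-x86_64; then
    echo "found"
else
    echo "not found"
fi
```

The directory the build actually uses can usually be queried with `ompi_info --path sysconfdir` (assumed from the Open MPI 4.x CLI; check `ompi_info --help` on the installed version).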

@panda1100
Copy link

Thank you @rhc54 for your support. Our package uses v4.1.1; I will test against other 4.1.x point releases and get back to you. Thank you again for your cooperation.

@panda1100

panda1100 commented Apr 27, 2023

@rhc54 I checked opal/mca/base/mca_base_var.c on the v4.1.x branch and on v4.1.1 through v4.1.5. They use asprintf() instead of opal_asprintf(), but apart from that difference the code structure is the same as before #7744 was merged. My apologies for bothering you, but could you please share how you configured 4.1.x and the full path of the openmpi-mca-params.conf you used for the test? I would like to replicate it on my end under the same conditions.

#if OPAL_WANT_HOME_CONFIG_FILES
    asprintf(&mca_base_var_files, "%s"OPAL_PATH_SEP".openmpi" OPAL_PATH_SEP
             "mca-params.conf%c%s" OPAL_PATH_SEP "openmpi-mca-params.conf",
             home, ',', opal_install_dirs.sysconfdir);
#else
    asprintf(&mca_base_var_files, "%s" OPAL_PATH_SEP "openmpi-mca-params.conf",
             opal_install_dirs.sysconfdir);
#endif
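
For a concrete view of what the quoted format string produces, here is a rough shell mirror of the OPAL_WANT_HOME_CONFIG_FILES branch. It assumes OPAL_PATH_SEP is "/" (as on Linux); the function name `mca_param_files` and the /home/user path are made up for illustration.

```shell
#!/bin/sh
# Mirror of the format string quoted above, assuming OPAL_PATH_SEP is "/".
# Prints the comma-separated list of parameter files for a given home
# directory and sysconfdir: the per-user file first, then the system file.
mca_param_files() {
    home=$1
    sysconfdir=$2
    printf '%s/.openmpi/mca-params.conf,%s/openmpi-mca-params.conf\n' "$home" "$sysconfdir"
}

# Rocky Linux 9 packaging: --sysconfdir=%{_sysconfdir}/%{namearch}
mca_param_files /home/user /etc/openmpi-x86_64
```

The system-wide entry depends entirely on the sysconfdir compiled into the build that is actually running, so two builds with different sysconfdir values look in different places.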

@rhc54
Contributor

rhc54 commented Apr 27, 2023

Nothing particularly special:

$ ./configure --prefix=<foo>
$ make install
--- edit <foo>/etc/openmpi-mca-params.conf ---
$ mpirun -n 1 ./hello

@panda1100

panda1100 commented Apr 28, 2023

@rhc54 @wckzhang
OpenMPI v4.1.1 does read $sysconfdir/openmpi-mca-params.conf.

The root cause in my environment is that the host OpenMPI and the OpenMPI inside the container were built with different prefixes (and hence different sysconfdir values). In that case, $HOME/.openmpi/mca-params.conf works because Apptainer bind-mounts the home directory automatically, but $sysconfdir/openmpi-mca-params.conf does not, because $sysconfdir differs between the host and container builds.
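
One possible way to bridge such a mismatch is to bind the host's system-wide file onto the path the container build reads. The sketch below only computes the `src:dest` bind specification. The helper name `params_bind_spec` and the container path /opt/ompi/etc are hypothetical, and whether the bind succeeds depends on the container and the Apptainer configuration.

```shell
#!/bin/sh
# Hypothetical helper: given the sysconfdir of the host build and of the
# container build, print a "src:dest" bind spec that exposes the host's
# system-wide params file at the path the container's Open MPI reads.
# Prints nothing when both builds already agree.
params_bind_spec() {
    host_dir=$1
    container_dir=$2
    if [ "$host_dir" != "$container_dir" ]; then
        printf '%s/openmpi-mca-params.conf:%s/openmpi-mca-params.conf\n' \
            "$host_dir" "$container_dir"
    fi
}

# /opt/ompi/etc is a made-up container sysconfdir for illustration.
params_bind_spec /etc/openmpi-x86_64 /opt/ompi/etc
```

The result could then be passed to Apptainer's `--bind` option, for example `apptainer exec --bind "$(params_bind_spec /etc/openmpi-x86_64 /opt/ompi/etc)" openradioss.sif ...`, with the two directories taken from each build's own sysconfdir.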

My apologies for the false alarm, and thank you both for your kind support.
