Problem with parallelization when running CP2K on quantum-mobile:20.11.2a #180

Open
sphuber opened this issue Mar 29, 2021 · 11 comments

@sphuber
Collaborator

sphuber commented Mar 29, 2021

Taken from the discussion thread of PR #160 👍

When running a CP2K relax workflow on the quantum-mobile:20.11.2a docker container on an Ubuntu host OS, there seems to be a problem with the parallelization. Many more processes are launched than intended and multiple processes start to write independently to the output file.

@yakutovicha, who ran on a macOS host, could not reproduce this.

Here is a screenshot from htop once I launch a single CP2K relax workchain with the fast protocol for silicon:

[Screenshot: htop output while the CP2K relax workchain is running]

It spawns 24 processes on my 12-core CPU and uses all cores to the max. Could there be a problem with the parallelization that causes it to run twice? In the output file, I see the following:


 SCF WAVEFUNCTION OPTIMIZATION

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 NoMix/Diag. 0.10E+00  171.5     1.08006898        -7.9431608334 -7.94E+00
     1 NoMix/Diag. 0.10E+00  174.8     1.08006898        -7.9431608334 -7.94E+00
     2 Broy./Diag. 0.10E+00  177.1     0.00168440        -7.9277225536  1.54E-02
     2 Broy./Diag. 0.10E+00  181.4     0.00168440        -7.9277225536  1.54E-02
     3 Broy./Diag. 0.10E+00  171.9     0.02871103        -7.7806865131  1.47E-01
     3 Broy./Diag. 0.10E+00  180.7     0.02871103        -7.7806865131  1.47E-01
     4 Broy./Diag. 0.10E+00  181.0     0.00045755        -7.8172268412 -3.65E-02
     4 Broy./Diag. 0.10E+00  178.3     0.00045755        -7.8172268412 -3.65E-02
     5 Broy./Diag. 0.10E+00  172.8     0.00370116        -7.8395344612 -2.23E-02
     5 Broy./Diag. 0.10E+00  175.5     0.00370116        -7.8395344612 -2.23E-02
     6 Broy./Diag. 0.10E+00  173.2     0.00035810        -7.8469207606 -7.39E-03
     6 Broy./Diag. 0.10E+00  177.5     0.00035810        -7.8469207606 -7.39E-03
     7 Broy./Diag. 0.10E+00  175.3     0.00234601        -7.8636002502 -1.67E-02
     7 Broy./Diag. 0.10E+00  168.5     0.00234601        -7.8636002502 -1.67E-02
     8 Broy./Diag. 0.10E+00  173.3     0.00004036        -7.8654690663 -1.87E-03
     8 Broy./Diag. 0.10E+00  172.1     0.00004036        -7.8654690663 -1.87E-03
     9 Broy./Diag. 0.10E+00  172.9     0.00009081        -7.8667661112 -1.30E-03

It does seem to double each step, or is that normal? Maybe this is all due to the submission script:

#SBATCH --no-requeue
#SBATCH --job-name="aiida-201"
#SBATCH --get-user-env
#SBATCH --output=_scheduler-stdout.txt
#SBATCH --error=_scheduler-stderr.txt
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --time=01:00:00


ulimit -s unlimited

'mpirun' '-np' '2' '/usr/local/bin/cp2k.ssmp' '-i' 'aiida.inp'  > 'aiida.out' 2>&1
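
A quick way to check an output file for this symptom is to count SCF iteration lines that immediately repeat the previous one. The following is a minimal sketch, not part of CP2K or AiiDA: the function name `count_duplicated_scf_lines` is hypothetical, and it assumes the column layout shown above (step number in column 1, an update method containing `Diag` in column 2, total energy in column 6).

```shell
#!/usr/bin/env bash
# Count SCF iteration lines that repeat the previous iteration line
# (same step number and same total energy). A nonzero count suggests that
# several independent processes wrote to the same output file.
# Assumption: column layout as in the excerpt above.
count_duplicated_scf_lines() {
  awk '
    # An SCF iteration line: integer step followed by the update method.
    $1 ~ /^[0-9]+$/ && $2 ~ /Diag/ {
      if ($1 == prev_step && $6 == prev_energy) dup++
      prev_step = $1; prev_energy = $6
      next
    }
    # Any other line ends the current run of iteration lines.
    { prev_step = ""; prev_energy = "" }
    END { print dup + 0 }
  ' "$1"
}
```

Called as `count_duplicated_scf_lines aiida.out`, it prints the number of repeated lines. Only adjacent repeats are counted, so the restart of step numbering in a later SCF cycle is not flagged.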
@giovannipizzi
Member

I can reproduce this on a Mac host (running in VirtualBox), with QM 20.11.2a.
Even just running mpirun -np 2 cp2k.ssmp prints twice the "usage" message.

Do I understand correctly that the ssmp version (I guess downloaded from the GitHub releases page, looking at the ansible role) is supposed to be used with OpenMP/multithreading, and not with MPI?
Maybe @dev-zero or @oschuett can confirm?

In this case, what is the suggested/simplest way to get a compiled binary of CP2K working with MPI on Quantum Mobile (Ubuntu)?
@chrisjsewell and @yakutovicha did you try to follow the compilation instructions to get CP2K running in QMobile already?

@dev-zero
Contributor

Yes, ssmp (or any cp2k.s*) is strictly non-MPI.
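
As an illustration of that rule, a launcher could decide from the executable name whether to wrap the call in `mpirun`. This is a minimal sketch under stated assumptions: the function names are hypothetical, and it relies on the CP2K naming convention in which `cp2k.p*` builds (such as `cp2k.popt` and `cp2k.psmp`) are MPI-enabled while `cp2k.s*` builds are not. It only prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Return 0 if the executable name indicates an MPI build, 1 otherwise.
# Assumption: the binary follows the cp2k.<suffix> naming convention.
cp2k_is_mpi_build() {
  case "$(basename "$1")" in
    cp2k.p*) return 0 ;;  # e.g. cp2k.popt, cp2k.psmp
    *)       return 1 ;;  # cp2k.s* (sopt, ssmp) and anything unrecognized
  esac
}

# Print the launch command for an executable and a requested rank count.
cp2k_launch_command() {
  local exe="$1" ranks="$2"
  if cp2k_is_mpi_build "$exe"; then
    printf 'mpirun -np %s %s\n' "$ranks" "$exe"
  else
    # Non-MPI build: start one process; parallelism comes from OpenMP threads.
    printf '%s\n' "$exe"
  fi
}
```

For `/usr/local/bin/cp2k.ssmp` the requested rank count is ignored, which avoids starting several independent copies of the same calculation.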

@dev-zero
Contributor

dev-zero commented Apr 13, 2021

You should also make sure that OMP_NUM_THREADS is set properly to avoid oversubscription (something like min(16, number_of_physical_cores // number_of_ranks_per_machine) may be a good start). Depending on your scheduler, it may be set by --ntasks-per-node, or some MPI runtimes may also set it automatically (depending on mapping options).
Also, is that really correct to use mpirun inside an sbatch script (rather than srun)?
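
The thread-count heuristic can be sketched in shell as follows. This is an illustration only: the function name `omp_threads_per_rank` is hypothetical, and the value 16 is read here as an upper bound on the threads per rank (an assumption about the intent), so that threads times ranks does not exceed the number of cores.

```shell
#!/usr/bin/env bash
# Threads per rank = cores / ranks (integer division), capped at an upper
# bound (default 16, an assumed reading of the heuristic) and never below 1.
omp_threads_per_rank() {
  local cores="$1" ranks="$2" cap="${3:-16}"
  local threads=$(( cores / ranks ))
  if [ "$threads" -gt "$cap" ]; then threads="$cap"; fi
  if [ "$threads" -lt 1 ]; then threads=1; fi
  printf '%s\n' "$threads"
}
```

In a job script this could be used as `export OMP_NUM_THREADS=$(omp_threads_per_rank 12 2)` for a 12-core machine with two ranks. The core count is passed explicitly because tools such as `nproc` report logical CPUs, which may differ from the number of physical cores.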

@chrisjsewell
Member

> Do I understand correctly that the ssmp version (I guess downloaded from the GitHub releases page, looking at the ansible role) is supposed to be used with OpenMP/multithreading, and not with MPI?

Yeah, basically this binary download has never worked on any Quantum Mobile, which is a little annoying to find out now (I had no part in writing it).
So basically it needs to be compiled from source each time. I asked @yakutovicha to look into this.

The other route is to use https://github.com/conda-forge/cp2k-feedstock, which we eventually want to look into using for all simulation codes. But I think this may be too difficult to implement at this time (and also v8.1.0 is not yet released, as there are still some outstanding issues for it).

@sphuber
Collaborator Author

sphuber commented Apr 13, 2021

> Also, is that really correct to use mpirun inside an sbatch script (rather than srun)?

I am not 100% sure that you have to use srun when running with SLURM. It is true that on QM we configure the localhost with SLURM but use mpirun. I just tried switching to srun -n {tot_num_mpiprocs} (which works on machines like Piz Daint, for example), but that now fails when running Quantum ESPRESSO. Using mpirun, however, works without problems: the calculation runs with two MPI processes, whereas with srun two independent executions are launched that both write to the same file. Do you know when srun can or should be used, and when not, @dev-zero?

I will try to look around a bit in the SLURM documentation to see if I can find anything.

@dev-zero
Contributor

@sphuber not really, sorry. My guess is that you have to configure slurm and that srun becomes a wrapper around mpirun (or whatever command is needed to run mpi on a system). What srun usually does is it takes env vars injected into the env by sbatch and forwards them to the MPI environment (and does some more mapping, etc.). So, in this simple setting it might just be the right thing to use mpirun and the only thing srun would allow is to avoid the explicit -np parameter for mpirun.
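
To illustrate the point about environment variables, a job script that keeps using `mpirun` could still derive the rank count from what `sbatch` exports instead of hard-coding `-np`. This is a minimal sketch: the function name `slurm_rank_count` is hypothetical, and which of `SLURM_NTASKS`, `SLURM_NTASKS_PER_NODE` and `SLURM_JOB_NUM_NODES` are actually set depends on the options given to `sbatch` and on the Slurm configuration, so the fallbacks below are an assumption.

```shell
#!/usr/bin/env bash
# Derive the number of MPI ranks from variables that sbatch may export.
# Falls back to 1 when none of them is set (e.g. outside a Slurm job).
slurm_rank_count() {
  if [ -n "${SLURM_NTASKS:-}" ]; then
    printf '%s\n' "$SLURM_NTASKS"
  elif [ -n "${SLURM_NTASKS_PER_NODE:-}" ]; then
    # Tasks per node times number of nodes (assumed 1 if not exported).
    printf '%s\n' "$(( SLURM_NTASKS_PER_NODE * ${SLURM_JOB_NUM_NODES:-1} ))"
  else
    printf '1\n'
  fi
}
```

The result could then be passed on as `mpirun -np "$(slurm_rank_count)" ...`, which keeps the rank count in one place, namely the `#SBATCH` header.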

@oschuett

oschuett commented Apr 13, 2021

The installation of CP2K is indeed a major pain point. We now have 30+ dependencies and keep adding more.

The binary we provide with the releases is hand-rolled, statically linked, and stripped-down, e.g. without MPI.

While CP2K is included in Debian and Fedora, those distributions have long release cycles. Hence, I believe the way to go is indeed package managers like Conda or Spack. Unfortunately, maintaining those packages is a lot of work.

@dev-zero
Contributor

Probably stating the obvious, but the quickfix for now would be to limit the number of ranks to 1.

@sphuber
Collaborator Author

sphuber commented Apr 13, 2021

That is something we are considering adding to the input generators of the common workflow project, for which this problem is most critical now. But we cannot enforce this at the plugin level, which means that CP2K is broken on QM for any other calculation where the user selects more than 1 rank. So we will have to find a solution at some point if we want CP2K to run reliably on QM.

@ltalirz
Member

ltalirz commented Oct 10, 2022

Just mentioning for anyone needing to run cp2k on the quantum mobile:

The following for me uses 1 process and 12 threads on the quantum mobile 21.05.1 docker container under ubuntu:

aiida-common-workflows launch eos cp2k -S Fe -p precise -s collinear --codes cp2k-7.1@localhost  --magnetization-per-site -4 4 --daemon -n 1

The calculation seems to be running fine (except for being rather slow of course ;-) ).

@yakutovicha
Contributor

By the way, what is the status of this? I believe the issue could be solved by installing CP2K from conda-forge.
