MPI crashed in Q6 intel - Rackham cluster #2

Open
klaudia-dais opened this issue Oct 25, 2017 · 20 comments

@klaudia-dais

I compiled Q on Rackham (intel), and with MPI the build shows the error "unknown option -Nmpi", but it still finishes. When I submit a job, it fails immediately with this kind of error:

[r101:2335] *** An error occurred in MPI_Allreduce
[r101:2335] *** reported by process [1808072705,0]
[r101:2335] *** on communicator MPI_COMM_WORLD
[r101:2335] *** MPI_ERR_OP: invalid reduce operation
[r101:2335] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[r101:2335] ***    and potentially your MPI job)

I tried different versions of intel, with both intelmpi and openmpi. Every time it crashes with a similar error. When I run the same job on a different cluster, and locally with the serial Qdyn6, it works without problems.

Any idea how to solve it?

@acmnpv commented Oct 25, 2017

Thanks for moving this to the open issue tracker.
Please attach the workflow for compiling and the minimal run script (you can ask Miha how to reduce it to the basic parts w/o all the extra bits I added)

acmnpv self-assigned this Oct 25, 2017
acmnpv added the bug label Oct 25, 2017
@klaudia-dais

My procedure:

git clone https://www.github.com/qusers/Q6.git
cd Q6/src/
module purge
module load intel/17.4
module load openmpi/2.1.1
make all COMP=ifort
make mpi COMP=ifort
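
A quick way to check which compiler the MPI wrapper actually invokes (a diagnostic sketch, assuming the openmpi module provides the mpifort wrapper; --showme is OpenMPI's option for printing the underlying compile line):

# With the intel and openmpi modules loaded, this should print an
# ifort command line; anything else means a mismatched wrapper:
mpifort --showme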

Run script (just the important part):
#!/bin/bash -l
#SBATCH -J Node1_OH
#SBATCH -n 4
#SBATCH -t 00:10:00
#SBATCH -A p2011165

module purge
module load intel/18.0 intelmpi/18.0

mpirun -np 4 /home/klaudia/Q/Q6/bin/Qdyn6p relax.inp > relax.log
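
Note that the modules loaded here (intel/18.0 + intelmpi/18.0) are not the ones the binary was compiled with (intel/17.4 + openmpi/2.1.1); an invalid-handle error such as MPI_ERR_OP from MPI_Allreduce can be a symptom of running a binary against a different MPI library than the one it was built with. One way to check what Qdyn6p actually links (a diagnostic sketch using standard tools):

# List the shared libraries the binary needs and pick out the MPI ones;
# the libmpi that shows up should match the MPI module loaded at run time:
ldd /home/klaudia/Q/Q6/bin/Qdyn6p | grep -i mpi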

@acmnpv commented Oct 25, 2017

Thanks, please also upload the input files needed for relax.inp

qusers deleted a comment from klaudia-dais Oct 25, 2017
@acmnpv commented Oct 25, 2017

Sorry, I meant that you should attach an archive with all the files (input, topology, and the FEP file if needed)

@klaudia-dais

run.tar.gz

@acmnpv commented Oct 25, 2017

Perfect, thank you!

@acmnpv commented Oct 25, 2017

Please try to build Q6 with the modules intel/17.4 intelmpi/17.4.
I saw a large number of compile warnings with intel and openmpi, so they might not be compatible.
When running your job, use srun -n $THISCORES inside your sbatch file.
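
Put together with the script above, the suggested job script would look something like this (a sketch; the explicit core count stands in for the hypothetical $THISCORES):

#!/bin/bash -l
#SBATCH -J Node1_OH
#SBATCH -n 4
#SBATCH -t 00:10:00
#SBATCH -A p2011165

module purge
# Load the same modules that were used at compile time:
module load intel/17.4 intelmpi/17.4

# srun instead of mpirun; -n matches the SBATCH core count above:
srun -n 4 /home/klaudia/Q/Q6/bin/Qdyn6p relax.inp > relax.log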

@esguerra

Hej Klaudia and Paul,

Also, you should comment out the -Nmpi flag in the makefile if compiling with intel only; that option doesn't exist for intelmpi.

You can try compiling just with intel like so:

module load intelmpi/18.0
module load intel/18.0
make mpi COMP=ifort
make all COMP=ifort

Using the attached makefile. Before using it, make sure to do:

mv makefile.txt makefile

For some reason GitHub doesn't allow uploads of extensionless files, which is why I uploaded it with the .txt extension.

makefile.txt
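
If you would rather patch the makefile you already have, stripping the flag is a one-liner (a sketch; the exact line the flag sits on may differ, and sed saves a backup):

# Find where the -Nmpi flag appears:
grep -n -- '-Nmpi' makefile
# Strip it in place, keeping the original as makefile.bak:
sed -i.bak 's/-Nmpi//g' makefile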

The compilation takes a looooooooong time, no idea why.
Any idea why the compilation is so slow, Paul?

I will try to run your files on Rackham and see what's going on too.

Cheers,

M.

@acmnpv commented Oct 25, 2017

I ran a test already with two cores and srun and it worked fine.
The compilation is slow with intel because the function inliner seems to go insane somewhere, but the inlining is needed for performance.
I need to upload a patch for the makefile; the flag doesn't do any harm, but the warning is confusing, I agree.

@acmnpv commented Oct 26, 2017

If there are no more issues now, I would close this one again.
Otherwise we could keep it open as a reminder that we need to fix the intel/openmpi combination.

Cheers

Paul

@acmnpv commented Oct 26, 2017

Also, Mauricio, can you make a quick pull request for the makefile (or push it yourself)?
So we can at least get rid of the annoying warnings. :D

@esguerra

Hej,

So, I am missing something from Rackham.
Klaudia's example only works with srun.
Any clue as to the reason Paul?

If I don't use srun, nasty MPI messages appear, but with srun all seems fine and dandy.

[screenshot: MPI error messages, 2017-10-26 at 11:25 AM]

@esguerra

I can send a pull request with the makefile, but first accepting your

@esguerra

Oops, I don't know how I managed to close this.
I haven't been able to solve the MPI issues on Rackham with intel.
Has Klaudia managed to solve them?

esguerra reopened this Oct 30, 2017
@acmnpv commented Oct 31, 2017

Fine with me. I did not have more time to look into this, but it reliably crashed ddt during the mpi_init part. No idea what the heck is going on; it might be an issue with the MPI set-up on Rackham.

@acmnpv commented Feb 6, 2018

Mauricio, did you have some more luck in testing this?

@esguerra commented Feb 6, 2018

Hej,
Good reminder.
Last time I tried, using srun had done the trick, which is very odd, since sbatch and srun are both talking to slurm in the same way, AFAIK.
I will give it another try with the latest version and write back what I see.

@esguerra commented Feb 6, 2018

For some reason, on the Rackham cluster they have aliased mpirun to echo this:

alias mpirun='echo Please use srun'
/usr/bin/echo

They say that this is needed when using intel-compiled programs, that is, you should use srun instead of mpirun.
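
You can see how the shell resolves mpirun yourself (a quick check; type is a bash builtin, and type -P bypasses aliases to search PATH):

# Reports the alias if one is set:
type mpirun
# Shows the actual executable on PATH, ignoring the alias:
type -P mpirun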

So, compiling with:

module load intel/18.1
module load intelmpi/18.1
make all COMP=ifort

Produces a binary which works when invoked with:

srun -n 8 Qdyn6p eq1.inp > eq1.log
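
So the full working recipe on Rackham would be something like this (a sketch combining the steps above; make mpi is carried over from the original build procedure, since Qdyn6p is the MPI binary):

# Build with a matched Intel toolchain:
module purge
module load intel/18.1
module load intelmpi/18.1
make all COMP=ifort
make mpi COMP=ifort

# Run under slurm with srun instead of mpirun:
srun -n 8 Qdyn6p eq1.inp > eq1.log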

Paul, I guess you can close this if @klaudia-dais also sees her jobs running when Q is compiled and run in the suggested way.

@acmnpv commented Feb 7, 2018

I haven't heard anything back, but I think we may want to keep it open until we add something about this to the README?

@esguerra commented Feb 7, 2018

Hej,
Good idea.
M.
