MPI crashed in Q6 intel - Rackham cluster #2

Open
klaudia-dais opened this issue Oct 25, 2017 · 20 comments

@klaudia-dais

I compiled Q on Rackham (intel), and with MPI the build shows the error "unknown option -Nmpi", but it still finishes. When I submit a job, it fails immediately with this kind of error:

[r101:2335] *** An error occurred in MPI_Allreduce
[r101:2335] *** reported by process [1808072705,0]
[r101:2335] *** on communicator MPI_COMM_WORLD
[r101:2335] *** MPI_ERR_OP: invalid reduce operation
[r101:2335] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[r101:2335] ***    and potentially your MPI job)

I tried different versions of intel, with both intelmpi and openmpi. Every time it crashes with a similar error. When I run the same job on a different cluster, and locally with the serial Qdyn6, it works without problems.

Any idea how to solve it?

@acmnpv commented Oct 25, 2017

Thanks for moving this to the open issue tracker.
Please attach the workflow for compiling and the minimal run script (you can ask Miha how to reduce it to the basic parts w/o all the extra bits I added)

acmnpv self-assigned this Oct 25, 2017
acmnpv added the bug label Oct 25, 2017
@klaudia-dais

My procedure:

git clone https://www.github.com/qusers/Q6.git
cd Q6/src/
module purge
module load intel/17.4
module load openmpi/2.1.1
make all COMP=ifort
make mpi COMP=ifort
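
A quick way to check which compiler the MPI wrapper actually invokes (a diagnostic sketch, assuming the openmpi module provides the mpifort wrapper; --showme is OpenMPI's option for printing the underlying compile line):

# With the intel and openmpi modules loaded, this should print an
# ifort command line; anything else means a mismatched wrapper:
mpifort --showme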

Run script (just the important part):
#!/bin/bash -l
#SBATCH -J Node1_OH
#SBATCH -n 4
#SBATCH -t 00:10:00
#SBATCH -A p2011165

module purge
module load intel/18.0 intelmpi/18.0

mpirun -np 4 /home/klaudia/Q/Q6/bin/Qdyn6p relax.inp > relax.log
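
Note that the modules loaded here (intel/18.0 + intelmpi/18.0) are not the ones the binary was compiled with (intel/17.4 + openmpi/2.1.1); an invalid-handle error such as MPI_ERR_OP from MPI_Allreduce can be a symptom of running a binary against a different MPI library than the one it was built with. One way to check what Qdyn6p actually links (a diagnostic sketch using standard tools):

# List the shared libraries the binary needs and pick out the MPI ones;
# the libmpi that shows up should match the MPI module loaded at run time:
ldd /home/klaudia/Q/Q6/bin/Qdyn6p | grep -i mpi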

@acmnpv commented Oct 25, 2017

Thanks, please also upload the input files needed for relax.inp

qusers deleted a comment from klaudia-dais Oct 25, 2017
@acmnpv commented Oct 25, 2017

Sorry, I meant that you should attach an archive with all the files (input, topology, and the FEP file if needed)

@klaudia-dais

run.tar.gz

@acmnpv commented Oct 25, 2017

Perfect, thank you!

@acmnpv commented Oct 25, 2017

Please try to build Q6 with the modules intel/17.4 intelmpi/17.4.
I saw a large number of compile warnings with intel and openmpi, so they might not be compatible.
When running your job, use srun -n $THISCORES inside your sbatch file.
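
Put together with the script above, the suggested job script would look something like this (a sketch; the explicit core count stands in for the hypothetical $THISCORES):

#!/bin/bash -l
#SBATCH -J Node1_OH
#SBATCH -n 4
#SBATCH -t 00:10:00
#SBATCH -A p2011165

module purge
# Load the same modules that were used at compile time:
module load intel/17.4 intelmpi/17.4

# srun instead of mpirun; -n matches the SBATCH core count above:
srun -n 4 /home/klaudia/Q/Q6/bin/Qdyn6p relax.inp > relax.log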

@esguerra

Hej Klaudia and Paul,

Also, you should comment out the -Nmpi flag in the makefile if compiling with intel only; that option doesn't exist for intelmpi.

You can try compiling just with intel like so:

module load intelmpi/18.0
module load intel/18.0
make mpi COMP=ifort
make all COMP=ifort

Using the attached makefile. Before using it, make sure to do:

mv makefile.txt makefile

For some reason GitHub doesn't allow uploads of extensionless files, which is why I uploaded it with the .txt extension.

makefile.txt
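
If you would rather patch the makefile you already have, stripping the flag is a one-liner (a sketch; the exact line the flag sits on may differ, and sed saves a backup):

# Find where the -Nmpi flag appears:
grep -n -- '-Nmpi' makefile
# Strip it in place, keeping the original as makefile.bak:
sed -i.bak 's/-Nmpi//g' makefile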

The compilation takes a looooooooong time, no idea why.
Any idea why the compilation is so slow, Paul?

I will try to run your files on Rackham and see what's going on too.

Cheers,

M.

@acmnpv commented Oct 25, 2017

I ran a test already with two cores and srun and it worked fine.
The compilation is slow with intel because the function inliner seems to go insane somewhere, but the inlining is needed for performance.
I need to upload a patch for the makefile; the flag doesn't do any harm, but the warning is confusing, I agree.

@acmnpv commented Oct 26, 2017

If there are no more issues now, I would close this one again.
Otherwise we could keep it open as a reminder that we need to fix the intel/openmpi combination.

Cheers

Paul

@acmnpv commented Oct 26, 2017

Also, Mauricio, can you make a quick pull request for the makefile (or push it yourself)?
So we can at least get rid of the annoying warnings. :D

@esguerra

Hej,

So, I am missing something from Rackham.
Klaudia's example only works with srun.
Any clue as to the reason Paul?

If I don't use srun, nasty MPI messages appear, but with srun all seems fine and dandy.

[screenshot: MPI error messages, 2017-10-26 at 11:25 AM]

@esguerra

I can send a pull request with the makefile, but first accepting your

@esguerra

Oops, I don't know how I managed to close this.
I haven't been able to solve the MPI issues on Rackham with intel.
Has Klaudia managed to solve them?

esguerra reopened this Oct 30, 2017
@acmnpv commented Oct 31, 2017

Fine with me. I did not have more time to look into this, but it reliably crashed ddt during the mpi_init part. No idea what the heck is going on; it might be an issue with the MPI set-up on Rackham.

@acmnpv commented Feb 6, 2018

Mauricio, did you have some more luck in testing this?

@esguerra commented Feb 6, 2018

Hej,
Good reminder.
Last time I tried, using srun had done the trick, which is very odd, since sbatch and srun are both talking to slurm in the same way, AFAIK.
I will give it another try with the latest version and write back what I see.

@esguerra commented Feb 6, 2018

For some reason, on the Rackham cluster they have aliased mpirun to echo this:

alias mpirun='echo Please use srun'
/usr/bin/echo

They say that this is needed when using intel-compiled programs, that is, you should use srun instead of mpirun.
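
You can see how the shell resolves mpirun yourself (a quick check; type is a bash builtin, and type -P bypasses aliases to search PATH):

# Reports the alias if one is set:
type mpirun
# Shows the actual executable on PATH, ignoring the alias:
type -P mpirun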

So, compiling with:

module load intel/18.1
module load intelmpi/18.1
make all COMP=ifort

Produces a binary which works when invoked with:

srun -n 8 Qdyn6p eq1.inp > eq1.log
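
So the full working recipe on Rackham would be something like this (a sketch combining the steps above; make mpi is carried over from the original build procedure, since Qdyn6p is the MPI binary):

# Build with a matched Intel toolchain:
module purge
module load intel/18.1
module load intelmpi/18.1
make all COMP=ifort
make mpi COMP=ifort

# Run under slurm with srun instead of mpirun:
srun -n 8 Qdyn6p eq1.inp > eq1.log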

Paul, I guess you can close this if @klaudia-dais also sees her jobs running when Q is compiled and run in the suggested way.

@acmnpv commented Feb 7, 2018

I haven't heard anything back, but I think we may want to keep it open until we add something about this to the README?

@esguerra commented Feb 7, 2018

Hej,
Good idea.
M.
