NPROC > 1 not working #194

raulleoncz · 2024-02-28T08:08:00Z

Hello Mr. @bch0w,

I'm trying to use the MPI option for nproc>1. I already compiled specfem2d using FC=ifort, CC=icc and MPIFC=mpiifort but I'm getting this error:

Captura de pantalla 2024-02-27 a la(s) 10 57 16 p m

The parameters than I'm using for the simulation are:

system: workstation
optimize: LBFGS
nproc: 20
ntask: 12
mpiexec: mpirun

Can you help me to understand why I'm getting this error?
PD. I also try to run the example 1 with nproc 4 and I got this error:

I hope you can help me.
Thanks.

bch0w · 2024-03-01T01:55:04Z

Hi @raulleoncz, sorry you're having issues with the example problem, thanks for providing the error messages.

Starting with the second issue, I think this is coming from an update to the SPECFEM2D parameter file that has broken one of the functionalities used in the example (related #196). I'll have to make an update to the code to fix this, sorry!

Regarding your first issue, seems like there is some trouble reading your SPECFEM model, if I am reading the error message correctly, it seems like all 20 parts of the model file may be empty? Are you able to check the outputs of meshfem/specfem to make sure they ran properly?

bch0w · 2024-03-01T03:04:37Z

Hi @raulleoncz, I think I fixed the second issue you were seeing in #197 and the subsequent devel commits (if you are using devel branch). Can you please update and let me know if that solves that issue?

raulleoncz · 2024-03-01T04:01:49Z

Hi @bch0w, I already ran the example using the devel branch. I didn't get the previous error but I got this:

Captura de pantalla 2024-02-29 a la(s) 7 52 48 p m

bch0w · 2024-03-01T18:15:39Z

Hi @raulleoncz, woops sorry there was a missing import statement there, I've added that to the latest commit (9c2c082).
However that code block was behind a Timeout error so you would have encountered the following error message:

seisflows/seisflows/system/workstation.py

Lines 256 to 259 in 9c2c082

    
           f"System Task {task_id} has exceeded the " 
        
           f"defined tasktime {tasktime}m. Please check " 
        
           f"log files and adjust `tasktime` if required",  
        
           header="timeout error",  border="=")

That suggests that something may be going wrong with your forward simulation, either you need to increase tasktime to allow the simulation time to finish, or check the output logs in scratch/solver/mainsolver/fwd_solver.log to see if there is something going wrong with specfem2d

raulleoncz · 2024-03-02T05:09:51Z

Hi @bch0w.
You are right, after modifying the 'tasktime' in example 1 the simulation runs without any troubles.

On the other hand, in the first error I showed above, I have checked the .bin files when running for nproc>1. Fortunately, specfem has added a python script to visualize the 'proc000....bin' files and those files look correct.

--- Update ---

I have been checking the example's files and the first thing I noticed is that xmeshfem2D was ran with mpi (first thing I was doing differently):
"mpirun -n $NPROC ./bin/xmeshfem2D"
something that produces files like mesh0000{number of proc}_{variable}.vtk

Also, when looking the mesher_log.txt we can see that the total number of elements were divided equally, this means that each processor has (in the example case) 400 elements. Comparing my simulation with the example, I see that this condition is not being met. For example, my simulation has 58871, 56329, 57523 and 57677 elements per processor. Is it possible that this affects the simulation?

Thanks for the help.

raulleoncz · 2024-03-08T00:18:11Z

Based on the last idea, I run forward simulation using a .xyz file and looking for an equally distribution of the spectral elements, literally running xmeshfem2D again and again. After having same number of elements in both init and true models I submitted the job and the first time I got this error:

The external numerical solver has returned a nonzero exit code (failure).
Consider stopping any currently running jobs to avoid wasted computational
resources. Check 'scratch/solver/mainsolver/fwd_solver.log' for the solvers
stdout log message. The failing command and error message are:

exc: mpirun -n 4 bin/xspecfem2D
err: Command 'mpirun -n 4 bin/xspecfem2D' returned non-zero exit status 2.

bch0w · 2024-03-11T17:49:06Z

Hi @raulleoncz, sorry for the slow response here, I'm still trying to figure out the exact issue you're facing.

I have been checking the example's files and the first thing I noticed is that xmeshfem2D was ran with mpi (first thing I was doing differently): "mpirun -n $NPROC ./bin/xmeshfem2D" something that produces files like mesh0000{number of proc}_{variable}.vtk

When you run SPECFEM with nproc > 1 then it is natural for the mesh and simulation to be split over many processors, so this seems fine and expected.

Also, when looking the mesher_log.txt we can see that the total number of elements were divided equally, this means that each processor has (in the example case) 400 elements. Comparing my simulation with the example, I see that this condition is not being met. For example, my simulation has 58871, 56329, 57523 and 57677 elements per processor. Is it possible that this affects the simulation?

I suspect something is going wrong with meshfem or specfem, do you mind sharing the following log files, you can probably attach them to your message directly or in a zip file, that would help diagnose the problem.

scratch/solver/mainsolver/fwd_solver.log
scratch/solver/mainsolver/fwd_mesher.log
scratch/solver/mainsolver/OUTPUT_FILES/output_mesher.txt
scratch/solver/mainsolver/OUTPUT_FILES/output_solver.txt

raulleoncz · 2024-03-19T20:12:34Z

Hello @bch0w, I'm sorry for my slow response.

I was trying to run the simulations again but I wasn't able to get the same mesh partition. The error that I got is the same as in the first image ("The array has an inhomogeneous shape") and because of that I'm not able to add the log files.

Just to give more information, I tried with the last version of Specfem2D (devel brach 8.1.0) and the version of the example 1. Both of them worked as it should be but when I wanted to run another example of specfem, let's say the "tomographic_ocean_model" example, I faced the same error. I don't know if the gcc, mpif90 and gfortran version has something to do. Just in case, I'm using openmpi-gcc12 and fftw 3.3.10_0+gfortran.

Are you able to run the simulations with mpirun? Maybe I'm using a wrong version or configuration.

bch0w · 2024-03-28T19:26:19Z

Hi @raulleoncz, if I'm understanding correctly, this sounds more like a SPECFEM2D issue than a SeisFlows issue. Similarly the SeisFlows examples are really only configured to run a very specific SPECFEM2D problem so there is no guarantee that switching to a different example will work. I'd encourage you to open an issue with SPECFEM (https://github.com/SPECFEM/specfem2d/issues) and hopefully you can get some more targeted feedback.

raulleoncz · 2024-04-10T08:00:07Z

Hello @bch0w, I am really sorry for my late response.
I have been investigating and what I found is that after dividing the elements, each processor has their own number of elements and what it is happening is that we are like trying to combine arrays of different shapes. I found that here: https://www.golinuxcloud.com/setting-an-array-element-with-a-sequence/

In one of my examples, I was using 4 processors and the elements were 58871, 56329, 57523 and 57677 elements per processor and it did not worked. I already tried to run one example with 2 processors and it worked because the elements were the same in both processors. I will add below the fwd_mesher.log and fwd_solver.log.
fwd_mesher.log
fwd_solver.log

Also, I add pictures of the domain of each processor:

I hope this information can be useful...

bch0w mentioned this issue Mar 1, 2024

Hotfix: Fixing incompatibility with updated SPECFEM2D parameter file #197

Merged

bch0w added the question label Mar 19, 2024

raulleoncz mentioned this issue Jun 3, 2024

Bugfix Model class incomaptible with inhomogeneous arrays #215

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NPROC > 1 not working #194

NPROC > 1 not working #194

raulleoncz commented Feb 28, 2024 •

edited

Loading

bch0w commented Mar 1, 2024

bch0w commented Mar 1, 2024

raulleoncz commented Mar 1, 2024

bch0w commented Mar 1, 2024

raulleoncz commented Mar 2, 2024 •

edited

Loading

raulleoncz commented Mar 8, 2024

bch0w commented Mar 11, 2024

raulleoncz commented Mar 19, 2024

bch0w commented Mar 28, 2024

raulleoncz commented Apr 10, 2024 •

edited

Loading

NPROC > 1 not working #194

NPROC > 1 not working #194

Comments

raulleoncz commented Feb 28, 2024 • edited Loading

bch0w commented Mar 1, 2024

bch0w commented Mar 1, 2024

raulleoncz commented Mar 1, 2024

bch0w commented Mar 1, 2024

raulleoncz commented Mar 2, 2024 • edited Loading

raulleoncz commented Mar 8, 2024

bch0w commented Mar 11, 2024

raulleoncz commented Mar 19, 2024

bch0w commented Mar 28, 2024

raulleoncz commented Apr 10, 2024 • edited Loading

raulleoncz commented Feb 28, 2024 •

edited

Loading

raulleoncz commented Mar 2, 2024 •

edited

Loading

raulleoncz commented Apr 10, 2024 •

edited

Loading