
Fix LAMMPS Parallel Job for Pitzer Expansion #89

Open
ericfranz opened this issue Aug 31, 2020 · 4 comments

Labels
bug Something isn't working

@ericfranz
Contributor

sbatch: error: Invalid numeric value "/bin/bash" for --core-spec.

Revert 7dfd108 on the branch and then fix.
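
A likely cause (an assumption from the error text, not confirmed in this thread): sbatch reads -S as the short form of --core-spec, so a leftover Torque/PBS-style shell directive in the job header would produce exactly this message.

# Hypothetical offending directive: PBS uses -S to choose the shell,
# but sbatch interprets -S as --core-spec and rejects "/bin/bash"
#SBATCH -S /bin/bash

# In a Slurm script the interpreter is set by the shebang line instead:
#!/bin/bash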

@ericfranz ericfranz added the bug Something isn't working label Aug 31, 2020
@msquee msquee self-assigned this Aug 31, 2020
@msquee
Contributor

msquee commented Sep 2, 2020

LAMMPS script:

#!/bin/bash
#SBATCH -J ondemand/sys/myjobs/basic_lammps_parallel
#SBATCH --time=00:30:00
#SBATCH --nodes=2 
#SBATCH --ntasks-per-node=40


scontrol show job $SLURM_JOBID

module load intel/19.0.5
module load mvapich2/2.3.4
module load lammps/3Mar20

export OMP_NUM_THREADS=80 # this must match nodes * ppn

cd $TMPDIR
sbcast -p /users/appl/srb/workshops/compchem/lammps/in.crack $TMPDIR/in.crack

lammps < $TMPDIR/in.crack

ls -al

Output:

[proxy:0:0@p0595.ten.osc.edu] HYD_pmcd_pmi_args_to_tokens (./pm/pmiserv/common.c:138): assert (*count * sizeof(struct HYD_pmcd_token)) failed
[proxy:0:0@p0595.ten.osc.edu] fn_job_getid (./pm/pmiserv/pmip_pmi_v2.c:226): unable to convert args to tokens
[proxy:0:0@p0595.ten.osc.edu] pmi_cb (./pm/pmiserv/pmip_cb.c:327): PMI handler returned error
[proxy:0:0@p0595.ten.osc.edu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@p0595.ten.osc.edu] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec@p0595.ten.osc.edu] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
[mpiexec@p0595.ten.osc.edu] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@p0595.ten.osc.edu] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
[mpiexec@p0595.ten.osc.edu] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion

@treydock
Contributor

treydock commented Sep 2, 2020

Do not set OMP_NUM_THREADS to more than the number of processors on a node. Threads do not cross nodes, so either set OMP_NUM_THREADS=1 with 40 processes per node, or OMP_NUM_THREADS=40 with 1 process per node. If you set OMP_NUM_THREADS=80 you will oversaturate the node and either perform very poorly or simply crash the job.

$ salloc -N 2 --ntasks-per-node=48 srun --pty /bin/bash
salloc: Pending job allocation 19405
salloc: job 19405 queued and waiting for resources
salloc: job 19405 has been allocated resources
salloc: Granted job allocation 19405
salloc: Waiting for resource configuration
salloc: Nodes p[0597-0598] are ready for job


$ module load intel/19.0.5 mvapich2/2.3.4 lammps/3Mar20
$ cd $TMPDIR
$ sbcast -p /users/appl/srb/workshops/compchem/lammps/in.crack $TMPDIR/in.crack
$ export OMP_NUM_THREADS=48
$ lammps < $TMPDIR/in.crack
<no errors>

Log into node on another shell:

$ ps aux -L | grep -c lammps
2457

That's bad: that was essentially 48 processes with 48 threads each, which is not what you want.

Changing to OMP_NUM_THREADS=1 gives the correct number of processes, but produces this error:

[p0597.ten.osc.edu:mpi_rank_0][create_intra_sock_comm] Failed to get correct process to socket binding info.Proceeding by disabling socket aware collectives support.
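
Based on the rule above, a sketch of the two consistent request/thread combinations for the 40-core Pitzer nodes in the original script (not a tested replacement) would be:

# Option 1: pure MPI -- 40 ranks per node, one thread per rank
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
export OMP_NUM_THREADS=1

# Option 2: hybrid -- one rank per node, 40 OpenMP threads per rank
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
export OMP_NUM_THREADS=40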

@treydock
Contributor

treydock commented Sep 2, 2020

It looks like using my example with OMP_NUM_THREADS=1 is successful; there may just be an error at the end of the job.

@ericfranz
Contributor Author

In the case of this particular template, ZQ recommended we use MPI for the example instead of OpenMP, which I think means just dropping OMP_NUM_THREADS altogether.

For example, see the Pitzer example at https://www.osc.edu/resources/available_software/software_list/lammps
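
A minimal sketch of what that MPI-only version might look like (assuming the same modules, input file, and launch line as the original script; the canonical Pitzer example on the OSC page linked above may differ):

#!/bin/bash
#SBATCH -J ondemand/sys/myjobs/basic_lammps_parallel
#SBATCH --time=00:30:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40

scontrol show job $SLURM_JOBID

module load intel/19.0.5
module load mvapich2/2.3.4
module load lammps/3Mar20

# No OMP_NUM_THREADS export: each Slurm task runs as a single-threaded MPI rank

cd $TMPDIR
sbcast -p /users/appl/srb/workshops/compchem/lammps/in.crack $TMPDIR/in.crack

lammps < $TMPDIR/in.crack

ls -al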
