Strong scaling test

I have run a strong scaling test of the BinaryBH example on the CSD3 Ice Lakes using GRTeclyn and run an analogous test with GRChombo for comparison.

Configuration details

Build options

I used the modules loaded by the rhel8/default-icl module. In particular, the relevant ones are:

GRTeclyn

- b752027c1
- COMP = intel-llvm with CXXFLAGS += -ipo -xICELAKE-SERVER -fp-model=fast -cxx=icpx
- USE_MPI = TRUE
- USE_OMP = TRUE

GRChombo

- Chombo at 38a95f8
- GRChombo at b37dd59
- Additional modules:

Show Make.defs.local

Run time configuration

I ran on the CSD3 Ice Lakes, where each node comprises a total of 76 cores (38 per socket). The different job sizes are listed in the table below.
| Number of nodes | Total number of MPI tasks | Number of OpenMP threads per task | Total CPUs |
|---|---|---|---|
| 1 | 19 | 4 | 76 |
| 2 | 38 | 4 | 152 |
| 4 | 76 | 4 | 304 |
| 8 | 152 | 4 | 608 |
| 16 | 304 | 4 | 1216 |
| 32 | 608 | 4 | 2432 |
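The task and CPU counts above follow directly from the fixed layout of 19 MPI tasks per node and 4 OpenMP threads per task. As a small bash sketch (purely illustrative, not part of the actual submission scripts) that reproduces the table:

# Reproduce the job-size table: 19 MPI tasks per node, 4 OpenMP threads per task
for nodes in 1 2 4 8 16 32; do
    tasks=$((nodes * 19))    # total MPI tasks
    cpus=$((tasks * 4))      # total CPUs = tasks x OpenMP threads per task
    echo "${nodes} nodes: ${tasks} tasks, 4 threads/task, ${cpus} CPUs"
done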
I used SLURM job arrays to run each simulation 4 times and then took the mean of the results. A template jobscript can be found below.
Show template jobscript
#!/bin/bash
#!
#!#############################################################
#!#### Modify the options in this section as appropriate ######
#!#############################################################
#! sbatch directives begin here ###############################
#! Name of the job:
#SBATCH -J GRC-scaling-test
#! Which project should be charged:
#SBATCH -A DIRAC-DP002-CPU
#SBATCH -p icelake
#! How many whole nodes should be allocated?
#SBATCH --nodes=<number of nodes>
#! How many tasks per node
#SBATCH --ntasks-per-node=19
#! How many cores per task
#SBATCH -c 4
#! How much wallclock time will be required?
#SBATCH --time=0:30:00
#! What types of email messages do you wish to receive?
#SBATCH --mail-type=all
#SBATCH --array=1-4
#! sbatch directives end here (put any additional directives above this line)

#! Notes:
#! Charging is determined by cpu number*walltime.

#! Number of nodes and tasks per node allocated by SLURM (do not change):
numnodes=$SLURM_JOB_NUM_NODES
numtasks=$SLURM_NTASKS
mpi_tasks_per_node=$SLURM_NTASKS_PER_NODE

#! ############################################################
#! Modify the settings below to specify the application's environment, location
#! and launch method:

#! Optionally modify the environment seen by the application
#! (note that SLURM reproduces the environment at submission irrespective of ~/.bashrc):
. /etc/profile.d/modules.sh        # Leave this line (enables the module command)
module purge # Removes all modules still loaded
module restore grchombo-intel-2021.6-icl   # module collection with modules above

#! Full path to application executable:
application="/path/to/executable"

#! Run options for the application:
options="$SLURM_SUBMIT_DIR/params.txt"

#! Work directory (i.e. where the job will run):
workdir="$SLURM_SUBMIT_DIR/run${SLURM_ARRAY_TASK_ID}"

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=true

CMD="srun -K -c $SLURM_CPUS_PER_TASK --distribution=block:block $application $options"

###############################################################
### You should not have to change anything below this line ####
###############################################################

mkdir -p $workdir
cd $workdir
echo -e "Changed directory to `pwd`.\n"

JOBID=$SLURM_JOB_ID

echo -e "JobID: $JOBID\n======"
echo "Time: `date`"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"

if [ "$SLURM_JOB_NODELIST" ]; then
        #! Create a machine file:
        export NODEFILE=`generate_pbs_nodefile`
        cat $NODEFILE | uniq > machine.file.$JOBID
        echo -e "\nNodes allocated:\n================"
        echo `cat machine.file.$JOBID | sed -e 's/\..*$//g'`
fi
module list
echo -e "\nnumtasks=$numtasks, numnodes=$numnodes, mpi_tasks_per_node=$mpi_tasks_per_node (OMP_NUM_THREADS=$OMP_NUM_THREADS)"

echo -e "\nExecuting command:\n==================\n$CMD\n"

eval $CMD
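The mean over the four job-array runs was taken offline. As a rough sketch, assuming each run directory run1 to run4 ends up containing a hypothetical walltime.txt holding the measured walltime in seconds (the actual timing output is not named this), the averaging step could look like:

# Average the walltime over the four array runs (hypothetical walltime.txt files)
cat run{1,2,3,4}/walltime.txt | awk '{ sum += $1 } END { printf "mean walltime: %.2f s\n", sum / NR }'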
Parameters
The parameter files for each code can be found by following the links below.
Note that I fixed the box size to $16^3$ in order to maximize scaling efficiency.
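For illustration, fixing the boxes at $16^3$ would correspond to parameter-file entries along the following lines (parameter names quoted from memory and given as an assumption; the exact settings are in the linked parameter files):

# GRChombo params.txt (assumed parameter names)
max_grid_size = 16
block_factor = 16

# GRTeclyn/AMReX inputs (assumed parameter names)
amr.max_grid_size = 16
amr.blocking_factor = 16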
Results
The average walltime to complete 2 timesteps on the coarsest level is shown in the plot below. Perfect scaling is shown with the dashed line.
The strong scaling efficiency (i.e. a value of $1.0$ means that increasing the number of nodes/CPUs by a factor of $F$ decreases the simulation walltime by the same factor $F$) is shown in the plot below.
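Concretely, with $T(N)$ the mean walltime on $N$ nodes and $N_{\mathrm{ref}}$ the smallest job size (presumably 1 node here), the plotted efficiency is

$$ E(N) = \frac{N_{\mathrm{ref}}\, T(N_{\mathrm{ref}})}{N\, T(N)}, $$

so $E(N) = 1$ corresponds to the perfect-scaling dashed line in the previous plot.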
Discussion
Although the tagging parameters were chosen to be the same for both GRChombo and GRTeclyn, inspecting the output shows that GRChombo consistently refines a larger region of the grid. This explains at least some of the significantly shorter walltimes for GRTeclyn compared to GRChombo (note that the log scale on the plot diminishes the apparent difference). Because the GRChombo configuration has larger refined regions and thus more cells to evolve, it is not surprising that it achieves better strong scaling at the larger node counts: with less work per rank, GRTeclyn reaches its scaling bottlenecks sooner.
The only slightly surprising result is that the GRChombo simulation with the largest job size (32 nodes) completes a little faster than the GRTeclyn simulation of the same job size.
It might be worth re-running this scaling analysis after determining the largest possible simulation that can fit on a single Ice Lake node.