tp2.10 MPI_OMPH test not reproducing with different number of MPI/OMP threads #321
Comments
@ukmo-ccbunney I remember you mentioning that you thought you had looked into this before. I just found a switch, B4B, that is a double switch. I wonder if these are the updates that you made for this issue before?
@JessicaMeixner-NOAA - yes - B4B is a "bit 4 bit" switch. I added it to get around the b4b issues we originally had with that regtest (to do with a reduction in the SMC module). When enabled, the B4B switch scales and converts some real values to integer values to avoid rounding errors during the reduction. It fixes the issue when running with the same number of PEs/threads (at least for us), but there may be other parts of the code that cause issues when running with a different combination of PEs/threads...
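For illustration, here is a minimal sketch of the idea behind that kind of bit-for-bit reduction (plain Python, not the actual WW3/SMC Fortran code; the scale factor and partition counts are arbitrary). Floating-point partial sums combined in a different order, as happens when the PE/thread count changes, can differ in the last bits, whereas partial sums of scaled integers are exact and therefore identical for any partitioning.

```python
# Minimal sketch (assumed, not the actual WW3/SMC implementation) of why a
# "scale reals to integers" trick can restore bit-for-bit reproducibility
# when the number of partial sums (PEs/threads) changes.
import random

random.seed(42)
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

def float_reduce(vals, nparts):
    """Sum in nparts chunks, then combine - mimics a floating-point reduction."""
    chunk = (len(vals) + nparts - 1) // nparts
    partials = [sum(vals[i:i + chunk]) for i in range(0, len(vals), chunk)]
    return sum(partials)

def b4b_reduce(vals, nparts, scale=2**40):
    """Scale to integers before summing; integer addition is exact, so the
    result does not depend on how the work is partitioned."""
    chunk = (len(vals) + nparts - 1) // nparts
    partials = [sum(round(v * scale) for v in vals[i:i + chunk])
                for i in range(0, len(vals), chunk)]
    return sum(partials) / scale

# Floating-point reductions may differ in the last bits as the partitioning changes...
print([float_reduce(values, n) for n in (1, 2, 6, 12)])
# ...while the scaled-integer version is identical for every partitioning.
print([b4b_reduce(values, n) for n in (1, 2, 6, 12)])
```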
After updating mpirun to srun on the NOAA HPC, ww3_tp2.16/work_MPI_OMPH is showing similar behavior. It is not b4b anymore.
Thanks for the update @aliabdolali. I will investigate when I get more time.
@ukmo-ccbunney I noticed that we do have the "B4B" flag in both the tp2.10 and tp2.16 OMPH cases. I had thought that tp2.16 was an issue where something was not getting recompiled correctly, but I used the same switches for both tests and it was okay. (I thought this because if I ran the tp2.16 test from clean clones I could get b4b, but not if it was run after the tp2.10 test, and using the tp2.10 switch did not help.)
@JessicaMeixner-NOAA I think the B4B implementation must no longer be working (or it is and there is some other issue). It did originally fix the b4b issue with the SMC propagation a few years ago, but it seems to have come back again and I am not sure why! Certainly, it seems to be implementation-specific, as you have seen with your change of scheduler.
When running:
./bin/run_test -b slurm -c hera -S -T -s MPI_OMPH -w work_MPI_OMPH -f -p mpirun -n 2 -t 6 ../model -o both ww3_tp2.10
with either a different number of MPI tasks or OMP threads, there are differences when comparing with matrix.comp.
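For context, the regtest comparison itself is done with matrix.comp; the sketch below is only an illustrative, standalone byte-for-byte check between two work directories produced with different -n/-t settings. The directory names and the out_grd.* file pattern are assumptions, not output of the regtest scripts.

```python
# Minimal sketch (hypothetical paths/filenames) of a byte-for-byte comparison
# between two regression-test work directories, e.g. runs of ww3_tp2.10 made
# with different -n (MPI tasks) / -t (OMP threads) settings.
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    """Return the SHA-256 of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def compare_work_dirs(dir_a: str, dir_b: str, pattern: str = "out_grd.*"):
    a, b = Path(dir_a), Path(dir_b)
    for file_a in sorted(a.glob(pattern)):
        file_b = b / file_a.name
        if not file_b.exists():
            print(f"missing in {dir_b}: {file_a.name}")
        elif digest(file_a) != digest(file_b):
            print(f"NOT identical: {file_a.name}")
        else:
            print(f"identical:     {file_a.name}")

# Example usage (directory names are assumptions):
# compare_work_dirs("work_MPI_OMPH_n2t6", "work_MPI_OMPH_n4t3")
```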