{bio}[foss/2023b] GROMACS v2024.3 #21430

Merged

bedroge
Contributor

@bedroge bedroge commented Sep 17, 2024

(created using eb --new-pr)

Compared to previous easyconfigs, this now installs the PyPI version of gmxapi. The versioning of the gmxapi bundled with GROMACS is a bit confusing: https://gitlab.com/gromacs/gromacs/-/blob/v2024.3/python_packaging/gmxapi/pyproject.toml?ref_type=tags says 0.4.1, https://gitlab.com/gromacs/gromacs/-/blob/v2024.3/python_packaging/gmxapi/src/gmxapi/version.py?ref_type=tags shows 0.5.0a1, and the docs just recommend using the PyPI version (where the latest version is 0.4.2).

@bedroge bedroge added the update label Sep 17, 2024
@bedroge
Contributor Author

bedroge commented Sep 17, 2024

@boegelbot please test @ jsc-zen3
CORE_CNT=16

@boegelbot
Collaborator

@bedroge: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=21430 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_21430 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 4887

Test results coming soon (I hope)...

- notification for comment with ID 2355567729 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@bedroge
Contributor Author

bedroge commented Sep 17, 2024

Test report by @bedroge
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bob-Latitude-5300 - Linux Ubuntu 24.04.1 LTS (Noble Numbat), x86_64, Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz, Python 3.12.3
See https://gist.github.com/bedroge/00b77ed6bd3a5d428ec87908696c72e3 for a full test report.

edit: oops, forgot to include the fix from easybuilders/easybuild-easyblocks#3283, ran into that before...

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/50ee5c7ff043d4ebd45c44d1b99799af for a full test report.

@bedroge
Contributor Author

bedroge commented Sep 17, 2024

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@bedroge: Request for testing this PR well received on login1

PR test command 'EB_PR=21430 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_21430 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 14284

Test results coming soon (I hope)...

- notification for comment with ID 2355672238 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

boegelbot commented Sep 17, 2024

Test report by @boegelbot
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
cnx1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/17e64042a902dfac5e5c3d9084e53233 for a full test report.

Test failure in GmxapiMpiTests:

File input/output error:
/tmp/boegelbot/GROMACS/2024.3/foss-2023b/easybuild_obj/api/gmxapi/cpp/tests/Testing/Temporary/GmxApiTest_RunnerChainedMD.trr

Let's try again...

@bedroge
Contributor Author

bedroge commented Sep 17, 2024

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@bedroge: Request for testing this PR well received on login1

PR test command 'EB_PR=21430 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_21430 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 14285

Test results coming soon (I hope)...

- notification for comment with ID 2355854315 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@bedroge
Contributor Author

bedroge commented Sep 17, 2024

Test report by @bedroge
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3283
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bob-Latitude-5300 - Linux Ubuntu 24.04.1 LTS (Noble Numbat), x86_64, Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz, Python 3.12.3
See https://gist.github.com/bedroge/3ef052bc3b70ea4b0c73ab4eb450ade9 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
cnx1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/d174a112ed863aa2ef88e1386ea67320 for a full test report.

@bedroge
Contributor Author

bedroge commented Sep 17, 2024

Also tested this with the EESSI bot for a bunch of CPUs: EESSI/software-layer#709. There it also failed on haswell with the same input/output error, so I've started another build.

@boegel
Member

boegel commented Sep 17, 2024

Test report by @boegel
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3105.skitty.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/boegel/1a1c9f63138cf054dbb09e6d2b83ea0a for a full test report.

@mabraham

mabraham commented Sep 18, 2024

Test report by @boegel FAILED Build succeeded for 0 out of 1 (1 easyconfigs in total) node3105.skitty.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8 See https://gist.github.com/boegel/1a1c9f63138cf054dbb09e6d2b83ea0a for a full test report.

GROMACS dev here. I see that the following test case

[  FAILED  ] PropagatorsWithCoupling/PeriodicActionsTest.PeriodicActionsAgreeWithReference/21, where GetParam() = ({ ("comm-mode", "linear"), ("integrator", "md-vv"), ("maxGromppWarningsTolerated", "0"), ("nstcomm", "5"), ("nstpcouple", "3"), ("nsttcouple", "2"), ("pcoupl", "C-rescale"), ("simulationName", "argon12"), ("tcoupl", "v-rescale") }, 0x55e03938e489)

fails, either timing out or somehow suspended or crashed. C-rescale is a relatively new implementation, and this test case is intended to exercise dark corners of the code, so a real problem is possible.

Yet I see the preceding test case (at https://gist.github.com/boegel/75ff6503735f73f2d9ec570366bd181f#file-gromacs-2024-3-foss-2023b_partial-log-L374) took 25 seconds. On my x86 laptop with a release debug build the whole test suite takes under 4 seconds. Why is this GROMACS configuration so slow?

@boegel
Member

boegel commented Sep 18, 2024

@mabraham It's probably not the GROMACS configuration itself, but the environment it's running in.

It's running in an interactive Slurm job, with 9 cores available (in a cgroup) out of 36 on that system.
It's also an Intel Skylake system (Intel Xeon Gold 6140), which isn't exactly new.

In addition, $OMP_PROC_BIND is set to TRUE by default on that system (via a profile script).
In general, that should improve performance for OpenMP workloads, but we've seen it cause trouble before: for some software, multi-threaded processes get bound to a single core (that's definitely the case for R, for example) while they still start N threads, so the threads fight for resources, leading to very slow runs.
That's a known quirk of the GCC OpenMP runtime, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113698
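
The symptom described above (N threads sharing far fewer cores) can be sanity-checked from inside a job. The following is only a minimal sketch, assuming a Linux system where Python's `os.sched_getaffinity` is available; the helper names `allowed_cpus` and `oversubscription_ratio` are made up for illustration. It inspects only the process-level affinity mask, so it will not reveal per-thread bindings applied later by the OpenMP runtime.

```python
import os


def allowed_cpus():
    """Return the set of CPU ids this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: reflects taskset / cgroup cpuset restrictions.
        return set(os.sched_getaffinity(0))
    # Fallback (assumption): no affinity API, so pretend all CPUs are usable.
    return set(range(os.cpu_count() or 1))


def oversubscription_ratio(n_threads, cpus):
    """Threads per allowed core; above 1.0 means threads must share cores."""
    if not cpus:
        raise ValueError("empty CPU set")
    return n_threads / len(cpus)


if __name__ == "__main__":
    cpus = allowed_cpus()
    print("OMP_PROC_BIND =", os.environ.get("OMP_PROC_BIND", "<unset>"))
    print("allowed CPUs  =", sorted(cpus))
    # 9 is simply the job size mentioned in this discussion.
    print("threads per core for 9 threads =", oversubscription_ratio(9, cpus))
```

A ratio well above 1.0 for the mask a process actually ends up with would be consistent with the "bound to a single core" behaviour, although it does not prove it.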

I've seen this before, but I never got to the bottom of it for GROMACS...

If any of this rings a bell, any insights you may have are welcome.

@mabraham

The test cases are only using two pthreads, so if the system is working as you describe, there's no ready explanation of a problem. But if the core-to-cgroup mapping is not working right, such slowdowns are plausible. Do you have / can you get data to observe core occupancy across a loaded node?
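
One rough way to gather the kind of occupancy data asked about here is to sample which core each thread of a process last ran on. This is a sketch only, assuming a Linux `/proc` filesystem, where field 39 of `/proc/<pid>/task/<tid>/stat` is the CPU the thread last executed on; the function names are hypothetical and this is not a tool anyone in this thread used.

```python
from collections import Counter
from pathlib import Path


def last_cpu_from_stat(stat_line):
    """Extract field 39 ("processor") from a /proc/<pid>/task/<tid>/stat line."""
    # Field 2 (comm) is parenthesised and may contain spaces or ')',
    # so split on the last ')' before tokenising the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 is at index 36.
    return int(rest[36])


def thread_cpu_histogram(pid):
    """Count how many threads of `pid` last ran on each CPU id."""
    counts = Counter()
    for stat in Path(f"/proc/{pid}/task").glob("*/stat"):
        try:
            counts[last_cpu_from_stat(stat.read_text())] += 1
        except (OSError, IndexError, ValueError):
            continue  # thread exited meanwhile, or unexpected format
    return counts
```

Sampling `thread_cpu_histogram(pid)` a few times for a running test binary would show whether its threads pile up on one core or spread across the cores of the cgroup.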

@boegel
Member

boegel commented Oct 2, 2024

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3108.skitty.os - Linux RHEL 9.4, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (skylake_avx512), Python 3.9.18
See https://gist.github.com/boegel/541a0814c04255d0d7c33dd085ae4f9c for a full test report.

@boegel
Member

boegel commented Oct 3, 2024

The system I was testing on has been migrated from RHEL 8.8 to RHEL 9.4 since the last time I tested (17 Sept 2024), and the last attempt didn't fail (see test report above).

One difference there is that this test was done on a full workernode (all 36 cores assigned to the Slurm job), so there's no cgroup effect here. I also did an unset OMP_PROC_BIND before starting the test.

I'm now retesting in a 9-core Slurm job, where cgroup is set up such that available cores are spread across the node:

$ taskset -c -p $$
pid 3455741's current affinity list: 3,7,11-13,15,19,23,33
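
For reference, the affinity list printed by `taskset` above can be expanded to confirm that it covers exactly 9 cores. A small sketch (the helper name `parse_cpu_list` is made up for illustration):

```python
def parse_cpu_list(text):
    """Expand a taskset-style CPU list such as "3,7,11-13" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return cpus


cpus = parse_cpu_list("3,7,11-13,15,19,23,33")
print(len(cpus), sorted(cpus))
# prints: 9 [3, 7, 11, 12, 13, 15, 19, 23, 33]
```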

I didn't see any failing tests after an unset OMP_PROC_BIND in that environment either.

If I keep $OMP_PROC_BIND set to TRUE however, I see this:

      Start 80: MdrunCoordinationCouplingTests2Ranks
80/89 Test #80: MdrunCoordinationCouplingTests2Ranks .........***Timeout 480.00 sec
...
The following tests FAILED:
         80 - MdrunCoordinationCouplingTests2Ranks (Timeout)

Seems to be the same PropagatorsWithCoupling/PeriodicActionsTest.PeriodicActionsAgreeWithReference/21 as before.

That's not a total surprise though, we've seen other situations where setting $OMP_PROC_BIND to TRUE means trouble...

So long story short: friends don't let friends set $OMP_PROC_BIND to TRUE when running the GROMACS test suite?

@mabraham Not sure if it makes sense to integrate the "unset OMP_PROC_BIND if set" in the GROMACS test suite or not...

@mabraham

mabraham commented Oct 3, 2024

It certainly makes sense for you to integrate that unset call in your runner. By default, GROMACS does try to respect existing thread-affinity settings, but if it detects none, then it sets them itself. The main simulation engine itself has a command-line flag to specify behavior here, but these test binaries just do the default. However, the default only checks GOMP_CPU_AFFINITY, and no OMP_* variables, which looks like an omission.
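
As an illustration of what "unset OMP_PROC_BIND in the runner" could look like, here is a minimal sketch in Python. It is not the actual EasyBuild easyblock code; the function names and the test command passed in are placeholders chosen for this example.

```python
import os
import subprocess


def env_without_proc_bind(env=None):
    """Return a copy of the environment with OMP_PROC_BIND removed, if set."""
    clean = dict(os.environ if env is None else env)
    clean.pop("OMP_PROC_BIND", None)  # no error if the variable is absent
    return clean


def run_tests(cmd, env=None):
    """Run a test command (placeholder, e.g. ["make", "check"]) without OMP_PROC_BIND."""
    result = subprocess.run(cmd, env=env_without_proc_bind(env), check=False)
    return result.returncode
```

Working on a copy keeps the change local to the test step, so the rest of the build still sees whatever the site profile script has set.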

@mabraham

mabraham commented Oct 3, 2024

I made https://gitlab.com/gromacs/gromacs/-/issues/5170 to follow up

Member

@boegel boegel left a comment

lgtm

@boegel
Member

boegel commented Oct 4, 2024

Going in, thanks @bedroge!

@boegel boegel merged commit 8e4d509 into easybuilders:develop Oct 4, 2024
9 checks passed

5 participants