Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CMake MPI preset: update logic and extend documentation #1300

Merged
merged 18 commits into from
Mar 25, 2025

Conversation

albestro
Copy link
Collaborator

@albestro albestro commented Mar 4, 2025

Changelog:

  • replace MPIEXEC_NUMCORES with MPIEXEC_NUMCORES_PER_RANK (see details below)
  • MPIEXEC_MAX_NUMPROCS is considered just for plain-mpi
  • add missing doc and extend with information about MPIEXEC_MAX_NUMPROCS/MPIEXEC_NUMCORES_PER_RANK usage
  • moved dlaf_setup_mpi_preset to DLAF_AddTest cmake script (it seems a more appropriate place)

Depending on the DLAF_MPI_PRESET we have two different behaviours.

PRESET "plain-mpi"
--pika:bind=none --pika:threads=N
with N=
"slurm"/"custom"
srun -n MPIRANKS ${MPIEXEC_NUMCORE_FLAG}=N
with N=
"before" MPIEXEC_MAX_NUMPROCS/MPIRANKS MPIEXEC_NUMCORES/MPIRANKS
"after" MPIEXEC_MAX_NUMPROCS/MPIRANKS MPIEXEC_NUMCORES_PER_RANK

The behaviour will change just for slurm/custom preset.

Some notes/reasons for the change are:

  • now it is possible to configure a test over more than 1 nodes (previously it was problematic, e.g. 6 ranks case on 2 nodes)
  • by fixing the number of cores we rely on slurm (and the user providing a valid configuration value) to not have ranks across modules (e.g. 1 rank will not use all nodes resources, ending up using more modules with a single GPU)

MPIEXEC_MAX_NUMPROCS is used just for plain-mpi and not in other presets. I'm not sure how helpful would be to have setting it also for other presets and what kind of additional check we could add (e.g. some kind of upper bound check on configuration).

@albestro albestro added this to the v0.9.0 milestone Mar 4, 2025
@albestro albestro requested review from msimberg, rasolca and RMeli March 4, 2025 09:24
@albestro albestro self-assigned this Mar 4, 2025
@albestro albestro force-pushed the alby/minor-doc-fix-mpi branch from 368dc83 to a1cf1c5 Compare March 4, 2025 10:16
@albestro
Copy link
Collaborator Author

albestro commented Mar 5, 2025

On ALPS Daint

  -DDLAF_MPI_PRESET=slurm\
  -DMPIEXEC_NUMCORES_PER_RANK=32\
  -DMPIEXEC_PREFLAGS="--gpus-per-task=1"\

(an extent of the full list, just for reference)

24: Test command: /capstor/scratch/cscs/ialberto/dla-future.develop/build/test/unit/common/test_single_threaded_blas
25: Test command: /usr/bin/srun "-n" "2" "-c" "32" "--gpus-per-task=1" "/capstor/scratch/cscs/ialberto/dla-future.develop/build/test/unit/communication/test_message"
26: Test command: /usr/bin/srun "-n" "4" "-c" "32" "--gpus-per-task=1" "/capstor/scratch/cscs/ialberto/dla-future.develop/build/test/unit/communication/test_communicator"
27: Test command: /usr/bin/srun "-n" "6" "-c" "32" "--gpus-per-task=1" "/capstor/scratch/cscs/ialberto/dla-future.develop/build/test/unit/communication/test_communicator_grid"

@rasolca
Copy link
Collaborator

rasolca commented Mar 6, 2025

cscs-ci run

1 similar comment
@rasolca
Copy link
Collaborator

rasolca commented Mar 7, 2025

cscs-ci run

"Flag used by MPI to specify the number of cores per rank for mpiexec. If not empty, you have to specify also the number of cores available per node in MPIEXEC_NUMCORES_PER_RANK."
FORCE
)
set(MPIEXEC_NUMCORES_PER_RANK "1" CACHE STRING "Number of cores used by each MPI rank.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a good default?

@albestro albestro force-pushed the alby/minor-doc-fix-mpi branch from 4af1881 to 1b8f898 Compare March 17, 2025 09:34
@rasolca
Copy link
Collaborator

rasolca commented Mar 17, 2025

cscs-ci run

@msimberg
Copy link
Collaborator

@albestro you may already have noticed, but in case not: something in the cmake config does not behave as before for CI: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/9430728485.

@albestro albestro force-pushed the alby/minor-doc-fix-mpi branch from 0de4aaa to aaccfa3 Compare March 21, 2025 15:10
@albestro
Copy link
Collaborator Author

cscs-ci run

@codecov-commenter
Copy link

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.07%. Comparing base (9f7a730) to head (aaccfa3).

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1300   +/-   ##
=======================================
  Coverage   95.07%   95.07%           
=======================================
  Files         141      141           
  Lines        8654     8654           
  Branches     1110     1110           
=======================================
  Hits         8228     8228           
  Misses        239      239           
  Partials      187      187           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@msimberg msimberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@albestro just because it's difficult to see from the pushed changes, what was the issue in CI? What did you change to fix it?

@albestro
Copy link
Collaborator Author

@albestro just because it's difficult to see from the pushed changes, what was the issue in CI? What did you change to fix it?

I think I've got bite by CMake lists, since I forgot to quote docstrings.

The commit failing in CI was 3bbb283, so that we can look at the status of the code at that time.

In particular the problem was that this docstring was defined as a "list" of strings (for formatting convenience) instead of a simple string

set(MPIEXEC_NUMPROC_FLAG_DESCRIPTION
"Flag used by MPI to specify the number of processes for mpiexec; "
"the next option will be the number of processes."
)

Then it was getting used unquoted

set(MPIEXEC_NUMCORE_FLAG "" CACHE STRING ${MPIEXEC_NUMCORE_FLAG_DESCRIPTION})

Resulting in CMake setting the variable value with a list instead of picking the CACHE version of set. The problem wasn't that it was not setting a CACHE variable.
Instead the value of the variable instead of being set to empty, it was set to a list with values like the following one

';CACHE;STRING;Flag used by MPI to specify the number of cores per rank for mpiexec. ;If not empty, you have to specify also the number of cores available per rank in MPIEXEC_NUMCORES_PER_RANK.'

which for what concerns the check this variable cannot be considered empty.

Curiosity: this check would have not triggered if both variables in the check had a docstring hard-coded with a list of string instead of a simple string (it wasn't the case because the other one didn't need multiple strings for formatting convenience).

I fixed it by using docstrings quoted, so CMake turns a list in a space separated string (as we need it).

@rasolca rasolca merged commit 7ce9103 into master Mar 25, 2025
6 checks passed
@rasolca rasolca deleted the alby/minor-doc-fix-mpi branch March 25, 2025 14:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants