Commit 3646610

Rename flux_executor_pmi_mode to pmi_mode (#762)
* Debug slurm pmi options
* Update pipeline.yml
* disable MPI parallel test
* sinfo
* update output
* add mpi parallel test
* core validation is only possible on compute node not on login node
* fix test
* enforce pmix for testing
* Update slurmspawner.py
* downgrade to openmpi
* remove pmi setting
* use slurm environment
* Try pmix again
* Update slurmspawner.py
* Update slurmspawner.py
* Update pipeline.yml
* Update pipeline.yml
* Update slurmspawner.py
* Update pipeline.yml
* use slrum args
* extend test
* fix tests
* Add executor_pmi_mode option
* check all files
* fixes
* extend tests
* rename to pmi_mode
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* fixes

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 2545358 commit 3646610

20 files changed: +628 / -77 lines

.github/workflows/pipeline.yml

Lines changed: 5 additions & 1 deletion

```diff
@@ -280,6 +280,8 @@ jobs:
       - uses: actions/checkout@v4
       - uses: koesterlab/setup-slurm-action@v1
         timeout-minutes: 5
+      - name: ubnuntu install
+        run: sudo apt install -y mpich
       - name: Conda config
         shell: bash -l {0}
         run: echo -e "channels:\n  - conda-forge\n" > .condarc
@@ -295,8 +297,10 @@ jobs:
         run: |
           pip install . --no-deps --no-build-isolation
           cd tests
-          python -m unittest test_slurmclusterexecutor.py
+          sinfo -o "%n %e %m %a %c %C"
+          srun --mpi=list
           python -m unittest test_slurmjobexecutor.py
+          python -m unittest test_slurmclusterexecutor.py

   unittest_mpich:
     needs: [black]
```

docs/installation.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -120,7 +120,7 @@ For the version 5 of openmpi the backend changed to `pmix`, this requires the ad
 ```
 conda install -c conda-forge flux-core flux-sched flux-pmix openmpi>=5 executorlib
 ```
-In addition, the `flux_executor_pmi_mode="pmix"` parameter has to be set for the `FluxJobExecutor` or the
+In addition, the `pmi_mode="pmix"` parameter has to be set for the `FluxJobExecutor` or the
 `FluxClusterExecutor` to switch to `pmix` as backend.

 ### Test Flux Framework
````
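Since the commit only renames the keyword argument, call sites that still pass the old spelling need a one-line update. A minimal, hypothetical compatibility shim (not shipped by executorlib; the name `rename_pmi_kwarg` is invented here for illustration) sketches the mapping:

```python
def rename_pmi_kwarg(kwargs: dict) -> dict:
    """Map the pre-#762 keyword flux_executor_pmi_mode to the new pmi_mode name.

    Hypothetical helper for illustration only; an explicitly passed pmi_mode
    takes precedence over the legacy spelling.
    """
    kwargs = dict(kwargs)  # do not mutate the caller's dict
    if "flux_executor_pmi_mode" in kwargs:
        legacy = kwargs.pop("flux_executor_pmi_mode")
        kwargs.setdefault("pmi_mode", legacy)
    return kwargs
```

For example, `rename_pmi_kwarg({"flux_executor_pmi_mode": "pmix"})` yields `{"pmi_mode": "pmix"}`, which matches the new `FluxJobExecutor` signature.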

executorlib/executor/flux.py

Lines changed: 14 additions & 14 deletions

```diff
@@ -43,8 +43,8 @@ class FluxJobExecutor(BaseExecutor):
                 compute notes. Defaults to False.
             - error_log_file (str): Name of the error log file to use for storing exceptions raised
               by the Python functions submitted to the Executor.
+        pmi_mode (str): PMI interface to use (OpenMPI v5 requires pmix) default is None
         flux_executor (flux.job.FluxExecutor): Flux Python interface to submit the workers to flux
-        flux_executor_pmi_mode (str): PMI interface to use (OpenMPI v5 requires pmix) default is None (Flux only)
         flux_executor_nesting (bool): Provide hierarchically nested Flux job scheduler inside the submitted function.
         flux_log_files (bool, optional): Write flux stdout and stderr files. Defaults to False.
         hostname_localhost (boolean): use localhost instead of the hostname to establish the zmq connection. In the
@@ -93,8 +93,8 @@ def __init__(
         cache_directory: Optional[str] = None,
         max_cores: Optional[int] = None,
         resource_dict: Optional[dict] = None,
+        pmi_mode: Optional[str] = None,
         flux_executor=None,
-        flux_executor_pmi_mode: Optional[str] = None,
         flux_executor_nesting: bool = False,
         flux_log_files: bool = False,
         hostname_localhost: Optional[bool] = None,
@@ -130,8 +130,8 @@ def __init__(
                 compute notes. Defaults to False.
             - error_log_file (str): Name of the error log file to use for storing exceptions
               raised by the Python functions submitted to the Executor.
+        pmi_mode (str): PMI interface to use (OpenMPI v5 requires pmix) default is None
         flux_executor (flux.job.FluxExecutor): Flux Python interface to submit the workers to flux
-        flux_executor_pmi_mode (str): PMI interface to use (OpenMPI v5 requires pmix) default is None (Flux only)
         flux_executor_nesting (bool): Provide hierarchically nested Flux job scheduler inside the submitted function.
         flux_log_files (bool, optional): Write flux stdout and stderr files. Defaults to False.
         hostname_localhost (boolean): use localhost instead of the hostname to establish the zmq connection. In the
@@ -175,8 +175,8 @@ def __init__(
             cache_directory=cache_directory,
             max_cores=max_cores,
             resource_dict=resource_dict,
+            pmi_mode=pmi_mode,
             flux_executor=flux_executor,
-            flux_executor_pmi_mode=flux_executor_pmi_mode,
             flux_executor_nesting=flux_executor_nesting,
             flux_log_files=flux_log_files,
             hostname_localhost=hostname_localhost,
@@ -199,8 +199,8 @@ def __init__(
             cache_directory=cache_directory,
             max_cores=max_cores,
             resource_dict=resource_dict,
+            pmi_mode=pmi_mode,
             flux_executor=flux_executor,
-            flux_executor_pmi_mode=flux_executor_pmi_mode,
             flux_executor_nesting=flux_executor_nesting,
             flux_log_files=flux_log_files,
             hostname_localhost=hostname_localhost,
@@ -236,7 +236,7 @@ class FluxClusterExecutor(BaseExecutor):
             - error_log_file (str): Name of the error log file to use for storing exceptions raised
               by the Python functions submitted to the Executor.
         pysqa_config_directory (str, optional): path to the pysqa config directory (only for pysqa based backend).
-        flux_executor_pmi_mode (str): PMI interface to use (OpenMPI v5 requires pmix) default is None (Flux only)
+        pmi_mode (str): PMI interface to use (OpenMPI v5 requires pmix) default is None
         hostname_localhost (boolean): use localhost instead of the hostname to establish the zmq connection. In the
                                       context of an HPC cluster this essential to be able to communicate to an
                                       Executor running on a different compute node within the same allocation. And
@@ -284,7 +284,7 @@ def __init__(
         max_cores: Optional[int] = None,
         resource_dict: Optional[dict] = None,
         pysqa_config_directory: Optional[str] = None,
-        flux_executor_pmi_mode: Optional[str] = None,
+        pmi_mode: Optional[str] = None,
         hostname_localhost: Optional[bool] = None,
         block_allocation: bool = False,
         init_function: Optional[Callable] = None,
@@ -319,7 +319,7 @@ def __init__(
             - error_log_file (str): Name of the error log file to use for storing exceptions
               raised by the Python functions submitted to the Executor.
         pysqa_config_directory (str, optional): path to the pysqa config directory (only for pysqa based backend).
-        flux_executor_pmi_mode (str): PMI interface to use (OpenMPI v5 requires pmix) default is None (Flux only)
+        pmi_mode (str): PMI interface to use (OpenMPI v5 requires pmix) default is None
         hostname_localhost (boolean): use localhost instead of the hostname to establish the zmq connection. In the
                                       context of an HPC cluster this essential to be able to communicate to an
                                       Executor running on a different compute node within the same allocation. And
@@ -369,7 +369,7 @@ def __init__(
             cache_directory=cache_directory,
             resource_dict=resource_dict,
             flux_executor=None,
-            flux_executor_pmi_mode=flux_executor_pmi_mode,
+            pmi_mode=pmi_mode,
             flux_executor_nesting=False,
             flux_log_files=False,
             pysqa_config_directory=pysqa_config_directory,
@@ -387,8 +387,8 @@ def __init__(
             cache_directory=cache_directory,
             max_cores=max_cores,
             resource_dict=resource_dict,
+            pmi_mode=None,
             flux_executor=None,
-            flux_executor_pmi_mode=None,
             flux_executor_nesting=False,
             flux_log_files=False,
             hostname_localhost=hostname_localhost,
@@ -408,8 +408,8 @@ def create_flux_executor(
     max_cores: Optional[int] = None,
     cache_directory: Optional[str] = None,
     resource_dict: Optional[dict] = None,
+    pmi_mode: Optional[str] = None,
     flux_executor=None,
-    flux_executor_pmi_mode: Optional[str] = None,
     flux_executor_nesting: bool = False,
     flux_log_files: bool = False,
     hostname_localhost: Optional[bool] = None,
@@ -437,8 +437,8 @@ def create_flux_executor(
                 compute notes. Defaults to False.
             - error_log_file (str): Name of the error log file to use for storing exceptions raised
               by the Python functions submitted to the Executor.
+        pmi_mode (str): PMI interface to use (OpenMPI v5 requires pmix) default is None
         flux_executor (flux.job.FluxExecutor): Flux Python interface to submit the workers to flux
-        flux_executor_pmi_mode (str): PMI interface to use (OpenMPI v5 requires pmix) default is None (Flux only)
         flux_executor_nesting (bool): Provide hierarchically nested Flux job scheduler inside the submitted function.
         flux_log_files (bool, optional): Write flux stdout and stderr files. Defaults to False.
         hostname_localhost (boolean): use localhost instead of the hostname to establish the zmq connection. In the
@@ -470,7 +470,7 @@ def create_flux_executor(
     resource_dict["hostname_localhost"] = hostname_localhost
     resource_dict["log_obj_size"] = log_obj_size
     check_init_function(block_allocation=block_allocation, init_function=init_function)
-    check_pmi(backend="flux_allocation", pmi=flux_executor_pmi_mode)
+    check_pmi(backend="flux_allocation", pmi=pmi_mode)
     check_oversubscribe(oversubscribe=resource_dict.get("openmpi_oversubscribe", False))
     check_command_line_argument_lst(
         command_line_argument_lst=resource_dict.get("slurm_cmd_args", [])
@@ -479,8 +479,8 @@ def create_flux_executor(
         del resource_dict["openmpi_oversubscribe"]
     if "slurm_cmd_args" in resource_dict:
         del resource_dict["slurm_cmd_args"]
+    resource_dict["pmi_mode"] = pmi_mode
     resource_dict["flux_executor"] = flux_executor
-    resource_dict["flux_executor_pmi_mode"] = flux_executor_pmi_mode
     resource_dict["flux_executor_nesting"] = flux_executor_nesting
     resource_dict["flux_log_files"] = flux_log_files
     if block_allocation:
```

executorlib/executor/single.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -329,7 +329,7 @@ def __init__(
             cache_directory=cache_directory,
             resource_dict=resource_dict,
             flux_executor=None,
-            flux_executor_pmi_mode=None,
+            pmi_mode=None,
             flux_executor_nesting=False,
             flux_log_files=False,
             pysqa_config_directory=None,
```

executorlib/executor/slurm.py

Lines changed: 12 additions & 1 deletion

```diff
@@ -44,6 +44,7 @@ class SlurmClusterExecutor(BaseExecutor):
             - error_log_file (str): Name of the error log file to use for storing exceptions raised
               by the Python functions submitted to the Executor.
         pysqa_config_directory (str, optional): path to the pysqa config directory (only for pysqa based backend).
+        pmi_mode (str): PMI interface to use (OpenMPI v5 requires pmix) default is None
         hostname_localhost (boolean): use localhost instead of the hostname to establish the zmq connection. In the
                                       context of an HPC cluster this essential to be able to communicate to an
                                       Executor running on a different compute node within the same allocation. And
@@ -91,6 +92,7 @@ def __init__(
         max_cores: Optional[int] = None,
         resource_dict: Optional[dict] = None,
         pysqa_config_directory: Optional[str] = None,
+        pmi_mode: Optional[str] = None,
         hostname_localhost: Optional[bool] = None,
         block_allocation: bool = False,
         init_function: Optional[Callable] = None,
@@ -125,6 +127,7 @@ def __init__(
             - error_log_file (str): Name of the error log file to use for storing exceptions
               raised by the Python functions submitted to the Executor.
         pysqa_config_directory (str, optional): path to the pysqa config directory (only for pysqa based backend).
+        pmi_mode (str): PMI interface to use (OpenMPI v5 requires pmix) default is None
         hostname_localhost (boolean): use localhost instead of the hostname to establish the zmq connection. In the
                                       context of an HPC cluster this essential to be able to communicate to an
                                       Executor running on a different compute node within the same allocation. And
@@ -173,8 +176,8 @@ def __init__(
             max_cores=max_cores,
             cache_directory=cache_directory,
             resource_dict=resource_dict,
+            pmi_mode=pmi_mode,
             flux_executor=None,
-            flux_executor_pmi_mode=None,
             flux_executor_nesting=False,
             flux_log_files=False,
             pysqa_config_directory=pysqa_config_directory,
@@ -232,6 +235,7 @@ class SlurmJobExecutor(BaseExecutor):
                 compute notes. Defaults to False.
             - error_log_file (str): Name of the error log file to use for storing exceptions raised
               by the Python functions submitted to the Executor.
+        pmi_mode (str): PMI interface to use (OpenMPI v5 requires pmix) default is None
         hostname_localhost (boolean): use localhost instead of the hostname to establish the zmq connection. In the
                                       context of an HPC cluster this essential to be able to communicate to an
                                       Executor running on a different compute node within the same allocation. And
@@ -278,6 +282,7 @@ def __init__(
         cache_directory: Optional[str] = None,
         max_cores: Optional[int] = None,
         resource_dict: Optional[dict] = None,
+        pmi_mode: Optional[str] = None,
         hostname_localhost: Optional[bool] = None,
         block_allocation: bool = False,
         init_function: Optional[Callable] = None,
@@ -315,6 +320,7 @@ def __init__(
                 compute notes. Defaults to False.
             - error_log_file (str): Name of the error log file to use for storing exceptions
               raised by the Python functions submitted to the Executor.
+        pmi_mode (str): PMI interface to use (OpenMPI v5 requires pmix) default is None
         hostname_localhost (boolean): use localhost instead of the hostname to establish the zmq connection. In the
                                       context of an HPC cluster this essential to be able to communicate to an
                                       Executor running on a different compute node within the same allocation. And
@@ -356,6 +362,7 @@ def __init__(
             cache_directory=cache_directory,
             max_cores=max_cores,
             resource_dict=resource_dict,
+            pmi_mode=pmi_mode,
             hostname_localhost=hostname_localhost,
             block_allocation=block_allocation,
             init_function=init_function,
@@ -376,6 +383,7 @@ def __init__(
             cache_directory=cache_directory,
             max_cores=max_cores,
             resource_dict=resource_dict,
+            pmi_mode=pmi_mode,
             hostname_localhost=hostname_localhost,
             block_allocation=block_allocation,
             init_function=init_function,
@@ -389,6 +397,7 @@ def create_slurm_executor(
     max_cores: Optional[int] = None,
     cache_directory: Optional[str] = None,
     resource_dict: Optional[dict] = None,
+    pmi_mode: Optional[str] = None,
     hostname_localhost: Optional[bool] = None,
     block_allocation: bool = False,
     init_function: Optional[Callable] = None,
@@ -418,6 +427,7 @@ def create_slurm_executor(
                 compute notes. Defaults to False.
             - error_log_file (str): Name of the error log file to use for storing exceptions raised
               by the Python functions submitted to the Executor.
+        pmi_mode (str): PMI interface to use (OpenMPI v5 requires pmix) default is None
         hostname_localhost (boolean): use localhost instead of the hostname to establish the zmq connection. In the
                                       context of an HPC cluster this essential to be able to communicate to an
                                       Executor running on a different compute node within the same allocation. And
@@ -441,6 +451,7 @@ def create_slurm_executor(
     resource_dict["cache_directory"] = cache_directory
     resource_dict["hostname_localhost"] = hostname_localhost
     resource_dict["log_obj_size"] = log_obj_size
+    resource_dict["pmi_mode"] = pmi_mode
     check_init_function(block_allocation=block_allocation, init_function=init_function)
     if block_allocation:
         resource_dict["init_function"] = init_function
```
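The create_slurm_executor change follows the existing pattern of folding keyword arguments into `resource_dict` before dispatch, with `pmi_mode` now added alongside `cache_directory` and friends. A simplified, self-contained sketch of that pattern (the name `build_resource_dict` and the reduced parameter list are assumptions for illustration, not the actual function):

```python
from typing import Optional


def build_resource_dict(
    resource_dict: Optional[dict] = None,
    cache_directory: Optional[str] = None,
    pmi_mode: Optional[str] = None,
) -> dict:
    """Sketch of how create_slurm_executor folds keyword args into resource_dict."""
    resource_dict = dict(resource_dict or {})
    resource_dict["cache_directory"] = cache_directory
    resource_dict["pmi_mode"] = pmi_mode  # the key added by this commit
    return resource_dict
```

The spawner can then read `resource_dict["pmi_mode"]` without every intermediate layer threading a separate keyword through.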

executorlib/standalone/command.py

Lines changed: 10 additions & 14 deletions

```diff
@@ -21,7 +21,7 @@ def get_cache_execute_command(
     file_name: str,
     cores: int = 1,
     backend: Optional[str] = None,
-    flux_executor_pmi_mode: Optional[str] = None,
+    pmi_mode: Optional[str] = None,
 ) -> list:
     """
     Get command to call backend as a list of two strings
@@ -30,7 +30,7 @@ def get_cache_execute_command(
         file_name (str): The name of the file.
         cores (int, optional): Number of cores used to execute the task. Defaults to 1.
         backend (str, optional): name of the backend used to spawn tasks ["slurm", "flux"].
-        flux_executor_pmi_mode (str): PMI interface to use (OpenMPI v5 requires pmix) default is None (Flux only)
+        pmi_mode (str): PMI interface to use (OpenMPI v5 requires pmix) default is None (Flux only)

     Returns:
         list[str]: List of strings containing the python executable path and the backend script to execute
@@ -44,25 +44,21 @@ def get_cache_execute_command(
             + [get_command_path(executable="cache_parallel.py"), file_name]
         )
     elif backend == "slurm":
+        command_prepend = ["srun", "-n", str(cores)]
+        if pmi_mode is not None:
+            command_prepend += ["--mpi=" + pmi_mode]
         command_lst = (
-            ["srun", "-n", str(cores)]
+            command_prepend
             + command_lst
             + [get_command_path(executable="cache_parallel.py"), file_name]
         )
     elif backend == "flux":
-        if flux_executor_pmi_mode is not None:
-            flux_command = [
-                "flux",
-                "run",
-                "-o",
-                "pmi=" + flux_executor_pmi_mode,
-                "-n",
-                str(cores),
-            ]
-        else:
-            flux_command = ["flux", "run", "-n", str(cores)]
+        flux_command = ["flux", "run"]
+        if pmi_mode is not None:
+            flux_command += ["-o", "pmi=" + pmi_mode]
         command_lst = (
             flux_command
+            + ["-n", str(cores)]
             + command_lst
             + [get_command_path(executable="cache_parallel.py"), file_name]
         )
```
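The restructured branches reduce to prepending the launcher plus an optional PMI flag: `srun --mpi=<mode>` for SLURM and `flux run -o pmi=<mode>` for Flux. A self-contained sketch of just the launcher portion (the real function also appends the worker command and the backend script path via `get_command_path`, omitted here; `launcher_prefix` is an invented name):

```python
from typing import Optional


def launcher_prefix(backend: str, cores: int, pmi_mode: Optional[str] = None) -> list:
    """Build the launcher part of the cache execute command, as restructured in this commit."""
    if backend == "slurm":
        prefix = ["srun", "-n", str(cores)]
        if pmi_mode is not None:
            prefix += ["--mpi=" + pmi_mode]  # e.g. --mpi=pmix for OpenMPI v5
    elif backend == "flux":
        prefix = ["flux", "run"]
        if pmi_mode is not None:
            prefix += ["-o", "pmi=" + pmi_mode]  # e.g. -o pmi=pmix
        prefix += ["-n", str(cores)]
    else:
        raise ValueError(f"unknown backend: {backend}")
    return prefix
```

For example, `launcher_prefix("slurm", 4, "pmix")` yields `["srun", "-n", "4", "--mpi=pmix"]`, and with `pmi_mode=None` both backends simply skip the PMI flag.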
