[5.0.0] How to set ssh port for openmpi 5.0.0 ? #12090
Instead of |
You can also modify your |
I am using k8s for training, so the hostname changes every time I submit a job (but the port remains the same). |
when I added
my full command is: mpirun -bind-to none -map-by slot -oversubscribe -mca pml ob1 --prtemca plm_ssh_args "-p 6666" -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,65536,32 \
-H job-170041093482802585476-yihua-zhou-master-0.job-170041093482802585476-yihua-zhou:1 -x NCCL_P2P_DISABLE=1 -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -np 1 \
python3 -c "import horovod.torch as hvd; hvd.init(); print(hvd.rank(), hvd.local_rank(), hvd.size())" : \
-H job-170041093482802585476-yihua-zhou-worker-0.job-170041093482802585476-yihua-zhou:1 -x NCCL_P2P_DISABLE=1 -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -np 1 \
python3 -c "import horovod.torch as hvd; hvd.init(); print(hvd.rank(), hvd.local_rank(), hvd.size())" : |
Please add |
ssh config supports wildcard patterns in the Host entry ('*' and '?', plus '!' for negation; these are glob-style patterns, not full regular expressions), giving you some flexibility on matching patterns. Try something like this in the
|
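A minimal sketch of the ssh config approach suggested above, assuming the pod hostnames all start with "job-" as they do in the commands in this thread. The pattern `job-*` and the scratch file name `ssh_config.sketch` are illustrative assumptions, not something taken from the original comment; the port 6666 is the one used in this issue.

```shell
# Sketch only: "job-*" is a hypothetical pattern; adjust it to your pod names.
# Written to a scratch file so it does not touch an existing ~/.ssh/config.
cfg="./ssh_config.sketch"
cat > "$cfg" <<'EOF'
# Any host whose name starts with "job-" is reached on port 6666.
Host job-*
    Port 6666
EOF

# To see what ssh would resolve for a given host (needs OpenSSH 6.8 or newer):
#   ssh -F "$cfg" -G job-example-worker-0 | grep -i '^port '
```

If the same two lines are placed in ~/.ssh/config on the node that runs mpirun, the ssh launcher should pick up the port without any plm_ssh_args setting, since the hostname changes per job but still matches the pattern.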
it seems the cmdline parser has added an extra '-' so that '-p 6666' is parsed into '--p 6666':
# cmd
mpirun -bind-to none -map-by slot -oversubscribe -mca pml ob1 --prtemca plm_ssh_args "-p 6666" --prtemca plm_base_verbose 5 -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,65536,32 \
-H job-170041402698701665080-yihua-zhou-master-0.job-170041402698701665080-yihua-zhou:1 -x NCCL_P2P_DISABLE=1 -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -np 1 \
python3 -c "import horovod.torch as hvd; hvd.init(); print(hvd.rank(), hvd.local_rank(), hvd.size())" : \
-H job-170041402698701665080-yihua-zhou-worker-0.job-170041402698701665080-yihua-zhou:1 -x NCCL_P2P_DISABLE=1 -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -np 1 \
python3 -c "import horovod.torch as hvd; hvd.init(); print(hvd.rank(), hvd.local_rank(), hvd.size())" :
# err msg
[job-170041402698701665080-yihua-zhou-master-0:00049] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:base:receive start comm
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:base:setup_vm
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:base:setup_vm creating map
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] setup:vm: working unmanaged allocation
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] using dash_host
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] checking node job-170041402698701665080-yihua-zhou-master-0
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] ignoring myself
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] checking node job-170041402698701665080-yihua-zhou-worker-0
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:base:setup_vm add new daemon [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,1]
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:base:setup_vm assigning new daemon [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,1] to node job-170041402698701665080-yihua-zhou-worker-0
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: launching vm
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: local shell: 0 (bash)
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: assuming same remote shell as local shell
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: remote shell: 0 (bash)
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: final template argv:
/usr/bin/ssh --p 6666 <template> PRTE_PREFIX=/usr/local;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-job-170041402698701665080-yihua-zhou-master-0-49@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-job-170041402698701665080-yihua-zhou-master-0-49@0.0;tcp://10.248.155.127:43189:32" --prtemca plm_ssh_args "--p 6666" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-job-170041402698701665080-yihua-zhou-master-0-49@0.0;tcp://10.248.155.127:43189:32"
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh:launch daemon 0 not a child of mine
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: adding node job-170041402698701665080-yihua-zhou-worker-0 to launch list
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: activating launch event
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: recording launch of daemon [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,1]
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh --p 6666 job-170041402698701665080-yihua-zhou-worker-0.job-170041402698701665080-yihua-zhou PRTE_PREFIX=/usr/local;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-job-170041402698701665080-yihua-zhou-master-0-49@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-job-170041402698701665080-yihua-zhou-master-0-49@0.0;tcp://10.248.155.127:43189:32" --prtemca plm_ssh_args "--p 6666" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-job-170041402698701665080-yihua-zhou-master-0-49@0.0;tcp://10.248.155.127:43189:32"]
unknown option -- -
usage: ssh [-46AaCfGgKkMNnqsTtVvXxYy] [-b bind_address] [-c cipher_spec]
[-D [bind_address:]port] [-E log_file] [-e escape_char]
[-F configfile] [-I pkcs11] [-i identity_file]
[-J [user@]host[:port]] [-L address] [-l login_name] [-m mac_spec]
[-O ctl_cmd] [-o option] [-p port] [-Q query_option] [-R address]
[-S ctl_path] [-W host:port] [-w local_tun[:remote_tun]]
[user@]hostname [command]
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] daemon 1 failed with status 255
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:base:receive stop comm
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.
HNP daemon : [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] on node job-170041402698701665080-yihua-zhou-master-0
Remote daemon: [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,1] on node job-170041402698701665080-yihua-zhou-worker-0
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
And if I remove the
|
it seems ssh takes no
$ ssh --port
unknown option -- -
usage: ssh [-46AaCfGgKkMNnqsTtVvXxYy] [-b bind_address] [-c cipher_spec]
[-D [bind_address:]port] [-E log_file] [-e escape_char]
[-F configfile] [-I pkcs11] [-i identity_file]
[-J [user@]host[:port]] [-L address] [-l login_name] [-m mac_spec]
[-O ctl_cmd] [-o option] [-p port] [-Q query_option] [-R address]
[-S ctl_path] [-W host:port] [-w local_tun[:remote_tun]]
[user@]hostname [command] |
Right - this is a bug in the cmd line parser. I'll take a look at it. |
Fixed in openpmix/prrte#1859 |
Fixed, closing |
Background information
I am upgrading OpenMPI from 4.1.6 to 5.0.0 and found that the
-mca plm_rsh_args "-p 6666"
option fails.
Details of the problem
For 4.1.6, this code works:
For 5.0.0, I see I can ignore the
-mca orte_keep_fqdn_hostnames 1
param; however, it is trying to communicate via ssh port 22:
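For reference, the two parameter spellings that appear in this thread can be summarized in a small helper. This is only a sketch: the function name `ssh_args_param` is hypothetical, and it assumes the spelling seen for 4.1.6 holds for the whole 4.x series and the one seen for 5.0.0 holds for 5.x. Note that in 5.0.0 itself the value is mangled by the parser bug discussed above (fixed in openpmix/prrte#1859).

```shell
# Hypothetical helper: print the MCA parameter that carries extra ssh
# arguments, based on the Open MPI version string passed as $1.
ssh_args_param() {
  case "$1" in
    4.*) echo "-mca plm_rsh_args" ;;        # spelling reported to work with 4.1.6
    5.*) echo "--prtemca plm_ssh_args" ;;   # spelling used with 5.0.0 in this thread
    *)   echo "unsupported Open MPI version: $1" >&2; return 1 ;;
  esac
}

ssh_args_param 4.1.6   # prints: -mca plm_rsh_args
ssh_args_param 5.0.0   # prints: --prtemca plm_ssh_args
```

The returned string would then be followed by the quoted ssh arguments, for example "-p 6666", on the mpirun command line.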