
[5.0.0] How to set ssh port for openmpi 5.0.0? #12090

Closed
SimZhou opened this issue Nov 17, 2023 · 11 comments

SimZhou commented Nov 17, 2023

Background information

I am upgrading Open MPI from 4.1.6 to 5.0.0 and found that -mca plm_rsh_args "-p 6666" no longer works.

Details of the problem

For 4.1.6, this command works:

# cmd
mpirun -bind-to none -map-by slot -oversubscribe -mca pml ob1 -mca orte_keep_fqdn_hostnames 1 -mca plm_rsh_args "-p 6666" -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,65536,32 \
-H job-170015108811505650188-yihua-zhou-master-0.job-170015108811505650188-yihua-zhou:1 -x NCCL_P2P_DISABLE=1 -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -np 1 \
python3 -c "import horovod.torch as hvd; hvd.init(); print(hvd.rank(), hvd.local_rank(), hvd.size())" : \
-H job-170015108811505650188-yihua-zhou-worker-0.job-170015108811505650188-yihua-zhou:1 -x NCCL_P2P_DISABLE=1 -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -np 1 \
python3 -c "import horovod.torch as hvd; hvd.init(); print(hvd.rank(), hvd.local_rank(), hvd.size())" :

# output
Warning: Permanently added '[job-170015108811505650188-yihua-zhou-worker-0.job-170015108811505650188-yihua-zhou]:6666,[10.255.105.100]:6666' (ECDSA) to the list of known hosts.
1 0 2
0 0 2

For 5.0.0, I see I can omit the -mca orte_keep_fqdn_hostnames 1 parameter; however, it tries to communicate via ssh port 22:

# cmd
mpirun -bind-to none -map-by slot -oversubscribe -mca pml ob1 -mca plm_rsh_args "-p 6666" -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,65536,32 \
-H job-170018583494673244266-yihua-zhou-master-0.job-170018583494673244266-yihua-zhou:1 -x NCCL_P2P_DISABLE=1 -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -np 1 \
python3 -c "import horovod.torch as hvd; hvd.init(); print(hvd.rank(), hvd.local_rank(), hvd.size())" : \
-H job-170018583494673244266-yihua-zhou-worker-0.job-170018583494673244266-yihua-zhou:1 -x NCCL_P2P_DISABLE=1 -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -np 1 \
python3 -c "import horovod.torch as hvd; hvd.init(); print(hvd.rank(), hvd.local_rank(), hvd.size())" :

# err
ssh: connect to host job-170018583494673244266-yihua-zhou-worker-0.job-170018583494673244266-yihua-zhou port 22: Connection refused
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-job-170018583494673244266-yihua-zhou-master-0-60@0,0] on node job-170018583494673244266-yihua-zhou-master-0
  Remote daemon: [prterun-job-170018583494673244266-yihua-zhou-master-0-60@0,1] on node job-170018583494673244266-yihua-zhou-worker-0

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
rhc54 (Contributor) commented Nov 17, 2023

Instead of -mca plm_rsh_args "-p 6666", use --prtemca plm_ssh_args "-p 6666"; the autotranslation is missing this conversion.

bosilca (Member) commented Nov 17, 2023

You can also modify your .ssh/config file to specify a different port for connecting to your cluster resources.

SimZhou (Author) commented Nov 19, 2023

You can also modify your .ssh/config file to specify a different port for connecting to your cluster resources.

I am using k8s for training, so the hostname changes every time I submit a job (but the port stays the same).
In this scenario, how should I modify my .ssh/config?

SimZhou (Author) commented Nov 19, 2023

Instead of -mca plm_rsh_args "-p 6666" use --prtemca plm_ssh_args "-p 6666" - autotranslation is missing this conversion.

When I added --prtemca plm_ssh_args "-p 6666", it reports:

unknown option -- -
usage: ssh [-46AaCfGgKkMNnqsTtVvXxYy] [-b bind_address] [-c cipher_spec]
           [-D [bind_address:]port] [-E log_file] [-e escape_char]
           [-F configfile] [-I pkcs11] [-i identity_file]
           [-J [user@]host[:port]] [-L address] [-l login_name] [-m mac_spec]
           [-O ctl_cmd] [-o option] [-p port] [-Q query_option] [-R address]
           [-S ctl_path] [-W host:port] [-w local_tun[:remote_tun]]
           [user@]hostname [command]
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-job-170041093482802585476-yihua-zhou-master-0-66@0,0] on node job-170041093482802585476-yihua-zhou-master-0
  Remote daemon: [prterun-job-170041093482802585476-yihua-zhou-master-0-66@0,1] on node job-170041093482802585476-yihua-zhou-worker-0

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

my full command is:

mpirun -bind-to none -map-by slot -oversubscribe -mca pml ob1 --prtemca plm_ssh_args "-p 6666" -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,65536,32 \
-H job-170041093482802585476-yihua-zhou-master-0.job-170041093482802585476-yihua-zhou:1 -x NCCL_P2P_DISABLE=1 -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -np 1 \
python3 -c "import horovod.torch as hvd; hvd.init(); print(hvd.rank(), hvd.local_rank(), hvd.size())" : \
-H job-170041093482802585476-yihua-zhou-worker-0.job-170041093482802585476-yihua-zhou:1 -x NCCL_P2P_DISABLE=1 -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -np 1 \
python3 -c "import horovod.torch as hvd; hvd.init(); print(hvd.rank(), hvd.local_rank(), hvd.size())" : 

rhc54 (Contributor) commented Nov 19, 2023

Please add --prtemca plm_base_verbose 5 so we can see the full ssh cmd line.

bosilca (Member) commented Nov 19, 2023

You can also modify your .ssh/config file to specify a different port for connecting to your cluster resources.

I am using k8s for training, so the hostname changes every time I submit a job (but the port stays the same). In this scenario, how should I modify my .ssh/config?

ssh config supports wildcard patterns in the hostname, giving you some flexibility in matching. Try something like this in the ${HOME}/.ssh/config of the head node of your platform.

Host *-yihua-zhou
  Port 6666
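
For ephemeral Kubernetes pods, the wildcard keeps matching even as the job ID changes, since only the suffix is fixed. A slightly fuller sketch (the host-key options are an assumption for throwaway pod hostnames, not part of the suggestion above; skip them if you maintain known_hosts some other way):

```
Host *-yihua-zhou
  Port 6666
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
```

With this in place, plain ssh (and therefore the launcher) connects on port 6666 without any extra mpirun arguments.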

SimZhou (Author) commented Nov 19, 2023

Please add --prtemca plm_base_verbose 5 so we can see the full ssh cmd line.

It seems the command-line parser has added an extra '-', so '-p 6666' is parsed into '--p 6666':

# cmd
mpirun -bind-to none -map-by slot -oversubscribe -mca pml ob1 --prtemca plm_ssh_args "-p 6666" --prtemca plm_base_verbose 5 -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,65536,32 \
-H job-170041402698701665080-yihua-zhou-master-0.job-170041402698701665080-yihua-zhou:1 -x NCCL_P2P_DISABLE=1 -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -np 1 \
python3 -c "import horovod.torch as hvd; hvd.init(); print(hvd.rank(), hvd.local_rank(), hvd.size())" : \
-H job-170041402698701665080-yihua-zhou-worker-0.job-170041402698701665080-yihua-zhou:1 -x NCCL_P2P_DISABLE=1 -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -np 1 \
python3 -c "import horovod.torch as hvd; hvd.init(); print(hvd.rank(), hvd.local_rank(), hvd.size())" : 

# err msg
[job-170041402698701665080-yihua-zhou-master-0:00049] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:base:receive start comm
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:base:setup_vm
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:base:setup_vm creating map
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] setup:vm: working unmanaged allocation
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] using dash_host
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] checking node job-170041402698701665080-yihua-zhou-master-0
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] ignoring myself
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] checking node job-170041402698701665080-yihua-zhou-worker-0
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:base:setup_vm add new daemon [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,1]
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:base:setup_vm assigning new daemon [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,1] to node job-170041402698701665080-yihua-zhou-worker-0
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: launching vm
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: local shell: 0 (bash)
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: assuming same remote shell as local shell
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: remote shell: 0 (bash)
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: final template argv:
        /usr/bin/ssh --p 6666 <template> PRTE_PREFIX=/usr/local;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-job-170041402698701665080-yihua-zhou-master-0-49@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-job-170041402698701665080-yihua-zhou-master-0-49@0.0;tcp://10.248.155.127:43189:32" --prtemca plm_ssh_args "--p 6666" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-job-170041402698701665080-yihua-zhou-master-0-49@0.0;tcp://10.248.155.127:43189:32"
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh:launch daemon 0 not a child of mine
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: adding node job-170041402698701665080-yihua-zhou-worker-0 to launch list
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: activating launch event
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: recording launch of daemon [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,1]
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh --p 6666 job-170041402698701665080-yihua-zhou-worker-0.job-170041402698701665080-yihua-zhou PRTE_PREFIX=/usr/local;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-job-170041402698701665080-yihua-zhou-master-0-49@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "2" --prtemca prte_hnp_uri "prterun-job-170041402698701665080-yihua-zhou-master-0-49@0.0;tcp://10.248.155.127:43189:32" --prtemca plm_ssh_args "--p 6666" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-job-170041402698701665080-yihua-zhou-master-0-49@0.0;tcp://10.248.155.127:43189:32"]
unknown option -- -
usage: ssh [-46AaCfGgKkMNnqsTtVvXxYy] [-b bind_address] [-c cipher_spec]
           [-D [bind_address:]port] [-E log_file] [-e escape_char]
           [-F configfile] [-I pkcs11] [-i identity_file]
           [-J [user@]host[:port]] [-L address] [-l login_name] [-m mac_spec]
           [-O ctl_cmd] [-o option] [-p port] [-Q query_option] [-R address]
           [-S ctl_path] [-W host:port] [-w local_tun[:remote_tun]]
           [user@]hostname [command]
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] daemon 1 failed with status 255
[job-170041402698701665080-yihua-zhou-master-0:00049] [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] plm:base:receive stop comm
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,0] on node job-170041402698701665080-yihua-zhou-master-0
  Remote daemon: [prterun-job-170041402698701665080-yihua-zhou-master-0-49@0,1] on node job-170041402698701665080-yihua-zhou-worker-0

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

And if I remove the leading '-' and pass only 'p 6666', then no '-' is added at all:

# cmd
mpirun -bind-to none -map-by slot -oversubscribe -mca pml ob1 --prtemca plm_ssh_args "p 6666" --prtemca plm_base_verbose 5 -mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,65536,32 \
-H job-170041402698701665080-yihua-zhou-master-0.job-170041402698701665080-yihua-zhou:1 -x NCCL_P2P_DISABLE=1 -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -np 1 \
python3 -c "import horovod.torch as hvd; hvd.init(); print(hvd.rank(), hvd.local_rank(), hvd.size())" : \
-H job-170041402698701665080-yihua-zhou-worker-0.job-170041402698701665080-yihua-zhou:1 -x NCCL_P2P_DISABLE=1 -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -np 1 \
python3 -c "import horovod.torch as hvd; hvd.init(); print(hvd.rank(), hvd.local_rank(), hvd.size())" : 

# err_msg
[job-170041437185078023828-yihua-zhou-master-0:00040] [[INVALID],0] plm:ssh_lookup on agent ssh : rsh path NULL
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:ssh_setup on agent ssh : rsh path NULL
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:base:receive start comm
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:base:setup_vm
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:base:setup_vm creating map
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] setup:vm: working unmanaged allocation
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] using dash_host
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] checking node job-170041402698701665080-yihua-zhou-master-0
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] checking node job-170041402698701665080-yihua-zhou-worker-0
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:base:setup_vm add new daemon [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,1]
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:base:setup_vm assigning new daemon [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,1] to node job-170041402698701665080-yihua-zhou-master-0
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:base:setup_vm add new daemon [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,2]
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:base:setup_vm assigning new daemon [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,2] to node job-170041402698701665080-yihua-zhou-worker-0
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:ssh: launching vm
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:ssh: local shell: 0 (bash)
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:ssh: assuming same remote shell as local shell
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:ssh: remote shell: 0 (bash)
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:ssh: final template argv:
        /usr/bin/ssh p 6666 <template> PRTE_PREFIX=/usr/local;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-job-170041437185078023828-yihua-zhou-master-0-40@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "3" --prtemca prte_hnp_uri "prterun-job-170041437185078023828-yihua-zhou-master-0-40@0.0;tcp://10.248.155.126:37997:32" --prtemca plm_ssh_args "p 6666" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-job-170041437185078023828-yihua-zhou-master-0-40@0.0;tcp://10.248.155.126:37997:32"
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:ssh:launch daemon 0 not a child of mine
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:ssh: adding node job-170041402698701665080-yihua-zhou-master-0 to launch list
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:ssh: adding node job-170041402698701665080-yihua-zhou-worker-0 to launch list
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:ssh: activating launch event
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:ssh: recording launch of daemon [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,1]
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:ssh: recording launch of daemon [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,2]
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh p 6666 job-170041402698701665080-yihua-zhou-master-0.job-170041402698701665080-yihua-zhou PRTE_PREFIX=/usr/local;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-job-170041437185078023828-yihua-zhou-master-0-40@0" --prtemca ess_base_vpid 1 --prtemca ess_base_num_procs "3" --prtemca prte_hnp_uri "prterun-job-170041437185078023828-yihua-zhou-master-0-40@0.0;tcp://10.248.155.126:37997:32" --prtemca plm_ssh_args "p 6666" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-job-170041437185078023828-yihua-zhou-master-0-40@0.0;tcp://10.248.155.126:37997:32"]
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh p 6666 job-170041402698701665080-yihua-zhou-worker-0.job-170041402698701665080-yihua-zhou PRTE_PREFIX=/usr/local;export PRTE_PREFIX;LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/usr/local/bin/prted --prtemca ess "env" --prtemca ess_base_nspace "prterun-job-170041437185078023828-yihua-zhou-master-0-40@0" --prtemca ess_base_vpid 2 --prtemca ess_base_num_procs "3" --prtemca prte_hnp_uri "prterun-job-170041437185078023828-yihua-zhou-master-0-40@0.0;tcp://10.248.155.126:37997:32" --prtemca plm_ssh_args "p 6666" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh" --tree-spawn --prtemca prte_parent_uri "prterun-job-170041437185078023828-yihua-zhou-master-0-40@0.0;tcp://10.248.155.126:37997:32"]
ssh: Could not resolve hostname p: Name or service not known
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] daemon 2 failed with status 255
[job-170041437185078023828-yihua-zhou-master-0:00040] [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] plm:base:receive stop comm
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,0] on node job-170041437185078023828-yihua-zhou-master-0
  Remote daemon: [prterun-job-170041437185078023828-yihua-zhou-master-0-40@0,2] on node job-170041402698701665080-yihua-zhou-worker-0

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

SimZhou (Author) commented Nov 19, 2023

It seems ssh does not accept '--'-prefixed long options at all:

$ ssh --port
unknown option -- -
usage: ssh [-46AaCfGgKkMNnqsTtVvXxYy] [-b bind_address] [-c cipher_spec]
           [-D [bind_address:]port] [-E log_file] [-e escape_char]
           [-F configfile] [-I pkcs11] [-i identity_file]
           [-J [user@]host[:port]] [-L address] [-l login_name] [-m mac_spec]
           [-O ctl_cmd] [-o option] [-p port] [-Q query_option] [-R address]
           [-S ctl_path] [-W host:port] [-w local_tun[:remote_tun]]
           [user@]hostname [command]

rhc54 (Contributor) commented Nov 19, 2023

Right, this is a bug in the command-line parser. I'll take a look at it.
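
As a toy illustration only (this is not the actual prrte code, and normalize_args is a made-up name): a normalizer that strips a token's dashes and always re-prefixes with "--" would reproduce exactly the behavior in the logs above, mangling "-p" into "--p" while leaving a dash-less "p" untouched:

```python
def normalize_args(raw: str) -> list[str]:
    # Buggy normalization: any token starting with '-' has its dashes
    # stripped and is re-prefixed with '--', so the valid short option
    # "-p" becomes the invalid "--p" that ssh then rejects.
    out = []
    for tok in raw.split():
        if tok.startswith("-"):
            out.append("--" + tok.lstrip("-"))
        else:
            out.append(tok)
    return out

print(normalize_args("-p 6666"))  # ['--p', '6666']: the doubled dash from the first log
print(normalize_args("p 6666"))   # ['p', '6666']: no dash added, matching the second log
```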

rhc54 (Contributor) commented Nov 19, 2023

Fixed in openpmix/prrte#1859

janjust (Contributor) commented Jan 18, 2024

Fixed, closing

janjust closed this as completed Jan 18, 2024