added port argument for ssh#4117
Conversation
deepspeed/launcher/runner.py
Outdated
| else: | ||
| cmd = runner.get_cmd(env, active_resources) | ||
|
|
||
| env["PDSH_SSH_ARGS_APPEND"] += f" -p {args.port} " |
There was a problem hiding this comment.
Could this be moved to this part of the code?
There was a problem hiding this comment.
I agree w. this suggestion from @loadams. The deepspeed launcher supports different multi-node backends, pdsh is the default and most common but it would be good to move the PDSH_SSH_ARGS_APPEND variable into the PDSHRunner class inside multinode_runner.py.
There was a problem hiding this comment.
Thank you for your feedback. I've moved this part of the code into the PDSHRunner class as specified, and made the necessary changes.
deepspeed/launcher/runner.py
Outdated
| first_host = list(active_resources.keys())[0] | ||
| try: | ||
| subprocess.check_call(f'ssh -o PasswordAuthentication=no {first_host} hostname', | ||
| subprocess.check_call(f'ssh -o PasswordAuthentication=no -p {args.port} {first_host} hostname', |
There was a problem hiding this comment.
I think we only want to pass -p {args.port} iff --ssh_port is set. In many cases we have users setting up a ssh config that sets specific port/connection parameters. If we force passing -p 22 as the default then this will break a ssh config host that may have a port that is not 22.
There was a problem hiding this comment.
Thank you for your feedback. I've added an if statement to make sure the -p is only added on when args.ssh_port is specified.
|
@microsoft-github-policy-service agree |
|
Hi @Hiromasa-H - FYI, you'll have to update test_pdsh_runner as well since it needs the changes from here. |
deepspeed/launcher/runner.py
Outdated
| "numbers and range. i.e. 1,3-5,7 => [1,3,4,5,7]. When not " | ||
| "specified, all cores on system would be used rank binding") | ||
|
|
||
| parser.add_argument("--ssh_port", type=int, default=22, help="SSH port to use for remote connections") |
There was a problem hiding this comment.
I think the later checks for if args.ssh_port is not None will never trigger since the default here is 22. Can we set the default to None?
|
Hi @Hiromasa-H - apologies for the slow review, wanted to test one thing on our side first, but will review this today. Thanks for your patience. |
mrwyattii
left a comment
There was a problem hiding this comment.
Let's avoid copying code with subprocess.check_call (see my suggestion).
Otherwise, LGTM
|
@loadams Sorry, I did not intend to rush you. Thank you for your taking your time with a thorough review. |
This is the PR for issue #4116.