Enable Passwordless ssh connection for Multinodes docker #61
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.
7dcd1c5 to
92b7e3c
Compare
tylertitsworth
left a comment
Please run our pre-commit hooks over your files before you commit. This will make the pre-commit check pass, since the check can't write to your fork.
Also, please see the linter errors reported by the lint check.
If you're going to add some expected functionality, like SSH communication, you need to test that communication. Add a new test in pytorch/tests/tests.yaml.
pytorch/Dockerfile
Outdated
    && echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config

    EXPOSE ${SSHD_PORT}
    CMD ["/usr/sbin/sshd", "-D"]
Any command passed to docker run will override this CMD, for example docker run ipex-multinode:latest my-command. Instead, this should be run either as an entrypoint or, better, from ~/.bashrc.
This will certainly be overridden by another CMD command.
I'd go with ENTRYPOINT:
ENTRYPOINT service ssh start && bash
addressed accordingly
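The CMD-versus-ENTRYPOINT point in this thread can be sketched as a small entrypoint helper: start the SSH daemon first, then hand control to whatever command was passed to docker run, falling back to bash. This is only a minimal sketch and not the PR's actual implementation; the function name start_sshd_then_run and the SSHD_START_CMD override are hypothetical and exist purely for illustration.

```shell
# Hypothetical entrypoint helper (illustrative names, not taken from the PR).
start_sshd_then_run() {
    # SSHD_START_CMD is an assumed override so the daemon start command can be
    # swapped out; it defaults to the command suggested in the review.
    ${SSHD_START_CMD:-service ssh start} || return 1

    # With no arguments, fall back to an interactive shell.
    if [ "$#" -eq 0 ]; then
        set -- bash
    fi

    # Replace this shell with the requested command so it receives signals.
    exec "$@"
}
```

An entrypoint script built this way would end with start_sshd_then_run "$@", so that docker run ipex-multinode:latest my-command starts sshd and then runs my-command instead of replacing the daemon start.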
pytorch/Dockerfile
Outdated
    RUN chown intel /etc/ssh/ssh_host_ecdsa_key && \
        chown intel /etc/ssh/ssh_host_rsa_key && \
        chown intel /etc/ssh/ssh_host_ed25519_key
Instead of using the already-generated keys, you should generate random keys on startup with this script: https://github.com/intel/transfer-learning/blob/main/docker/hf_k8s/generate_ssh_keys.sh
understand. will follow the instructions
addressed accordingly
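Generating keys at container startup, rather than baking them into the image, could look roughly like the helper below. This is a sketch under stated assumptions and is not the content of the linked generate_ssh_keys.sh script; the function name generate_key_if_missing and the choice of an ed25519 key are illustrative.

```shell
# Hypothetical helper: create an SSH key pair only if it does not exist yet.
generate_key_if_missing() {
    key_file="$1"

    # Keep an existing key so restarts do not rotate it unexpectedly.
    if [ -f "$key_file" ]; then
        return 0
    fi

    mkdir -p "$(dirname "$key_file")" || return 1

    # -q: quiet, -t: key type, -N "": empty passphrase, -f: output file.
    ssh-keygen -q -t ed25519 -N "" -f "$key_file"
}
```

For host keys specifically, ssh-keygen -A generates any default host keys that are missing, which may be enough if the image ships without them.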
pytorch/Dockerfile
Outdated
    RUN echo "cp "$SSH_KEY_PATH"/id_rsa /home/intel/.ssh/" >> ~/.bashrc \
        && echo "cp "$SSH_KEY_PATH"/id_rsa.pub /home/intel/.ssh/" >> ~/.bashrc \
        && echo "chmod 600 /home/intel/.ssh/id_rsa" >> ~/.bashrc \
        && echo "chown intel /home/intel/.ssh/id_rsa" >> ~/.bashrc \
        && echo "chown intel /home/intel/.ssh/id_rsa.pub" >> ~/.bashrc \
        && echo "if [ ! -f /home/intel/.ssh/authorized_keys ]; then" >> ~/.bashrc \
        && echo " cat /home/intel/.ssh/id_rsa.pub > /home/intel/.ssh/authorized_keys" >> ~/.bashrc \
        && echo "fi" >> ~/.bashrc
You could copy the existing config from root instead of re-writing this code
could you explain more? thanks
cp ~/.bashrc /home/intel/.bashrc && chown ... && chmod ...
Replace all instances of /home/intel with ~.
Understood, but I don't want to also bring local changes from ~/.bashrc into the Docker instance. I might just stay with appending the new code to the bashrc file in the Docker instance.
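The key-installation steps that the diff appends to ~/.bashrc can also be read as one small function, which may be easier to review than a chain of echo commands. The sketch below mirrors those steps under a few assumptions: the name install_user_keys is hypothetical, the chown calls from the diff are left out because they need root, and the chmod 700 and chmod 600 on the directory and on authorized_keys are additions that are not in the diff.

```shell
# Hypothetical helper mirroring the commands appended to ~/.bashrc in the diff.
install_user_keys() {
    src_dir="$1"    # where the key pair is provided (SSH_KEY_PATH in the PR)
    ssh_dir="$2"    # target directory, for example /home/intel/.ssh

    mkdir -p "$ssh_dir" || return 1
    chmod 700 "$ssh_dir" || return 1

    # Copy the key pair and restrict the private key to its owner.
    cp "$src_dir/id_rsa" "$src_dir/id_rsa.pub" "$ssh_dir/" || return 1
    chmod 600 "$ssh_dir/id_rsa" || return 1

    # Seed authorized_keys only when it does not exist yet, as in the diff.
    if [ ! -f "$ssh_dir/authorized_keys" ]; then
        cat "$ssh_dir/id_rsa.pub" > "$ssh_dir/authorized_keys" || return 1
        chmod 600 "$ssh_dir/authorized_keys" || return 1
    fi
}
```

Because authorized_keys is only written when missing, running the function on every shell start leaves an existing file untouched.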
pytorch/docker-compose.yaml
Outdated
    args:
      SSH_KEY_PATH: ${SSH_KEY_PATH}
If you specify an arg, make sure it has a default value.
addressed accordingly
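The request for a default value can be illustrated with the ${VARIABLE:-default} form, which Compose file interpolation accepts in the same way the shell does. The sketch below shows the fallback behavior in shell; the function name resolve_ssh_key_path and the default of $HOME/.ssh are assumptions for illustration, not the value chosen in the PR.

```shell
# Hypothetical helper: resolve SSH_KEY_PATH with a fallback default.
resolve_ssh_key_path() {
    # "$HOME/.ssh" is an assumed default chosen for illustration only.
    # The :- form also applies the default when the variable is set but empty.
    printf '%s\n' "${SSH_KEY_PATH:-$HOME/.ssh}"
}
```

With this form the build no longer depends on the caller exporting SSH_KEY_PATH first, which is the failure mode the review comment is guarding against.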
tylertitsworth
left a comment
- Make all the checks green
- Add a test to pytorch/tests/tests.yaml for SSH so we know it's working.
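One possible shape for the command such an SSH test could run is a non-interactive login check. This is a sketch only: it does not show the tests.yaml schema, which is not reproduced in this thread, and the function name check_passwordless_ssh and the SSH_BIN override are hypothetical.

```shell
# Hypothetical smoke check; SSH_BIN is an assumed override for illustration.
check_passwordless_ssh() {
    host="${1:-localhost}"

    # BatchMode=yes disables password prompts, so the command fails instead of
    # hanging when key-based login is not set up. ConnectTimeout bounds the wait.
    "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$host" true
}
```

The function returns the exit status of ssh, so a test runner can treat a non-zero status as a failed passwordless login.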
0997dc6 to
598df1b
Compare
9bf98c2 to
0997dc6
Compare
020af92 to
5fe8477
Compare
@tylertitsworth
You have an issue with the idp build of this container: https://github.com/intel/ai-containers/actions/runs/9474288343/job/26103601878#step:5:4409 And your lint is failing due to hadolint warnings.
@louie-tsai perhaps it would just be easier to get your feedback on #124 and merge that one. Can you try testing with the image via
I don't see ssh client key handling there. have you tested it on bare metal without k8s?
@louie-tsai users can mount the key at runtime, that is the safest and intended way to communicate between nodes via docker.
This is the environment that is built when
No, the weekly testing builds pytorch fine. https://github.com/intel/ai-containers/actions/runs/9432405319/job/26043878040
don't understand why we need to run apt-get inside conda environment. also didn't face the issue on our SPR machine.
only saw below warning related to multinode session. Do we need to fix all warning?
if they have 100 nodes, do they need to mount key 100 times?
We ignore some warnings in https://github.com/intel/ai-containers/blob/main/.github/linters/.hadolint.yaml, the lint error clearly says you only need to fix the
Yes, I would assume it would be automated from some blob storage. This is exactly how it works in k8s.
fixed the idp build issue accordingly
@louie-tsai let's move this discussion to #124
Hi @louie-tsai, this PR has been open a while and might be stale. Do you want us to close the PR or keep it open for you?
Description
A passwordless SSH connection is needed for multi-node runs.
The changes in this PR are needed for MPI initialization to succeed across multiple nodes.
Related Issue
NA
Changes Made
Validation
test_runner.py with all existing tests passing, and I have added new tests where applicable.