Skip to content

Conversation

@louie-tsai
Copy link

@louie-tsai louie-tsai commented May 29, 2024

Description

passwordless ssh connection is needed for multinodes runs.
In order to have successful mpi init over multiple nodes, changes in the PR are needed.

Related Issue

NA

Changes Made

  • The code follows the project's coding standards.
  • No Intel Internal IP is present within the changes.
  • The documentation has been updated to reflect any changes in functionality.

Validation

  • I have tested any changes in container groups locally with test_runner.py with all existing tests passing, and I have added new tests where applicable.

@tylertitsworth tylertitsworth changed the base branch from main to tylertitsworth/fork-pr-target May 29, 2024 23:15
tylertitsworth

This comment was marked as outdated.

@tylertitsworth tylertitsworth changed the base branch from tylertitsworth/fork-pr-target to main May 29, 2024 23:26
@github-actions
Copy link

github-actions bot commented Jun 4, 2024

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

OpenSSF Scorecard

PackageVersionScoreDetails

Scanned Manifest Files

Copy link
Contributor

@tylertitsworth tylertitsworth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please run our pre-commit over your files before you commit, this will make the pre-commit check pass since it can't write to your fork.

Also, please see the linter errors introduced by the lint check.

If you're going to add some expected functionality, like ssh communication, you need to test that communication. Add a new test in pytorch/tests/tests.yaml

&& echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config

EXPOSE ${SSHD_PORT}
CMD ["/usr/sbin/sshd", "-D"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any command will override this docker run ipex-multinode:latest my-command, instead this should be run as either an entrypoint or better in ~/.bashrc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will certainly be overridden by another CMD command.
I'd go with ENTRYPOINT:

ENTRYPOINT service ssh start && bash

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed accordingly

Comment on lines 130 to 132
RUN chown intel /etc/ssh/ssh_host_ecdsa_key && \
chown intel /etc/ssh/ssh_host_rsa_key && \
chown intel /etc/ssh/ssh_host_ed25519_key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using the already-generated keys, you should generate random keys on startup with this script: https://github.com/intel/transfer-learning/blob/main/docker/hf_k8s/generate_ssh_keys.sh

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

understand. will follow the instructions

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed accordingly

Comment on lines 136 to 151
RUN echo "cp "$SSH_KEY_PATH"/id_rsa /home/intel/.ssh/" >> ~/.bashrc \
&& echo "cp "$SSH_KEY_PATH"/id_rsa.pub /home/intel/.ssh/" >> ~/.bashrc \
&& echo "chmod 600 /home/intel/.ssh/id_rsa" >> ~/.bashrc \
&& echo "chown intel /home/intel/.ssh/id_rsa" >> ~/.bashrc \
&& echo "chown intel /home/intel/.ssh/id_rsa.pub" >> ~/.bashrc \
&& echo "if [ ! -f /home/intel/.ssh/authorized_keys ]; then" >> ~/.bashrc \
&& echo " cat /home/intel/.ssh/id_rsa.pub > /home/intel/.ssh/authorized_keys" >> ~/.bashrc \
&& echo "fi" >> ~/.bashrc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could copy the existing config from root instead of re-writing this code

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you explain more? thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cp ~/.bashrc /home/intel/.bashrc && chown ... && chmod ...
Replace all instances of /home/intel with ~.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

understand, but don't want to also bring some local changes on ~/.bashrc into docker instance. might just stay with appending new codes into bashrc file in the docker instance.

Comment on lines 74 to 75
args:
SSH_KEY_PATH: ${SSH_KEY_PATH}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you specify an arg, make sure it has a default value.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed accordingly

&& echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config

EXPOSE ${SSHD_PORT}
CMD ["/usr/sbin/sshd", "-D"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will certainly be overridden by another CMD command.
I'd go with ENTRYPOINT:

ENTRYPOINT service ssh start && bash

ashahba
ashahba previously approved these changes Jun 7, 2024
Copy link
Contributor

@tylertitsworth tylertitsworth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Make all the checks green
  • Add a test to pytorch/tests/tests.yaml for SSH so we know it's working.

@louie-tsai
Copy link
Author

@tylertitsworth
fixed some pre-commit issues. please help to merge it if it looks good to you.

@tylertitsworth
Copy link
Contributor

@tylertitsworth
fixed some pre-commit issues. please help to merge it if it looks good to you.

You have an issue with the idp build of this container: https://github.com/intel/ai-containers/actions/runs/9474288343/job/26103601878#step:5:4409

And your lint is failing due to hadolint warnings.

@tylertitsworth
Copy link
Contributor

tylertitsworth commented Jun 12, 2024

@louie-tsai perhaps it would just be easier to get your feedback on #124 and merge that one.

Can you try testing with the image via GITHUB_RUN_NUMBER=557 docker compose pull multinode?
We can add a test to verify the ssh support and do some more to verify that the IDP openssl functions properly.

@louie-tsai
Copy link
Author

@tylertitsworth
fixed some pre-commit issues. please help to merge it if it looks good to you.

You have an issue with the idp build of this container: https://github.com/intel/ai-containers/actions/runs/9474288343/job/26103601878#step:5:4409
don't understand why we need to run apt-get inside conda environment. also didn't face the issue on our SPR machine.
"#66 ERROR: process "conda run -n idp /bin/bash -c apt-get update -y && apt-get install -y --no-install-recommends --fix-missing python3-dev gcc libgl1-mesa-glx libglib2.0-0 openssh-server net-tools virtualenv && rm /etc/ssh/ssh_host_key /etc/ssh/ssh_host_key.pub && apt-get clean && rm -rf /var/lib/apt/lists/*" did not complete successfully: exit code: 100"
is it a testing environment issue?

And your lint is failing due to hadolint warnings.

@louie-tsai
Copy link
Author

@louie-tsai perhaps it would just be easier to get your feedback on #124 and merge that one.

Can you try testing with the image via GITHUB_RUN_NUMBER=557 docker compose pull multinode? We can add a test to verify the ssh support and do some more to verify that the IDP openssl functions properly.

I don't see ssh client key handling there. have you tested it on bare metal without k8s?

@tylertitsworth
Copy link
Contributor

tylertitsworth commented Jun 13, 2024

I don't see ssh client key handling there. have you tested it on bare metal without k8s?

@louie-tsai users can mount the key at runtime, that is the safest and intended way to communicate between nodes via docker.

@tylertitsworth
Copy link
Contributor

don't understand why we need to run apt-get inside conda environment. also didn't face the issue on our SPR machine.

This is the environment that is built when PACKAGE_OPTION=idp, it's one of the versions of this multinode image.

is it a testing environment issue?

No, the weekly testing builds pytorch fine. https://github.com/intel/ai-containers/actions/runs/9432405319/job/26043878040

@louie-tsai
Copy link
Author

louie-tsai commented Jun 13, 2024

@tylertitsworth
fixed some pre-commit issues. please help to merge it if it looks good to you.

You have an issue with the idp build of this container: https://github.com/intel/ai-containers/actions/runs/9474288343/job/26103601878#step:5:4409

don't understand why we need to run apt-get inside conda environment. also didn't face the issue on our SPR machine.
"#66 ERROR: process "conda run -n idp /bin/bash -c apt-get update -y && apt-get install -y --no-install-recommends --fix-missing python3-dev gcc libgl1-mesa-glx libglib2.0-0 openssh-server net-tools virtualenv && rm /etc/ssh/ssh_host___key /etc/ssh/ssh_host___key.pub && apt-get clean && rm -rf /var/lib/apt/lists/*" did not complete successfully: exit code: 100"
is it a testing environment issue?

And your lint is failing due to hadolint warnings.

only saw below warning related to multinode session. Do we need to fix all warning?
There are warning for original Dockerfile. How do we pass hasolint for original Dockerfile?
"
Dockerfile:76 DL3006 warning: Always tag the version of an image explicitly
Dockerfile:78 DL3008 warning: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>
Dockerfile:122 SC2027 warning: The surrounding quotes actually unquote this. Remove or escape them.
Dockerfile:134 SC2027 warning: The surrounding quotes actually unquote this. Remove or escape them.
Dockerfile:153 DL3006 warning: Always tag the version of an image explicitly
"

@louie-tsai
Copy link
Author

I don't see ssh client key handling there. have you tested it on bare metal without k8s?

@louie-tsai users can mount the key at runtime, that is the safest and intended way to communicate between nodes via docker.

if they have 100 nodes, do they need to mount key 100 times?

@tylertitsworth
Copy link
Contributor

only saw below warning related to multinode session. Do we need to fix all warning? There are warning for original Dockerfile. How do we pass hasolint for original Dockerfile? " Dockerfile:76 DL3006 warning: Always tag the version of an image explicitly Dockerfile:78 DL3008 warning: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version> Dockerfile:122 SC2027 warning: The surrounding quotes actually unquote this. Remove or escape them. Dockerfile:134 SC2027 warning: The surrounding quotes actually unquote this. Remove or escape them. Dockerfile:153 DL3006 warning: Always tag the version of an image explicitly"

We ignore some warnings in https://github.com/intel/ai-containers/blob/main/.github/linters/.hadolint.yaml, the lint error clearly says you only need to fix the SC2027 errors, which has to do with quoting.

@tylertitsworth
Copy link
Contributor

if they have 100 nodes, do they need to mount key 100 times?

Yes, I would assume it would be automated from some blob storage. This is exactly how it works in k8s.

@louie-tsai
Copy link
Author

@tylertitsworth
fixed some pre-commit issues. please help to merge it if it looks good to you.

You have an issue with the idp build of this container: https://github.com/intel/ai-containers/actions/runs/9474288343/job/26103601878#step:5:4409

don't understand why we need to run apt-get inside conda environment. also didn't face the issue on our SPR machine. "#66 ERROR: process "conda run -n idp /bin/bash -c apt-get update -y && apt-get install -y --no-install-recommends --fix-missing python3-dev gcc libgl1-mesa-glx libglib2.0-0 openssh-server net-tools virtualenv && rm /etc/ssh/ssh_host___key /etc/ssh/ssh_host___key.pub && apt-get clean && rm -rf /var/lib/apt/lists/*" did not complete successfully: exit code: 100" is it a testing environment issue?

And your lint is failing due to hadolint warnings.

only saw below warning related to multinode session. Do we need to fix all warning? There are warning for original Dockerfile. How do we pass hasolint for original Dockerfile? " Dockerfile:76 DL3006 warning: Always tag the version of an image explicitly Dockerfile:78 DL3008 warning: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version> Dockerfile:122 SC2027 warning: The surrounding quotes actually unquote this. Remove or escape them. Dockerfile:134 SC2027 warning: The surrounding quotes actually unquote this. Remove or escape them. Dockerfile:153 DL3006 warning: Always tag the version of an image explicitly "

fixed the idp build issue accordingly

@tylertitsworth
Copy link
Contributor

tylertitsworth commented Jun 13, 2024

@louie-tsai let's move this discussion to #124

@tylertitsworth tylertitsworth added the duplicate This issue or pull request already exists label Jun 18, 2024
@tylertitsworth tylertitsworth added Review and removed duplicate This issue or pull request already exists labels Jun 28, 2024
@tylertitsworth
Copy link
Contributor

Hi @louie-tsai, this PR has been open a while and might be stale. Do you want us to close the PR or keep it open for you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants