Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Docker] (Kubeflow integration) Add chmod --recursive 777 /home/ray to Ray Dockerfile. #31563

Merged

Conversation

kevin85421
Copy link
Member

@kevin85421 kevin85421 commented Jan 10, 2023

Signed-off-by: kaihsun kaihsun@anyscale.com

Why are these changes needed?

This issue is a part of the integration between KubeRay and Kubeflow. See ray-project/kuberay#750 (comment) for some context.

In KinD (a Kubernetes distribution) cluster, when we use the command kubectl exec .. to log in to a ray Pod, the UID will be the same as $RAY_UID (i.e. 1000) in base-deps/Dockerfile.

ARG RAY_UID=1000
ARG RAY_GID=100
RUN apt-get update -y \
&& apt-get install -y sudo tzdata \
&& useradd -ms /bin/bash -d /home/ray ray --uid $RAY_UID --gid $RAY_GID \
&& usermod -aG sudo ray \
&& echo 'ray ALL=NOPASSWD: ALL' >> /etc/sudoers \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean
USER $RAY_UID

For OpenShift (a Kubernetes distribution), a random non-root UID will be used when we log in to a ray Pod. However, only UID=1000 has the write access of /home/ray. Therefore, the error message of Permission denied will be reported. As follows, only ray (UID = 1000) has rwx (read, write, execute) access to /home/ray. Others only have r-x (read & execute) access to /home/ray.

> ls -l /home/
drwxr-xr-x 1 ray users 4096 Dec  7 17:18 ray

To reproduce it, we can follow instructions in pod-security.md, and add runAsUser and runAsGroup to the securityContext of ray-head in ray-cluster.pod-security.yaml.

 securityContext:
    runAsUser: 1001190000
    runAsGroup: 0
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault 

After the RayCluster is ready, use kubectl exec ... to log in to the head Pod. Next, execute the following commands.

(base) I have no name!@raycluster-pod-security-head-nlfmj:~$ id
uid=1001190000 gid=0(root) groups=0(root)
(base) I have no name!@raycluster-pod-security-head-nlfmj:~$ pwd
/home/ray
(base) I have no name!@raycluster-pod-security-head-nlfmj:~$ touch 123
touch: cannot touch '123': Permission denied

Related issue number

Closes #30959

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Test

  • Step 1: Build Docker images.
# Download wheels
export RAY_HASH=f1b8bfd219a98da69a612558cbae32441d98ca60
export RAY_VERSION=2.2.0
./release/util/download_wheels.sh # Modify this file, and comment out unused wheels.

# Move wheels to .whl/
mv *.whl .whl

# I do not know what is this, but "build-docker-images.py" asks me to run this command.
pip download --python-version 3.8 ray==2.2.0 --only-binary=:all:

# Build Docker images
python ./ci/build/build-docker-images.py --py-versions py37 --device-types cpu --build-type HUMAN --build-base
  • Step 2: Update Docker images in ray-cluster.pod-security.yaml to point to the images built in Step 1. Then, load the images into the Kind cluster and follow the instructions in the section "Why are these changes needed?".

Screen Shot 2023-01-11 at 1 18 42 AM

@kevin85421
Copy link
Member Author

Hi @ijrsvt,

Would you mind giving me some pointers?

Q1: I hoped to build a docker image to test this PR, and I ran the following command.

➜  ray git:(kubeflow-openshift-docker) python ./ci/build/build-docker-images.py --py-versions py37 --device-types cpu --build-type LOCAL --build-base
Building the following python versions:  ['3.7']
Building images for the following devices:  ['cpu']
Building base images:  True
Cleaned up 0.0MB
INVALID SHA FOUND
BUILT:  rayproject/base-deps:nightly-py37-cpu
Traceback (most recent call last):
  File "./ci/build/build-docker-images.py", line 721, in <module>
    py_versions, image_types, args.base
  File "./ci/build/build-docker-images.py", line 345, in build_or_pull_base_images
    build_base_images(py_versions, image_types)
  File "./ci/build/build-docker-images.py", line 319, in build_base_images
    build_for_all_versions("ray-deps", py_versions, image_types, no_cache=False)
  File "./ci/build/build-docker-images.py", line 313, in build_for_all_versions
    image_name, py_version=py_version, image_type=image_type, **kwargs
  File "./ci/build/build-docker-images.py", line 211, in _build_docker_image
    wheel = _get_wheel_name(build_args["PYTHON_MINOR_VERSION"])
  File "./ci/build/build-docker-images.py", line 134, in _get_wheel_name
    f"Found ({len(matches)}) matches for 'ray-*{PYTHON_WHL_VERSION}"
AssertionError: Found (0) matches for 'ray-*cp37*-manylinux*' instead of 1.
wheel matches: []

However, there is no .whl directory in the root directory of the ray repository.

Q2: The script build-docker-images.py seems to build a lot of images in one command. Can I build only 1 Docker image for testing?

Thank you!

@kevin85421
Copy link
Member Author

cc @DmitriGekhtman @juliusvonkohout This PR is not ready to review. I will request reviews when it is ready. Thanks!

@ijrsvt
Copy link
Contributor

ijrsvt commented Jan 11, 2023

Try --build-type=HUMAN

I'm not sure if there is a way to only build one version :/

@kevin85421 kevin85421 changed the title (WIP) [Docker] (Kubeflow integration) Only UID=1000 has the write access of /home/ray in Ray images [Docker] (Kubeflow integration) Only UID=1000 has the write access of /home/ray in Ray images Jan 11, 2023
@kevin85421 kevin85421 marked this pull request as ready for review January 11, 2023 09:19
@kevin85421
Copy link
Member Author

Try --build-type=HUMAN

I'm not sure if there is a way to only build one version :/

Thank @ijrsvt!

I built Docker images on my devbox successfully by the following instructions:

# Download wheels
export RAY_HASH=f1b8bfd219a98da69a612558cbae32441d98ca60
export RAY_VERSION=2.2.0
./release/util/download_wheels.sh # Modify this file, and comment out unused wheels.

# Move wheels to .whl/
mv *.whl .whl

# I do not know what is this, but "build-docker-images.py" asks me to run this command.
pip download --python-version 3.8 ray==2.2.0 --only-binary=:all:

# Build Docker images
python ./ci/build/build-docker-images.py --py-versions py37 --device-types cpu --build-type HUMAN --build-base

# STDOUT
Building the following python versions:  ['3.7']
Building images for the following devices:  ['cpu']
Building base images:  True
Provide a 'branch name'. For releases, it should be `releases/x.x.x`
Provide a SHA (used for tag value)
Please download images using:
`pip download --python-version <py_version> ray==<ray_version>
Cleaned up 0.0MB
BUILT:  rayproject/base-deps:nightly-py37-cpu
Cleaned up 0.0MB
BUILT:  rayproject/ray-deps:nightly-py37-cpu
Cleaned up 985.0938501358032MB
BUILT:  rayproject/ray:nightly-py37-cpu
python/requirements.txt
python/requirements_test.txt
python/requirements_linters.txt
python/requirements/compat/requirements_legacy_compat.txt
python/requirements/compat/requirements_py36_compat.txt
python/requirements/data_processing/requirements.txt
python/requirements/data_processing/requirements_dataset.txt
python/requirements/ml/requirements_ml_docker.txt
python/requirements/ml/requirements_tune.txt
python/requirements/ml/requirements_train.txt
python/requirements/ml/requirements_rllib.txt
python/requirements/ml/requirements_upstream.txt
python/requirements/ml/requirements_dl.txt
python/ray/tune/requirements-dev.txt
python/ray/tests/project_files/requirements_project/requirements.txt
python/ray/tests/project_files/session-tests/invalid-config-fail/ray-project/requirements.txt
python/ray/tests/project_files/session-tests/git-repo-pass/ray-project/requirements.txt
python/ray/tests/project_files/session-tests/project-pass/ray-project/requirements.txt
python/ray/tests/project_files/project1/requirements.txt
python/ray/util/collective/requirements.txt
Cleaned up 130.6116771697998MB
Still building rayproject/ray-ml:nightly-py37-cpu after 300 seconds
Still building rayproject/ray-ml:nightly-py37-cpu after 603 seconds
BUILT:  rayproject/ray-ml:nightly-py37-cpu

@kevin85421
Copy link
Member Author

@juliusvonkohout would you mind reviewing this PR? I cannot send a review request to you via GitHub.

@juliusvonkohout
Copy link

I think chmod --recursive 777 /home/ray might be better to really be sure that it affects all subfolders

docker/base-deps/Dockerfile Outdated Show resolved Hide resolved
@kevin85421
Copy link
Member Author

owever, only UID=1000 has the write access of /home/ray

@juliusvonkohout Good catch! I have updated the PR.

  • chmod... -> chmod --recursive ...
  • Originally, chmod is executed before the directory /home/ray/anaconda3 creation. Hence, its permission is not 777. To solve this, I move the chmod command to the end of the Dockerfile.

Screen Shot 2023-01-11 at 7 32 12 AM

@kevin85421 kevin85421 force-pushed the kubeflow-openshift-docker branch from 579cd33 to 8a2ea31 Compare January 11, 2023 15:48
@juliusvonkohout
Copy link

Yes, that should be fine since you do almost everything in this single RUN command. Just for the future you could easily squash the layers of the OCI image instead of writing everything in a single command. But that is another topic.

docker/base-deps/Dockerfile Outdated Show resolved Hide resolved
Copy link
Contributor

@ijrsvt ijrsvt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick comment update, but otherwise LGTM

@kevin85421
Copy link
Member Author

Is this PR ready to be merged? cc @ijrsvt @DmitriGekhtman

@DmitriGekhtman
Copy link
Contributor

Looks good to me, but check the failed py39 docker-build step.

@kevin85421
Copy link
Member Author

Looks good to me, but check the failed py39 docker-build step.

I cannot connect to buildkite now. I will check the error message when it becomes available.

@juliusvonkohout
Copy link

juliusvonkohout commented Jan 16, 2023

maybe a rebase to master is needed.

kevin85421 and others added 3 commits January 17, 2023 16:42
Signed-off-by: kaihsun <kaihsun@anyscale.com>
Signed-off-by: kaihsun <kaihsun@anyscale.com>
Co-authored-by: Ian Rodney <ian.rodney@gmail.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
@kevin85421 kevin85421 force-pushed the kubeflow-openshift-docker branch from 6afae4b to 24ac070 Compare January 17, 2023 16:44
@kevin85421
Copy link
Member Author

maybe a rebase to master is needed.

The failed py39 docker-build step is fixed after I rebased to master!

@DmitriGekhtman DmitriGekhtman merged commit 1a8f330 into ray-project:master Jan 18, 2023
@DmitriGekhtman DmitriGekhtman changed the title [Docker] (Kubeflow integration) Only UID=1000 has the write access of /home/ray in Ray images [Docker] (Kubeflow integration) Add chmod --recursive 777 /home/ray to Ray Dockerfile. Jan 18, 2023
@DmitriGekhtman
Copy link
Contributor

I think PR titles should indicate the change that was made rather than the problem that was solved -- I've fixed the name for this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Docker] (Kubeflow integration) Only UID=1000 has the write access of /home/ray in Ray images
4 participants