
[Core] Upgrade ray to 2.9.3 to support python 3.11 #3248

Merged (35 commits) on Feb 29, 2024

Conversation

Michaelvll (Collaborator) commented Feb 28, 2024

This unblocks users whose Docker images or custom cloud images have Python 3.11 in the base environment.

#2801 should be a more robust solution, but given the overhead it could introduce, we can get this PR in first.

TODO:

  • When cancelling a task, ray prints this error in the job output:
    (sky-cmd, pid=4965) hi
    (raylet) WARNING: 8 PYTHON worker processes have been started on node: 22029db2cf6bf02eadaf84cdd402beaef9a1795321880974f43b18ca with address: 172.31.84.137. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
    *** SIGTERM received at time=1709110183 on cpu 0 ***
    PC: @     0x7f22688eead8  (unknown)  pthread_cond_timedwait@@GLIBC_2.3.2
        @     0x7f22688f3140  (unknown)  (unknown)
        @ ... and at least 1 more frames
    [2024-02-28 08:49:43,497 E 4932 4932] logging.cc:361: *** SIGTERM received at time=1709110183 on cpu 0 ***
    [2024-02-28 08:49:43,497 E 4932 4932] logging.cc:361: PC: @     0x7f22688eead8  (unknown)  pthread_cond_timedwait@@GLIBC_2.3.2
    [2024-02-28 08:49:43,497 E 4932 4932] logging.cc:361:     @     0x7f22688f3140  (unknown)  (unknown)
    [2024-02-28 08:49:43,497 E 4932 4932] logging.cc:361:     @ ... and at least 1 more frames
    Connection to localhost closed.
    
  • backward compatibility

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test-ray-2.9 --cloud aws --cpus 2 "python --version" --image-id docker:continuumio/miniconda3
    • sky exec test-ray-2.9 "for i in \`seq 1 140\`; do echo \$i; sleep 1; done"
    • for i in `seq 1 100`; do sky exec -d test-ray-2.9 "echo hi; sleep 10000; echo bye"; done
    • sky queue
    • sky launch -c test-mn --num-nodes 4 --cloud aws --cpus 2 --image-id docker:continuumio/miniconda3 echo \$SKYPILOT_NODE_RANK
    • sky launch -c test-azure --cloud azure --cpus 2 --num-nodes 4 "echo hi; sleep 1000"
    • sky launch -c test-fluid --cloud fluidstack --gpus V100 nvidia-smi
    • sky launch -c test-fluid --cloud fluidstack --gpus V100 --num-nodes 4 nvidia-smi
    • sky launch -c test-tqdm test.yaml
      resources:
        cpus: 2+
      
      setup: |
        pip install tqdm
      
      run: |
        python <<-EOF
        import tqdm, time
        for i in tqdm.tqdm(range(10000000)):
            time.sleep(0.5)
        EOF
  • All smoke tests: pytest tests/test_smoke.py
  • All smoke tests: pytest tests/test_smoke.py --aws
  • All smoke tests: pytest tests/test_smoke.py --kubernetes
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

concretevitamin (Member) left a comment

Awesome @Michaelvll, thanks for the comprehensive upgrades + testing! Some questions.

sky/backends/backend_utils.py (outdated, resolved)
@@ -3139,7 +3135,8 @@ def _exec_code_on_head(
f'{cd} && ray job submit '
'--address=http://127.0.0.1:$RAY_DASHBOARD_PORT '
f'--submission-id {job_id}-$(whoami) --no-wait '
f'"{executable} -u {script_path} > {remote_log_path} 2>&1"')
# Redirect stderr to /dev/null to avoid distracting error from ray.
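The redirection discussed in this thread can be sketched as a plain shell pattern (the commands here are illustrative stand-ins, not SkyPilot's actual job command): stdout goes to the log file, while stderr, which carries ray's cancellation noise, is discarded.

```shell
# Illustrative only: a command that writes real output to stdout and noise
# (like ray's SIGTERM traces) to stderr. With `> "$log_file" 2>/dev/null`,
# only stdout reaches the log file; the stderr noise is dropped.
log_file=$(mktemp)
{ echo "job output"; echo "SIGTERM noise" >&2; } > "$log_file" 2>/dev/null
cat "$log_file"
```

This is the same shape as appending `2> /dev/null` inside the quoted command passed to `ray job submit`.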
concretevitamin (Member):
Do you mean

(sky-cmd, pid=4965) hi
(raylet) WARNING: 8 PYTHON worker processes have been started on node: 22029db2cf6bf02eadaf84cdd402beaef9a1795321880974f43b18ca with address: 172.31.84.137. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
*** SIGTERM received at time=1709110183 on cpu 0 ***
PC: @     0x7f22688eead8  (unknown)  pthread_cond_timedwait@@GLIBC_2.3.2
    @     0x7f22688f3140  (unknown)  (unknown)
    @ ... and at least 1 more frames
[2024-02-28 08:49:43,497 E 4932 4932] logging.cc:361: *** SIGTERM received at time=1709110183 on cpu 0 ***
[2024-02-28 08:49:43,497 E 4932 4932] logging.cc:361: PC: @     0x7f22688eead8  (unknown)  pthread_cond_timedwait@@GLIBC_2.3.2
[2024-02-28 08:49:43,497 E 4932 4932] logging.cc:361:     @     0x7f22688f3140  (unknown)  (unknown)
[2024-02-28 08:49:43,497 E 4932 4932] logging.cc:361:     @ ... and at least 1 more frames
Connection to localhost closed.

?

What if there's some genuine errors being printed to stderr?

Michaelvll (Collaborator, Author):

Yes, we are redirecting stderr to get rid of those messages.

@@ -623,7 +620,6 @@ def add_epilogue(self) -> None:
'return code list:{colorama.Style.RESET_ALL}',
returncodes,
reason,
file=sys.stderr,
concretevitamin (Member):
Q: What's the reason we previously printed these error logs to stderr but now print to stdout?

Michaelvll (Collaborator, Author):

The reason we redirect the output to stdout is that the latest ray prints a bunch of useless output to stderr when a job is cancelled, as shown in the PR description; we now redirect stderr to /dev/null to get rid of those outputs.

I don't think we had a particular reason to print these errors to stderr originally.
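The change in the hunk above can be illustrated with a minimal Python sketch (the function and message are made up for illustration, not SkyPilot's actual code): dropping `file=sys.stderr` makes `print` fall back to its default of `sys.stdout`, so the message survives now that the job's stderr is discarded.

```python
def report_returncodes(returncodes, reason):
    # Before: print(..., file=sys.stderr) -- that output is lost once the
    # job's stderr is sent to /dev/null.
    # After: the default file is sys.stdout, so the message stays in the log.
    print(f'return code list: {returncodes}', reason)

report_returncodes([1, 0], 'command failed')
```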

sky/backends/cloud_vm_ray_backend.py (outdated, resolved)
sky/setup_files/setup.py (resolved)
sky/skylet/constants.py (resolved)
sky/clouds/kubernetes.py (outdated, resolved)
Comment on lines 271 to 272
'RAY_LOG_TO_DRIVER_EVENT_LEVEL="FATAL" '
'RAY_BACKEND_LOG_LEVEL="FATAL" '
concretevitamin (Member):

We need a comment somewhere explaining why the default log level produces undesirable outputs. (Which is odd; is this a bug in the latest Ray? Normally outputs should be useful at the default logging level.)

Michaelvll (Collaborator, Author):

After we redirect stderr, these log-level overrides no longer seem needed. Just removed : )
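For reference, the two variables in the removed hunk are per-process environment overrides, so prefixing them onto a single command only affects that invocation (the `sh -c` command below is just a stand-in for starting ray):

```shell
# Set the ray logging env vars for one command only; they do not leak into
# the surrounding shell session.
RAY_LOG_TO_DRIVER_EVENT_LEVEL="FATAL" RAY_BACKEND_LOG_LEVEL="FATAL" \
    sh -c 'echo "RAY_BACKEND_LOG_LEVEL=$RAY_BACKEND_LOG_LEVEL"'
```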

sky/provision/instance_setup.py (resolved)
@@ -255,6 +255,7 @@ def __init__(
self.event_callback = event_callback
# Ignore type error due to a mypy bug.
# https://github.com/python/mypy/issues/3004
self._num_nodes = 1
concretevitamin (Member):
Why is this needed for this upgrade?

Michaelvll (Collaborator, Author):

It is for the mypy issue. Removed the stale comments : )

Michaelvll mentioned this pull request on Feb 29, 2024.
concretevitamin (Member) left a comment

LGTM

sky/provision/instance_setup.py (outdated, resolved)
sky/skylet/constants.py (outdated, resolved)
f'skypilot-{_sky_version}*.whl)[{{cloud}}, remote]" && '
'echo "{sky_wheel_hash}" > ~/.sky/wheels/current_sky_wheel_hash || '
'exit 1; }; '
f'{RAY_STATUS} || '
concretevitamin (Member):
Add a pointer to those comments too?

# NOTE: this will only work for the cluster with ray cluster on our latest
# ray port 6380, but those existing cluster launched before #1790 that has
# ray cluster on the default port 6379 will be upgraded and restarted.
f'{RAY_STATUS} || {{ pip3 list | grep "ray " | '
concretevitamin (Member):

We need some comments here like, "We do this guard to avoid any Ray client-server version mismatch. Specifically: If existing ray cluster is an older version say 2.4, and we pip install new version say 2.9 wheels here, then subsequent sky exec (ray job submit) will have v2.9 vs. 2.4 mismatch, similarly this problem exists for sky status -r (ray status)."
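The guard being requested can be sketched as a small shell function (the function name and target version are assumptions for illustration, not the actual generated setup command): compare the ray version already on the machine against the one the wheel would install, and only reinstall on mismatch, so a long-lived ray cluster never sees a client/server version skew.

```shell
# Hypothetical sketch: skip the wheel install when the installed ray already
# matches the expected version, so later `ray job submit` / `ray status`
# calls never hit a client/server mismatch (e.g. local 2.9 vs. cluster 2.4).
maybe_install_ray() {
    installed="$1"   # e.g. from: pip3 list | awk '$1 == "ray" {print $2}'
    expected="$2"
    if [ "$installed" = "$expected" ]; then
        echo "ray $installed already matches; skipping reinstall"
    else
        echo "installing ray==$expected (found: ${installed:-none})"
    fi
}

maybe_install_ray "2.4.0" "2.9.3"
```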

Michaelvll merged commit b31c2a5 into master on Feb 29, 2024 (19 checks passed).
Michaelvll deleted the update-ray-version-to-support-py311 branch on February 29, 2024 10:00.
2 participants