[Core] Upgrade ray to 2.9.3 to support python 3.11 #3248
Conversation
Awesome @Michaelvll, thanks for the comprehensive upgrades + testing! Some questions.
@@ -3139,7 +3135,8 @@ def _exec_code_on_head(
f'{cd} && ray job submit '
'--address=http://127.0.0.1:$RAY_DASHBOARD_PORT '
f'--submission-id {job_id}-$(whoami) --no-wait '
f'"{executable} -u {script_path} > {remote_log_path} 2>&1"')
# Redirect stderr to /dev/null to avoid distracting error from ray.
Do you mean the following output?
(sky-cmd, pid=4965) hi
(raylet) WARNING: 8 PYTHON worker processes have been started on node: 22029db2cf6bf02eadaf84cdd402beaef9a1795321880974f43b18ca with address: 172.31.84.137. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
*** SIGTERM received at time=1709110183 on cpu 0 ***
PC: @ 0x7f22688eead8 (unknown) pthread_cond_timedwait@@GLIBC_2.3.2
@ 0x7f22688f3140 (unknown) (unknown)
@ ... and at least 1 more frames
[2024-02-28 08:49:43,497 E 4932 4932] logging.cc:361: *** SIGTERM received at time=1709110183 on cpu 0 ***
[2024-02-28 08:49:43,497 E 4932 4932] logging.cc:361: PC: @ 0x7f22688eead8 (unknown) pthread_cond_timedwait@@GLIBC_2.3.2
[2024-02-28 08:49:43,497 E 4932 4932] logging.cc:361: @ 0x7f22688f3140 (unknown) (unknown)
[2024-02-28 08:49:43,497 E 4932 4932] logging.cc:361: @ ... and at least 1 more frames
Connection to localhost closed.
What if there are genuine errors being printed to stderr?
Yes, we are redirecting stderr to get rid of those messages.
@@ -623,7 +620,6 @@ def add_epilogue(self) -> None:
'return code list:{colorama.Style.RESET_ALL}',
returncodes,
reason,
file=sys.stderr,
Q: What's the reason we previously printed these error logs to stderr, but now print them to stdout?
The reason we redirect the output to stdout is that the latest Ray prints a bunch of useless output to stderr when a job is being cancelled, as shown in the PR description. We now redirect stderr to /dev/null to get rid of those outputs.
I don't think we had a particular reason to print these logs to stderr; it was just the previous convention of printing errors to stderr.
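To make the technique concrete, here is a minimal, hypothetical Python sketch of redirecting only stderr while keeping stdout in the job's log file. This is an illustration of the approach, not the actual SkyPilot code, and the function name is made up:

import subprocess

def run_with_quiet_stderr(cmd: str, log_path: str) -> int:
    # stdout still goes to the job's log file; stderr (e.g. raylet worker
    # warnings and SIGTERM stack traces printed during job cancellation)
    # is discarded so it does not clutter the user-facing output.
    with open(log_path, 'a') as log_f:
        proc = subprocess.run(cmd, shell=True, stdout=log_f,
                              stderr=subprocess.DEVNULL)
    return proc.returncode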
sky/provision/instance_setup.py
'RAY_LOG_TO_DRIVER_EVENT_LEVEL="FATAL" '
'RAY_BACKEND_LOG_LEVEL="FATAL" '
We need comments somewhere explaining why the default log level produces undesirable outputs. (Which is odd; is this a bug in the latest Ray? Normally the output at the default logging level should be useful.)
After we redirect stderr, these log-level settings no longer seem needed. Just removed them :)
@@ -255,6 +255,7 @@ def __init__(
self.event_callback = event_callback
# Ignore type error due to a mypy bug.
# https://github.com/python/mypy/issues/3004
self._num_nodes = 1
Why needed for this upgrade?
It is for the mypy issue. Removed the stale comments :)
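For context, here is a small, hypothetical illustration of the mypy limitation referenced by that comment (https://github.com/python/mypy/issues/3004): when a property's setter accepts a wider type than its getter returns, mypy checks assignments through the property against the getter's return type, so the code needs a concretely typed backing attribute and a type: ignore. The names below are illustrative, not SkyPilot's actual definitions:

from typing import Optional

class Task:
    def __init__(self, num_nodes: Optional[int] = None) -> None:
        self._num_nodes = 1  # backing attribute with a concrete int type
        # mypy checks this assignment against the getter's return type (int),
        # not the setter's parameter type (Optional[int]); see mypy issue 3004.
        self.num_nodes = num_nodes  # type: ignore

    @property
    def num_nodes(self) -> int:
        return self._num_nodes

    @num_nodes.setter
    def num_nodes(self, num_nodes: Optional[int]) -> None:
        self._num_nodes = 1 if num_nodes is None else num_nodes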
LGTM
sky/skylet/constants.py
f'skypilot-{_sky_version}*.whl)[{{cloud}}, remote]" && '
'echo "{sky_wheel_hash}" > ~/.sky/wheels/current_sky_wheel_hash || '
'exit 1; }; '
f'{RAY_STATUS} || '
Add a pointer to those comments too?
sky/skylet/constants.py
# NOTE: this will only work for the cluster with ray cluster on our latest
# ray port 6380, but those existing cluster launched before #1790 that has
# ray cluster on the default port 6379 will be upgraded and restarted.
f'{RAY_STATUS} || {{ pip3 list | grep "ray " | '
We need some comments here, like: "We do this guard to avoid any Ray client-server version mismatch. Specifically: if the existing ray cluster is an older version, say 2.4, and we pip install newer version, say 2.9, wheels here, then subsequent sky exec (ray job submit) will have a 2.9 vs. 2.4 mismatch; the same problem exists for sky status -r (ray status)."
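As a rough illustration of that guard (a hypothetical Python rendering of the shell snippet above, not the actual constants.py code), the idea is to reinstall Ray and restart the cluster only when ray status fails or the installed Ray version does not match the version SkyPilot expects, so the client used by ray job submit / ray status never mismatches the running cluster:

import subprocess

# Hypothetical constant; the real expected version lives in sky/skylet/constants.py.
EXPECTED_RAY_VERSION = '2.9.3'

def ray_needs_reinstall_or_restart() -> bool:
    # Guard against a client/server version mismatch: if the running cluster
    # was started with an older Ray (e.g. 2.4) and a newer Ray (e.g. 2.9) is
    # pip-installed, later `ray job submit` / `ray status` calls would fail.
    status = subprocess.run(['ray', 'status'], capture_output=True, text=True)
    if status.returncode != 0:
        return True  # No healthy Ray cluster reachable on the expected port.
    pip_list = subprocess.run('pip3 list | grep "ray "', shell=True,
                              capture_output=True, text=True)
    return EXPECTED_RAY_VERSION not in pip_list.stdout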
This unblocks the case where a user uses a Docker image or a custom cloud image that has Python 3.11 in the base environment.
#2801 should be a more robust solution, but given the overhead it could introduce, we can get this PR in first.
TODO:
Tested (run the relevant ones):
bash format.sh
sky launch -c test-ray-2.9 --cloud aws --cpus 2 "python --version" --image-id docker:continuumio/miniconda3
sky exec test-ray-2.9 "for i in \`seq 1 140\`; do echo \$i; sleep 1; done"
for i in `seq 1 100`; do sky exec -d test-ray-2.9 "echo hi; sleep 10000; echo bye"; done
sky queue
sky launch -c test-mn --num-nodes 4 --cloud aws --cpus 2 --image-id docker:continuumio/miniconda3 echo \$SKYPILOT_NODE_RANK
sky launch -c test-azure --cloud azure --cpus 2 --num-nodes 4 "echo hi; sleep 1000"
sky launch -c test-fluid --cloud fluidstack --gpus V100 nvidia-smi
sky launch -c test-fluid --cloud fluidstack --gpus V100 --num-nodes 4 nvidia-smi
sky launch -c test-tqdm test.yaml
pytest tests/test_smoke.py
pytest tests/test_smoke.py --aws
pytest tests/test_smoke.py --kubernetes
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh