Speed up refresh: delay the slower `ray status` call & use cached IPs. #2079

concretevitamin · 2023-06-13T04:44:55Z

I noticed that `status -r` is taking way too long. After some investigation with @Michaelvll, this PR implements two optimizations:

- Delay the slower `ray status` call.
- Use cached IPs.
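
A minimal sketch of the two ideas, not the actual implementation in this PR. It assumes the cheaper check is a cloud-side status query that can end the refresh early, so the slower `ray status` call runs last and only when it is still needed. `ClusterHandle`, `refresh_status`, and the callables passed in are hypothetical names for illustration; only the `use_cached_ips` parameter name comes from this PR.

```python
from typing import Callable, List, Optional


class ClusterHandle:
    """Hypothetical handle that remembers the last IPs it fetched."""

    def __init__(self, fetch_ips: Callable[[], List[str]]):
        # Assumption: fetch_ips is the slow lookup we want to avoid repeating.
        self._fetch_ips = fetch_ips
        self._cached_ips: Optional[List[str]] = None

    def external_ips(self, use_cached_ips: bool = True) -> List[str]:
        # Reuse the cached IPs when allowed; otherwise do the slow lookup
        # and refresh the cache.
        if use_cached_ips and self._cached_ips is not None:
            return self._cached_ips
        self._cached_ips = self._fetch_ips()
        return self._cached_ips


def refresh_status(query_cloud_status: Callable[[], str],
                   run_ray_status: Callable[[], bool]) -> str:
    """Run the cheap check first and the slow `ray status` check last."""
    cloud_status = query_cloud_status()
    if cloud_status != 'UP':
        # The slow call is skipped entirely, e.g. for a STOPPED cluster.
        return cloud_status
    # Assumption: an unhealthy runtime on running nodes is reported as INIT.
    return 'UP' if run_ray_status() else 'INIT'
```

Under these assumptions, a STOPPED cluster never pays for the slow call, which would be consistent with the STOPPED -> STOPPED case improving the most in the results.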

All tests below are on 1-node on-demand GCP clusters. These are all "normal" cases, where the runtime on the cluster has not become problematic.

TODO: we should keep speeding up refresh in the future.
- Identified that `handle.external_ips(use_cached_ips=False)` -> ray get head-ip/worker-ips is too slow; replace it with NodeProvider calls or cloud CLI/SDK calls?

Results:

do ray status last + use_cached_ips=True

STOPPED -> STOPPED

before: 31s
now: 3.7s

UP, autostop not set -> UP

before: 8.9s
now: 7.1s

UP, autostopped -> STOPPED

before: 22.2s
now: 4.2s

UP, autostop set -> UP

before: 7.6s
now: 7.3s

INIT -> INIT

before: 25.7s
now: 22.3s

(if we do the first optimization only) do ray status last

STOPPED -> STOPPED

before: 31s
now: 3.9s

UP, autostop not set -> UP

before: 8.9s
now: 10.9s

UP, autostopped -> STOPPED

before: 22.2s
now: 6s

UP, autostop set -> UP

before: 7.6s
now: 11.4s

INIT -> INIT

before: 25.7s
now: 22.4s
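
To make the two configurations easier to compare, here is a small sketch that turns the timings above into speedup ratios (before / now). The numbers are copied from the results; the dictionary and function names are illustrative only.

```python
# (before_seconds, now_seconds) for each transition, copied from the results.
BOTH_OPTIMIZATIONS = {
    'STOPPED -> STOPPED': (31.0, 3.7),
    'UP, autostop not set -> UP': (8.9, 7.1),
    'UP, autostopped -> STOPPED': (22.2, 4.2),
    'UP, autostop set -> UP': (7.6, 7.3),
    'INIT -> INIT': (25.7, 22.3),
}
RAY_STATUS_LAST_ONLY = {
    'STOPPED -> STOPPED': (31.0, 3.9),
    'UP, autostop not set -> UP': (8.9, 10.9),
    'UP, autostopped -> STOPPED': (22.2, 6.0),
    'UP, autostop set -> UP': (7.6, 11.4),
    'INIT -> INIT': (25.7, 22.4),
}


def speedup(before: float, now: float) -> float:
    """Ratio above 1.0 means faster than before; below 1.0 means slower."""
    return before / now


for name, results in [('both optimizations', BOTH_OPTIMIZATIONS),
                      ('ray status last only', RAY_STATUS_LAST_ONLY)]:
    print(name)
    for transition, (before, now) in results.items():
        print(f'  {transition}: {speedup(before, now):.1f}x')
```

With the first optimization only, both UP -> UP cases come out below 1.0x (slower than before); with cached IPs added, every case is at or above 1.0x.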

Tested (run the relevant ones):

Code formatting: bash format.sh
Any manual or new tests for this PR (please specify below): above
All smoke tests: pytest tests/test_smoke.py : pytest tests/test_smoke.py --generic-cloud aws
Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Michaelvll

Thanks for the optimization and benchmark @concretevitamin! The changes look good to me.

Michaelvll · 2023-06-13T16:49:53Z

sky/utils/cli_utils/status_utils.py

+            # accelerator_args is way too long.
+            # Convert from:
+            #  GCP(n1-highmem-8, {'tpu-v2-8': 1}, accelerator_args={'runtime_version': '2.5.0'}  # pylint: disable=line-too-long
+            # to:
+            #  GCP(n1-highmem-8, {'tpu-v2-8': 1}...)
+            pattern = ', accelerator_args={.*}'
+            launched_resource_str = re.sub(pattern, '...',
+                                           launched_resource_str)


It seems that if we specify disk_type, disk_tier, etc., they will appear after the `...`. Is that intended?

Yes, it'd show GCP(n1-highmem-8, {'tpu-v2-8': 1}..., cpus=2+, disk_tier=high), etc. Wdyt?
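
The behavior discussed here can be checked with a short standalone sketch that applies the same pattern from the diff. The second input string, with trailing `cpus` and `disk_tier` fields, is a made-up sample based on the reply above, and `shorten` is a hypothetical helper name.

```python
import re

# Same pattern as in the diff. Python's re treats `{` and `}` as literals
# here because `{.*}` is not a valid repetition quantifier.
pattern = ', accelerator_args={.*}'


def shorten(resource_str: str) -> str:
    """Replace the accelerator_args portion with '...'."""
    return re.sub(pattern, '...', resource_str)


plain = ("GCP(n1-highmem-8, {'tpu-v2-8': 1}, "
         "accelerator_args={'runtime_version': '2.5.0'})")
# Hypothetical sample with extra fields after accelerator_args.
with_extra = ("GCP(n1-highmem-8, {'tpu-v2-8': 1}, "
              "accelerator_args={'runtime_version': '2.5.0'}, "
              "cpus=2+, disk_tier=high)")

print(shorten(plain))       # GCP(n1-highmem-8, {'tpu-v2-8': 1}...)
print(shorten(with_extra))  # GCP(n1-highmem-8, {'tpu-v2-8': 1}..., cpus=2+, disk_tier=high)
```

Note that `.*` is greedy, so the match runs to the last `}` in the string. In these samples that is the closing brace of `accelerator_args`, but a later field containing braces would be swallowed as well.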

concretevitamin added 2 commits June 12, 2023 21:41

Speed up refresh: delay the slower ray status call & use cached IPs.

7956574

Skip showing accelerator_args in status.

41a6907

Michaelvll approved these changes Jun 13, 2023

concretevitamin added 2 commits June 13, 2023 10:11

Update logging

c30e501

Format

d378ed5

concretevitamin requested a review from Michaelvll June 13, 2023 17:28

Michaelvll approved these changes Jun 13, 2023

concretevitamin merged commit 23552c0 into master Jun 13, 2023

concretevitamin deleted the refresh-opts branch June 13, 2023 17:37
