Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up refresh: delay the slower ray status call & use cached IPs. #2079

Merged
merged 4 commits into from
Jun 13, 2023

Conversation

concretevitamin
Copy link
Member

@concretevitamin concretevitamin commented Jun 13, 2023

I noticed that status -r is taking way too long. With some investigation with @Michaelvll, this PR Implemented two optimizations:

  • delay the slower ray status call
  • use cached IPs

All tests below are 1-node on-demand GCP clusters. These are all "normal" case where the runtime on the cluster did not become problematic.

  • TODO: we should keep speeding up refresh in the future.
    • identified handle.external_ips(use_cached_ips=False) -> ray get head-ip/worker-ips is too slow; replace with NodeProvider calls or cloud CLI/SDK calls?

Results:

do ray status last + use_cached_ips=True

STOPPED -> STOPPED

  • before: 31s
  • now: 3.7s

UP, autostop not set -> UP

  • before: 8.9s
  • now: 7.1s

UP, autostopped -> STOPPED

  • before: 22.2s
  • now: 4.2s

UP, autostop set -> UP

  • before: 7.6s
  • now: 7.3s

INIT -> INIT

  • before: 25.7s
  • now: 22.3s

(if we do the first optimization only) do ray status last

STOPPED -> STOPPED

  • before: 31s
  • now: 3.9s

UP, autostop not set -> UP

  • before: 8.9s
  • now: 10.9s

UP, autostopped -> STOPPED

  • before: 22.2s
  • now: 6s

UP, autostop set -> UP

  • before: 7.6s
  • now: 11.4s

INIT -> INIT

  • before: 25.7s
  • now: 22.4s

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below): above
  • All smoke tests: pytest tests/test_smoke.py : pytest tests/test_smoke.py --generic-cloud aws
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the optimization and benchmark @concretevitamin! The changes look good to me.

Comment on lines +330 to +337
# accelerator_args is way too long.
# Convert from:
# GCP(n1-highmem-8, {'tpu-v2-8': 1}, accelerator_args={'runtime_version': '2.5.0'} # pylint: disable=line-too-long
# to:
# GCP(n1-highmem-8, {'tpu-v2-8': 1}...)
pattern = ', accelerator_args={.*}'
launched_resource_str = re.sub(pattern, '...',
launched_resource_str)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems if we specify the disk_type, disk_tier, etc, they will be after .... Is that intended?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it'd show GCP(n1-highmem-8, {'tpu-v2-8': 1}..., cpus=2+, dist_tier=high), etc. Wdyt?

@concretevitamin concretevitamin merged commit 23552c0 into master Jun 13, 2023
@concretevitamin concretevitamin deleted the refresh-opts branch June 13, 2023 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants