[Serve] Optimize ServeController.get_app_config() #45878

Merged

Conversation

JoshKarpel
Contributor

@JoshKarpel JoshKarpel commented Jun 11, 2024

Why are these changes needed?

Currently, ServeController.get_app_config() is called once for each application during ServeController.get_serve_instance_details(). Each of those calls requires a round-trip to the GCS to get the Serve checkpoint, but the Serve checkpoint has all of the information for all apps already. This PR replaces ServeController.get_app_config() with a new method that I've very cleverly named ServeController.get_app_configs(), which effectively pre-fetches all the application configs and leaves it up to the caller to decide what to do with them.
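
To make the round-trip argument concrete, here is a minimal sketch of the two access patterns. It is not the Ray implementation: `FakeKVStore` is a hypothetical stand-in for the GCS-backed store, the checkpoint is assumed to be a JSON mapping of app name to config (the real checkpoint format differs), and the two free functions only mirror the shape of the old and new controller methods.

```python
import json
from typing import Dict, Optional


class FakeKVStore:
    """Hypothetical stand-in for the GCS-backed KV store; counts reads."""

    def __init__(self, checkpoint: bytes):
        self._checkpoint = checkpoint
        self.reads = 0

    def get(self) -> bytes:
        self.reads += 1  # each call models one round-trip to the GCS
        return self._checkpoint


def get_app_config(kv: FakeKVStore, name: str) -> Optional[Dict]:
    """Old shape: every lookup re-reads the whole checkpoint."""
    return json.loads(kv.get()).get(name)


def get_app_configs(kv: FakeKVStore) -> Dict[str, Dict]:
    """New shape: read the checkpoint once and return every app's config."""
    return json.loads(kv.get())


app_names = [f"app{i}" for i in range(1800)]
checkpoint = json.dumps({n: {"name": n} for n in app_names}).encode()

# Per-app lookups: one store read per application.
kv_old = FakeKVStore(checkpoint)
old = {n: get_app_config(kv_old, n) for n in app_names}

# Batched lookup: one store read, then plain dict access per application.
kv_new = FakeKVStore(checkpoint)
configs = get_app_configs(kv_new)
new = {n: configs.get(n) for n in app_names}

print(kv_old.reads, kv_new.reads)  # 1800 1
```

Both paths produce the same per-app configs; only the number of store reads changes, which is the cost the PR description attributes to the old method.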

Before, 1800 apps with 1 deployment each, no DeploymentHandles, with #45872 :

[Image: controller-high-load-no-handles-after-fix]

After: so little time it doesn't even appear in the flamegraph!

[Image: controller-1800-apps-after-get-app-config-fix]

Related issue number

Discovered alongside #45872

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Comment on lines -382 to -385
    @_ensure_connected
    def get_app_config(self, name: str = SERVE_DEFAULT_APP_NAME) -> Dict:
        """Returns the most recently requested Serve config."""
        return ray.get(self._controller.get_app_config.remote(name))
Contributor Author

This seems to be unused, so I went ahead and removed it instead of refactoring it. But I'm happy to provide an equivalent method in this PR if desired.

@JoshKarpel
Contributor Author

JoshKarpel commented Jun 12, 2024

I'm seeing a test failure on test_get_applications_while_gcs_down https://buildkite.com/ray-project/microcheck/builds/1501#01900d0b-bdb6-4ed7-b49e-f6716901d47f that seems to be coming from

    try:
        details = await controller.get_serve_instance_details.remote()
    except ray.exceptions.RayTaskError as e:
        # Task failure sometimes are due to GCS
        # failure. When GCS failed, we expect a longer time
        # to recover.
        return Response(
            status=503,
            text=(
                "Failed to get a response from the controller. "
                f"The GCS may be down, please retry later: {e}"
            ),
        )
returning an HTTP 503 error, but I don't see how my changes could change the semantics such that that test could fail now but pass previously 🤔 ... the test seems to be expecting to get HTTP 200 responses even when GCS is down?

@JoshKarpel JoshKarpel marked this pull request as ready for review June 12, 2024 17:30

@venkatkalluru venkatkalluru left a comment

LGTM

python/ray/serve/_private/controller.py (outdated, resolved)
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Contributor

@shrekris-anyscale shrekris-anyscale left a comment

Thanks for the contribution! I left a couple comments.

python/ray/serve/_private/controller.py (outdated, resolved)
python/ray/serve/_private/controller.py (resolved)
@@ -877,6 +877,8 @@ def get_serve_instance_details(self) -> Dict:
        grpc_config = self.get_grpc_config()
        applications = {}

        app_configs = self.get_app_configs() or {}
Contributor

> I'm seeing a test failure on test_get_applications_while_gcs_down

I believe it's because you're trying to access the KV store while the GCS is down. Could you change this to not fetch the app configs when there are no applications?
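
One way to read this suggestion is as a guard in front of the fetch. The sketch below is illustrative only and is not the code that was merged: `collect_app_configs`, `fetch_all_configs`, and `store_is_down` are hypothetical names, and the store failure is modeled as a plain `ConnectionError`.

```python
from typing import Callable, Dict, List


def collect_app_configs(
    app_names: List[str],
    fetch_all_configs: Callable[[], Dict[str, Dict]],
) -> Dict[str, Dict]:
    """Return configs for the given apps, touching the store only if needed."""
    if not app_names:
        # No applications: skip the KV read entirely, so this path still
        # works when the backing store (the GCS in this PR) is unreachable.
        return {}
    all_configs = fetch_all_configs() or {}
    return {name: all_configs.get(name, {}) for name in app_names}


def store_is_down() -> Dict[str, Dict]:
    """Hypothetical fetcher that simulates an unreachable KV store."""
    raise ConnectionError("KV store unavailable")


# With no applications, the failing fetcher is never called.
print(collect_app_configs([], store_is_down))  # {}
```

With at least one application the fetch still happens, so a store outage would surface as an error on that path; the guard only covers the empty case that the failing test appears to exercise.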

Contributor Author

Thanks! Just pushed this up 🤞🏻

Contributor Author

Looks like this is passing now! Re-requested your review

JoshKarpel and others added 3 commits June 20, 2024 13:00
Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Contributor

@zcin zcin left a comment

LGTM!

@zcin zcin added the "go" label (add ONLY when ready to merge, run all tests) Jun 24, 2024
@shrekris-anyscale
Contributor

Premerge is failing @JoshKarpel. Could you take a look?

@JoshKarpel
Contributor Author

It looks like a backwards compat test in the Ray Jobs API with a (hopefully) unrelated error https://buildkite.com/ray-project/premerge/builds/27215#01904b67-78b6-49f9-b489-0c09dd2a9dd5:


[2024-06-24T18:22:19Z] 2024-06-24 18:21:31,250	ERR cli.py:68 -- ---------------------------------------------
[2024-06-24T18:22:19Z] 2024-06-24 18:21:31,250	ERR cli.py:69 -- Job 'c20edd7507564af4960751cdb1f096f9' failed
[2024-06-24T18:22:19Z] 2024-06-24 18:21:31,250	ERR cli.py:70 -- ---------------------------------------------
[2024-06-24T18:22:19Z] 2024-06-24 18:21:31,250	INFO cli.py:83 -- Status message: Unexpected error occurred: The actor died because of an error raised in its creation task, ray::_ray_internal_job_actor_c20edd7507564af4960751cdb1f096f9:JobSupervisor.__init__ (pid=2706, ip=172.16.0.5, repr=<ray.dashboard.modules.job.job_manager.FunctionActorManager._create_fake_actor_class.<locals>.TemporaryActor object at 0x7fe8210a0070>)
[2024-06-24T18:22:19Z]   File "/opt/miniconda/envs/jobs-backwards-compatibility-f3829cebbbab4d94b5d878308ae2264f/lib/python3.9/concurrent/futures/_base.py", line 439, in result
[2024-06-24T18:22:19Z]     return self.__get_result()
[2024-06-24T18:22:19Z]   File "/opt/miniconda/envs/jobs-backwards-compatibility-f3829cebbbab4d94b5d878308ae2264f/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
[2024-06-24T18:22:19Z]     raise self._exception
[2024-06-24T18:22:19Z] RuntimeError: The actor with name JobSupervisor failed to import on the worker. This may be because needed library dependencies are not installed in the worker environment:
[2024-06-24T18:22:19Z]
[2024-06-24T18:22:19Z] ray::_ray_internal_job_actor_c20edd7507564af4960751cdb1f096f9:JobSupervisor.__init__ (pid=2706, ip=172.16.0.5, repr=<ray.dashboard.modules.job.job_manager.FunctionActorManager._create_fake_actor_class.<locals>.TemporaryActor object at 0x7fe8210a0070>)
[2024-06-24T18:22:19Z]   File "/opt/miniconda/envs/jobs-backwards-compatibility-f3829cebbbab4d94b5d878308ae2264f/lib/python3.9/site-packages/ray/dashboard/modules/job/job_manager.py", line 28, in <module>
[2024-06-24T18:22:19Z]     from ray.job_submission import JobStatus
[2024-06-24T18:22:19Z]   File "/opt/miniconda/envs/jobs-backwards-compatibility-f3829cebbbab4d94b5d878308ae2264f/lib/python3.9/site-packages/ray/job_submission/__init__.py", line 2, in <module>
[2024-06-24T18:22:19Z]     from ray.dashboard.modules.job.sdk import JobSubmissionClient
[2024-06-24T18:22:19Z]   File "/opt/miniconda/envs/jobs-backwards-compatibility-f3829cebbbab4d94b5d878308ae2264f/lib/python3.9/site-packages/ray/dashboard/modules/job/sdk.py", line 24, in <module>
[2024-06-24T18:22:19Z]     from ray.dashboard.modules.dashboard_sdk import SubmissionClient
[2024-06-24T18:22:19Z]   File "/opt/miniconda/envs/jobs-backwards-compatibility-f3829cebbbab4d94b5d878308ae2264f/lib/python3.9/site-packages/ray/dashboard/modules/dashboard_sdk.py", line 9, in <module>
[2024-06-24T18:22:19Z]     from pkg_resources import packaging
[2024-06-24T18:22:19Z] ImportError: cannot import name 'packaging' from 'pkg_resources' (/tmp/ray/session_2024-06-24_18-21-21_620793_2512/runtime_resources/pip/8e07f3c7538ebeb5b023a55baf02d6c8fee8bc80/virtualenv/lib/python3.9/site-packages/pkg_resources/__init__.py)

I merged latest master, will see if it still fails on the new CI run 🤞🏻

@JoshKarpel
Contributor Author

Looks like the failure was ephemeral @shrekris-anyscale 🤞🏻

@shrekris-anyscale shrekris-anyscale merged commit a709b8f into ray-project:master Jun 25, 2024
6 checks passed
@JoshKarpel
Contributor Author

Thanks!

@JoshKarpel JoshKarpel deleted the optimize-get-app-config branch June 26, 2024 01:52