[data.llm][API] Allow tuple for concurrency arg #55867
Conversation
Signed-off-by: Aleksei Starikov <aleksei.starikov.ax@gmail.com>
Code Review
This pull request updates the concurrency argument to accept a tuple for defining an autoscaling worker pool range, in addition to an integer for a fixed number of workers. The changes are applied across several processor configurations and their documentation. My main feedback concerns a potential unintended change in behavior for the vLLMEngineStage when an integer is provided for concurrency, which now enables autoscaling instead of creating a fixed-size pool as it did previously. This contradicts an existing comment and could affect performance.
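For illustration, here is a minimal sketch of the two accepted forms of the argument when configuring a processor; the model name and any fields other than concurrency are placeholders for the example, not part of this PR.

```python
# Illustrative sketch only: fields other than `concurrency` are placeholders.
from ray.data.llm import vLLMEngineProcessorConfig

# Previous int-only form: a fixed number of workers.
fixed = vLLMEngineProcessorConfig(
    model_source="facebook/opt-125m",
    concurrency=4,
)

# New tuple form: an autoscaling worker pool between 2 and 8 replicas.
autoscaling = vLLMEngineProcessorConfig(
    model_source="facebook/opt-125m",
    concurrency=(2, 8),
)
```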
/gemini review
Code Review
This pull request extends the concurrency argument to accept a tuple for defining an autoscaling actor pool range, which is a great enhancement for flexibility. The implementation looks good, with updates to docstrings and type hints to reflect this change.
I've left one comment regarding a potential issue when concurrency is set to 0, which could lead to an error. I've also suggested moving the new helper function to a more central location for better code organization and reusability.
Overall, a solid change that improves the usability of the LLM batch processing APIs.
/gemini review
Code Review
This pull request introduces the ability to use a tuple for the concurrency argument to enable autoscaling actor pools. The changes are well-implemented across the public API docstrings and internal processor configurations. I've found a couple of minor issues. One is a potential bug where setting concurrency=None for the vLLM engine processor would lead to an autoscaling pool from 1 to infinity, which is likely not the intended behavior. The other is a slightly misleading error message in the validation logic. I've provided suggestions to address both.
/gemini review
Code Review
This pull request extends the concurrency argument to accept a tuple for defining an autoscaling actor pool range, in addition to an integer. The changes are applied across several processor configurations and are supported by a new validation function and comprehensive tests. My review focuses on potential behavioral changes and edge cases introduced by this new flexibility. I've identified a few areas for improvement: ensuring safer default behavior when concurrency is None, clarifying an error message, and highlighting a potentially unintended change in how integer-based concurrency is handled for vLLMEngineStage which might impact performance.
/gemini review
Code Review
This pull request extends the concurrency argument in LLM data processing configurations to accept a tuple (min, max) for autoscaling actor pools. This is a great enhancement for flexibility. The implementation introduces a new validation function _processor_concurrency and updates various processor configurations and docstrings to reflect this change. New tests have also been added to cover the validation logic.
My main feedback is regarding the handling of integer values for concurrency, which now translates to an autoscaling pool from 1 to n. This changes the previous behavior for GPU stages which used a fixed-size pool for performance reasons, and is inconsistent with Ray Data's general API. I've suggested a change to align with the expected behavior.
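For reference, this is the Ray Data convention the review alludes to for class-based map_batches UDFs: an int requests a fixed-size actor pool, while a (min, max) tuple requests an autoscaling pool. The dataset and UDF below are placeholders.

```python
# Sketch of Ray Data's map_batches convention referenced in the review;
# the dataset and UDF are placeholders.
import ray

class Identity:
    def __call__(self, batch):
        return batch

ds = ray.data.range(1000)

# int -> fixed-size actor pool of 4 workers.
ds.map_batches(Identity, concurrency=4, batch_size=100).materialize()

# (min, max) tuple -> autoscaling actor pool between 2 and 8 workers.
ds.map_batches(Identity, concurrency=(2, 8), batch_size=100).materialize()
```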
/gemini review
Code Review
This pull request updates the concurrency argument to accept a tuple for specifying a min/max range for autoscaling actor pools. The implementation is clean and well-tested. However, I've identified one significant issue: the change introduces a backward-incompatible behavior for the vLLMEngineStage when concurrency is an integer. This could lead to performance regressions due to vLLM's slow startup times. My review includes a detailed comment on this issue.
Force-pushed from e797980 to 8090d1d
/gemini review
Code Review
This pull request restores support for tuple-based concurrency arguments for autoscaling, which is a great improvement for flexibility. The added validation logic in _validate and the helper functions _to_auto_scale_cpu and _to_auto_scale_gpu are well-implemented and thoroughly tested.
My main feedback is that while the docstrings for SGLangEngineProcessorConfig have been updated to reflect this change, the underlying implementation in sglang_engine_proc.py has not been updated to handle tuple concurrency. This will lead to runtime errors. I've left a specific comment with a suggestion to address this.
/gemini review
Code Review
This pull request successfully restores the ability to specify concurrency as a tuple for autoscaling, a feature that was previously available. The changes are well-structured, introducing a new params.py module for validating and handling the concurrency parameter for both CPU and GPU scaling scenarios. This centralization of logic is a good improvement. The accompanying tests for the new module are comprehensive and cover various edge cases. The modifications in vllm and sglang processors correctly adopt the new utility functions. Docstrings and type hints are also updated to reflect these changes. Overall, this is a solid contribution.
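For readers following along, a hedged sketch of what the params.py helpers named above might look like; the function names come from the review, but the bodies are assumptions based on the behavior described in this PR, not the merged implementation.

```python
# Hedged sketch of the params.py helpers (_to_auto_scale_cpu / _to_auto_scale_gpu)
# mentioned in the review; bodies are assumptions, not the actual merged code.
from typing import Tuple, Union

Concurrency = Union[int, Tuple[int, int]]

def _to_auto_scale_cpu(concurrency: Concurrency) -> Tuple[int, int]:
    # CPU stages: an int n autoscales from 1 to n; a tuple passes through as-is.
    if isinstance(concurrency, int):
        return (1, concurrency)
    return concurrency

def _to_auto_scale_gpu(concurrency: Concurrency) -> Tuple[int, int]:
    # GPU stages: an int n keeps a fixed-size pool (n, n) because engine startup
    # is expensive; users can pass (1, n) explicitly to opt into autoscaling.
    if isinstance(concurrency, int):
        return (concurrency, concurrency)
    return concurrency
```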
Force-pushed from 7394882 to 1ec6352
Hi, @lk-chen! Could you please check this PR?
kouroshHakha
left a comment
Hey @axreldable,
Thanks for your contribution. I think the approach makes sense. I would just refactor it a bit differently, as follows:
Let's make the functionality in params.py part of the validation logic in ProcessingConfig in base.py:
class ProcessingConfig:
    def validate_concurrency(...)

    def concurrency_tuple(static=False):
        """If static=True it will return (n, n) if n is int, else it will return (1, n). If n is a tuple it will return the tuple as is."""

So instead of cpu and gpu concurrency we can do config.concurrency_tuple(static=False) and config.concurrency_tuple(static=True) respectively.
Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
Force-pushed from b3ed43d to aed5e98
Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
kouroshHakha
left a comment
Thanks.
@axreldable There seems to be some release test regression regarding throughput. Please see the logs attached. It is timing out on FAILED test_batch_vllm.py::test_async_udf_queue_capped[4]
@kouroshHakha, thank you for raising it! The error in the logs shows that test_async_udf_queue_capped is failing, but I doubt that I have changed the logic. Are there enough resources for the regression test?
Testing to see why test_async_udf_queue_capped fails @ 4.
Update @axreldable:
- I was able to run the same release test @ 1 and @ 4 successfully after checking out this PR branch.
- I was able to run the entire test script (not just the failing test @ concurrency 4) on my dev workspace:
PASSED
======================================= 7 passed in 1007.13s (0:16:47) ========================================
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=759283) INFO 09-03 16:26:47 [loggers.py:122] Engine 000: Avg prompt throughput: 2927.8 tokens/s, Avg generation throughput: 62.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 68.1% [repeated 66x across cluster]
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=759283) [vLLM] Elapsed time for batch 0b93d11f87004df2b4e32f86b0735b66 with size 3: 0.1276297929980501 [repeated 190x across cluster]
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=759283) Shutting down vLLM engine [repeated 3x across cluster]
I agree that you don't seem to be changing the core logic of autoscaling=False and min/max size.
Update: this may not be your code change's fault - that specific test was flaky / failing for some other unrelated PRs. I will look into this, revert my commit, and (hopefully) get this merged soon :)
Hi @nrghosh,
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Hi @axreldable, I was able to repro the error in a workspace that copies the compute/env configs from CI. I added a temporary cleanup fix (while we investigate some deeper issues around memory use and cleanup for Ray/vLLM) to address the resource contention hang that we were seeing in CI, and manually validated that it unblocks the failing release test on your branch. Pending tests, I'll check in with the team and see about moving this forward. Thanks again for the contribution 😄
nrghosh
left a comment
lgtm - pending tests finally unblocked
Great, thank you, @nrghosh! @kouroshHakha, could you please take a look?
kouroshHakha
left a comment
LGTM. Thanks @nrghosh for making the release test pass and @axreldable for the contribution.
bveeramani
left a comment
Stamp
Signed-off-by: Aleksei Starikov <aleksei.starikov.ax@gmail.com> Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: zac <zac@anyscale.com>
Signed-off-by: Aleksei Starikov <aleksei.starikov.ax@gmail.com> Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Signed-off-by: Aleksei Starikov <aleksei.starikov.ax@gmail.com> Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Original PR #55867 by axreldable Original: ray-project/ray#55867
Merged from original PR #55867 Original: ray-project/ray#55867

Why are these changes needed?
This PR restores the original behavior from "[Ray data llm] Allow specifying autoscaling actors for vllm":
- Union[int, Tuple[int, int]] for concurrency
- cpu vs gpu scaling (see the sketch after the history list below):
  - cpu: (1, n) or (m, n); for an int, auto scaling from 1 to n applies
  - gpu: (n, n) or (m, n); for an int, auto scaling is not applied, instead users can pass (1, n) explicitly
- applies to the vllm and sglang processors

On top of the former functionality:
- changed the concurrency type hint to be non-Optional, as None is not supported
- added validation via @field_validator
- fixed the SGLangEngineProcessorConfig doc strings (vLLM -> SGLang)

History behind this change
- [data/llm] Allow specifying autoscaling actors for vllm
- [ray.data.llm] Add hint of how to optimize throughput
- [data.llm][Bugfix] Fix doc to only support int concurrency

Now there is a request to restore tuple support for concurrency:
- [Data] [LLM] Stage of engine support flexible compute.size
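To make the cpu vs gpu mapping above concrete, here is a hypothetical sketch expressed with Ray Data's ActorPoolStrategy; the function name and gpu_stage flag are illustrative only and not part of the actual processor code.

```python
# Hypothetical illustration of the cpu vs. gpu concurrency mapping described
# above, using Ray Data's ActorPoolStrategy; names here are for the example only.
from typing import Tuple, Union
from ray.data import ActorPoolStrategy

def compute_for(concurrency: Union[int, Tuple[int, int]], gpu_stage: bool) -> ActorPoolStrategy:
    if isinstance(concurrency, tuple):
        min_size, max_size = concurrency
        return ActorPoolStrategy(min_size=min_size, max_size=max_size)
    if gpu_stage:
        # GPU stages keep a fixed-size pool for an int, since engine startup is slow.
        return ActorPoolStrategy(size=concurrency)
    # CPU stages autoscale from 1 to n for an int.
    return ActorPoolStrategy(min_size=1, max_size=concurrency)
```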
Related issue number
Closes #55480
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.