
Conversation

@axreldable (Contributor) commented Aug 23, 2025

Why are these changes needed?

This PR restores the original behavior introduced in [Ray data llm] Allow specifying autoscaling actors for vllm:

  • supports Union[int, Tuple[int, int]] for concurrency
  • distinguishes CPU vs. GPU scaling:
    • CPU stages: (1, n) or (m, n)
      • if the user provides an int n, autoscaling from 1 to n applies
    • GPU stages: (n, n) or (m, n)
      • if the user provides an int n, autoscaling is not applied; to enable it, pass (1, n) explicitly
  • applies the changes to both the vLLM and SGLang processors

On top of the original functionality, this PR:

  • changes the concurrency type hint to be non-Optional, since None is not supported
  • introduces validation for the concurrency parameter using a pydantic @field_validator (see the sketch below)
  • fixes typos in the SGLangEngineProcessorConfig docstrings (vLLM -> SGLang)
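
Below is a minimal sketch of these semantics, assuming a pydantic model. The helper names echo the _to_auto_scale_cpu/_to_auto_scale_gpu functions discussed later in this thread; exact names and signatures in the merged code may differ.

from typing import Tuple, Union

from pydantic import BaseModel, field_validator

Concurrency = Union[int, Tuple[int, int]]


class ExampleProcessorConfig(BaseModel):
    # Fixed pool size (int) or autoscaling range (min, max).
    concurrency: Concurrency = 1

    @field_validator("concurrency")
    @classmethod
    def _validate_concurrency(cls, value: Concurrency) -> Concurrency:
        if isinstance(value, int):
            if value < 1:
                raise ValueError("concurrency must be >= 1")
            return value
        lo, hi = value
        if lo < 1 or hi < lo:
            raise ValueError(f"invalid concurrency range {value}")
        return value


def _to_auto_scale_cpu(concurrency: Concurrency) -> Tuple[int, int]:
    # CPU stages: an int n means "autoscale from 1 to n".
    return (1, concurrency) if isinstance(concurrency, int) else concurrency


def _to_auto_scale_gpu(concurrency: Concurrency) -> Tuple[int, int]:
    # GPU stages: an int n means a fixed pool of n actors (engine startup is
    # expensive); pass (1, n) explicitly to opt into autoscaling.
    return (concurrency, concurrency) if isinstance(concurrency, int) else concurrency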

History behind this change

[data/llm] Allow specifying autoscaling actors for vllm

  • concurrency: int -> concurrency: Optional[Union[int, Tuple[int, int]]]
  • non-LLM stages: (1, n) or (m, n)
  • vLLM stage: n or (m, n)

[ray.data.llm] Add hint of how to optimize throughput

  • vLLM stage:
    • the concurrency argument was changed to compute
    • this introduced a bug when concurrency is a tuple

[data.llm][Bugfix] Fix doc to only support int concurrency

  • remove tuple support

Now there is a request to restore tuple support for concurrency:
[Data] [LLM] Stage of engine support flexible compute.size

  • This PR addresses that request; a usage sketch follows below.
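
For illustration, this is roughly what the restored user-facing API allows. The model name is a placeholder and parameter names follow the public ray.data.llm docs, so details may differ across Ray versions.

from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig

# Autoscaling pool of 1 to 4 vLLM engine replicas instead of a fixed size.
config = vLLMEngineProcessorConfig(
    model_source="unsloth/Llama-3.1-8B-Instruct",
    concurrency=(1, 4),
    batch_size=64,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(max_tokens=64),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"]),
)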

Related issue number

Closes #55480

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@axreldable requested review from a team as code owners August 23, 2025 18:22
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request updates the concurrency argument to accept a tuple for defining an autoscaling worker pool range, in addition to an integer for a fixed number of workers. The changes are applied across several processor configurations and their documentation. My main feedback concerns a potential unintended change in behavior for the vLLMEngineStage when an integer is provided for concurrency, which now enables autoscaling instead of creating a fixed-size pool as it did previously. This contradicts an existing comment and could affect performance.

@ray-gardener bot added the data (Ray Data-related issues), llm, and community-contribution (Contributed by the community) labels Aug 23, 2025
@axreldable (Contributor Author) commented:

/gemini review

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request extends the concurrency argument to accept a tuple for defining an autoscaling actor pool range, which is a great enhancement for flexibility. The implementation looks good, with updates to docstrings and type hints to reflect this change.

I've left one comment regarding a potential issue when concurrency is set to 0, which could lead to an error. I've also suggested moving the new helper function to a more central location for better code organization and reusability.

Overall, a solid change that improves the usability of the LLM batch processing APIs.

@axreldable (Contributor Author) commented:

/gemini review

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces the ability to use a tuple for the concurrency argument to enable autoscaling actor pools. The changes are well-implemented across the public API docstrings and internal processor configurations. I've found a couple of minor issues. One is a potential bug where setting concurrency=None for the vLLM engine processor would lead to an autoscaling pool from 1 to infinity, which is likely not the intended behavior. The other is a slightly misleading error message in the validation logic. I've provided suggestions to address both.

@axreldable (Contributor Author) commented:

/gemini review

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request extends the concurrency argument to accept a tuple for defining an autoscaling actor pool range, in addition to an integer. The changes are applied across several processor configurations and are supported by a new validation function and comprehensive tests. My review focuses on potential behavioral changes and edge cases introduced by this new flexibility. I've identified a few areas for improvement: ensuring safer default behavior when concurrency is None, clarifying an error message, and highlighting a potentially unintended change in how integer-based concurrency is handled for vLLMEngineStage which might impact performance.

@axreldable (Contributor Author) commented:

/gemini review

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request extends the concurrency argument in LLM data processing configurations to accept a tuple (min, max) for autoscaling actor pools. This is a great enhancement for flexibility. The implementation introduces a new validation function _processor_concurrency and updates various processor configurations and docstrings to reflect this change. New tests have also been added to cover the validation logic.

My main feedback is regarding the handling of integer values for concurrency, which now translates to an autoscaling pool from 1 to n. This changes the previous behavior for GPU stages which used a fixed-size pool for performance reasons, and is inconsistent with Ray Data's general API. I've suggested a change to align with the expected behavior.

@axreldable (Contributor Author) commented:

/gemini review

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request updates the concurrency argument to accept a tuple for specifying a min/max range for autoscaling actor pools. The implementation is clean and well-tested. However, I've identified one significant issue: the change introduces a backward-incompatible behavior for the vLLMEngineStage when concurrency is an integer. This could lead to performance regressions due to vLLM's slow startup times. My review includes a detailed comment on this issue.

@axreldable (Contributor Author) commented:

/gemini review

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request restores support for tuple-based concurrency arguments for autoscaling, which is a great improvement for flexibility. The added validation logic in _validate and the helper functions _to_auto_scale_cpu and _to_auto_scale_gpu are well-implemented and thoroughly tested.

My main feedback is that while the docstrings for SGLangEngineProcessorConfig have been updated to reflect this change, the underlying implementation in sglang_engine_proc.py has not been updated to handle tuple concurrency. This will lead to runtime errors. I've left a specific comment with a suggestion to address this.

@axreldable (Contributor Author) commented:

/gemini review

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request successfully restores the ability to specify concurrency as a tuple for autoscaling, a feature that was previously available. The changes are well-structured, introducing a new params.py module for validating and handling the concurrency parameter for both CPU and GPU scaling scenarios. This centralization of logic is a good improvement. The accompanying tests for the new module are comprehensive and cover various edge cases. The modifications in vllm and sglang processors correctly adopt the new utility functions. Docstrings and type hints are also updated to reflect these changes. Overall, this is a solid contribution.

@axreldable (Contributor Author) commented:

Hi, @lk-chen!

Could you please check this PR?

@kouroshHakha (Contributor) left a comment:

Hey @axreldable,

Thanks for your contribution. I think the approach makes sense. I would just refactor it a bit differently, as follows:

Let's make the functionality in params.py part of the validation logic in ProcessingConfig in base.py:

class ProcessingConfig:

    def validate_concurrency(...)

    def concurrency_tuple(self, static=False):
        """If static=True, return (n, n) when n is an int; otherwise return (1, n).
        If n is a tuple, return the tuple as is."""

So instead of separate CPU and GPU concurrency helpers, we can call config.concurrency_tuple(static=False) and config.concurrency_tuple(static=True), respectively.
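
A minimal sketch of how that suggested method might behave (an illustration of the proposal above, not the merged implementation; the class name and constructor are assumed):

from typing import Tuple, Union


class ProcessingConfigSketch:
    """Hypothetical stand-in for the real config class in base.py."""

    def __init__(self, concurrency: Union[int, Tuple[int, int]] = 1):
        self.concurrency = concurrency

    def concurrency_tuple(self, static: bool = False) -> Tuple[int, int]:
        # An int n becomes (n, n) for a static pool (e.g. GPU stages) or
        # (1, n) for an autoscaling pool; a tuple passes through unchanged.
        if isinstance(self.concurrency, int):
            n = self.concurrency
            return (n, n) if static else (1, n)
        return self.concurrency

# Usage, per the suggestion above:
# config.concurrency_tuple(static=False)  -> CPU-style (1, n)
# config.concurrency_tuple(static=True)   -> GPU-style (n, n)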

@alexeykudinkin removed the data (Ray Data-related issues) label Aug 25, 2025
Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
@kouroshHakha changed the title from "[llm] Allow tuple for concurrency arg" to "[data.llm][API] Allow tuple for concurrency arg" Sep 2, 2025
Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
@kouroshHakha (Contributor) left a comment:

Thanks.

@kouroshHakha (Contributor) left a comment:

@axreldable There seems to be some release test regression regarding throughput. Please see the logs attached. It is timing out on FAILED test_batch_vllm.py::test_async_udf_queue_capped[4]

Anyscale LLM Batch Log.log

@axreldable (Contributor Author) commented:

@kouroshHakha, thank you for raising it!

The error in the logs is:

E               ray.exceptions.GetTimeoutError: Timed out while starting actors. This may mean that the cluster does not have enough resources for the requested actor pool.

The test test_async_udf_queue_capped is failing for concurrency=4.

I don't think I have changed the logic here.
It's requesting 4 executors for map_batches.

Is there enough resource for the regression test?

@nrghosh (Contributor) left a comment:

testing to see why test_async_udf_queue_capped fails @ 4

update @axreldable

  1. I was able to run the same release test @ 1 and @ 4 successfully after checking out this PR branch.

  2. I was able to run the entire test script (not just the failing test @ concurrency 4) on my dev workspace

PASSED                                                                                                         
                                                                                                               
======================================= 7 passed in 1007.13s (0:16:47) ========================================
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=759283) INFO 09-03 16:26:47 [loggers.py:122] Engine 000: Avg prompt throughput: 2927.8 tokens/s, Avg generation throughput: 62.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 68.1% [repeated 66x across cluster]
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=759283) [vLLM] Elapsed time for batch 0b93d11f87004df2b4e32f86b0735b66 with size 3: 0.1276297929980501 [repeated 190x across cluster]
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=759283) Shutting down vLLM engine [repeated 3x across cluster]
(base) ray@ip-10-0-218-26:~/default/work/ray$ 

I agree that you don't seem to be changing the core logic of autoscaling=False and min/max size.

Update

this may not be your code change's fault - that specific test was flaky / failing for some other unrelated PRs. Will look into this, revert my commit, and (hopefully) get this merged soon :)

@richardliaw added the data (Ray Data-related issues) label Sep 5, 2025
@axreldable (Contributor Author) commented:

Hi @nrghosh,
Thank you for looking into it! Just let me know if I can help in any way.

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
@nrghosh (Contributor) commented Sep 10, 2025

Hi @axreldable,

I was able to repro the error in a workspace that copies the compute/env configs from CI. I added a temporary cleanup fix (while we investigate some deeper issues around memory use and cleanup for Ray/vLLM) to address the resource contention hang we were seeing in CI, and manually validated that it unblocks the failing release test on your branch.

(base) ray@ip:~/default/work/ray/release/llm_tests/batch$ pytest -vs --tb=short test_batch_vllm.py
======================== 8 passed in 1347.19s (0:22:27) =========================
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=722189) INFO 09-09 17:03:24 [loggers.py:122] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 68.1% [repeated 23x across cluster]                                                                                                                                                     
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=722189) [vLLM] Elapsed time for batch 0d0ef92f2f3645e9837611bb519cdd77 with size 3: 0.19336262499928125 [repeated 63x across cluster]
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=722189) Shutting down vLLM engine [repeated 3x across cluster]

Pending tests, I'll check in with the team and see about moving this forward. Thanks again for the contribution 😄

@nrghosh (Contributor) left a comment:

lgtm - pending tests finally unblocked

@axreldable (Contributor Author) commented:

Great, thank you, @nrghosh !

@kouroshHakha , could you please take a look?

@kouroshHakha (Contributor) left a comment:

LGTM. Thanks @nrghosh for making the release test pass and @axreldable for the contribution.

@kouroshHakha enabled auto-merge (squash) September 10, 2025 15:36
@bveeramani (Member) left a comment:

Stamp

@kouroshHakha merged commit a8ba8a6 into ray-project:master Sep 10, 2025
6 checks passed
@axreldable deleted the llm_compute branch September 11, 2025 13:05
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
Signed-off-by: Aleksei Starikov <aleksei.starikov.ax@gmail.com>
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: zac <zac@anyscale.com>
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
Signed-off-by: Aleksei Starikov <aleksei.starikov.ax@gmail.com>
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
Signed-off-by: Aleksei Starikov <aleksei.starikov.ax@gmail.com>
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Signed-off-by: Aleksei Starikov <aleksei.starikov.ax@gmail.com>
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>

Labels

community-contribution (Contributed by the community), data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests), llm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Data] [LLM] Stage of engine support flexible compute.size

6 participants