[data.llm][API] Allow tuple for concurrency arg #55867
Conversation
Signed-off-by: Aleksei Starikov <aleksei.starikov.ax@gmail.com>
Code Review
This pull request updates the concurrency argument to accept a tuple for defining an autoscaling worker pool range, in addition to an integer for a fixed number of workers. The changes are applied across several processor configurations and their documentation. My main feedback concerns a potential unintended change in behavior for the vLLMEngineStage when an integer is provided for concurrency, which now enables autoscaling instead of creating a fixed-size pool as it did previously. This contradicts an existing comment and could affect performance.
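For illustration, here is a minimal sketch of the two accepted forms of the argument when configuring a processor; the model name and any fields other than concurrency are placeholders for the example, not part of this PR.

```python
# Illustrative sketch only: fields other than `concurrency` are placeholders.
from ray.data.llm import vLLMEngineProcessorConfig

# Previous int-only form: a fixed number of workers.
fixed = vLLMEngineProcessorConfig(
    model_source="facebook/opt-125m",
    concurrency=4,
)

# New tuple form: an autoscaling worker pool between 2 and 8 replicas.
autoscaling = vLLMEngineProcessorConfig(
    model_source="facebook/opt-125m",
    concurrency=(2, 8),
)
```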
/gemini review
Code Review
This pull request extends the concurrency argument to accept a tuple for defining an autoscaling actor pool range, which is a great enhancement for flexibility. The implementation looks good, with updates to docstrings and type hints to reflect this change.
I've left one comment regarding a potential issue when concurrency is set to 0, which could lead to an error. I've also suggested moving the new helper function to a more central location for better code organization and reusability.
Overall, a solid change that improves the usability of the LLM batch processing APIs.
/gemini review
Code Review
This pull request introduces the ability to use a tuple for the concurrency argument to enable autoscaling actor pools. The changes are well-implemented across the public API docstrings and internal processor configurations. I've found a couple of minor issues. One is a potential bug where setting concurrency=None for the vLLM engine processor would lead to an autoscaling pool from 1 to infinity, which is likely not the intended behavior. The other is a slightly misleading error message in the validation logic. I've provided suggestions to address both.
/gemini review
Code Review
This pull request extends the concurrency argument to accept a tuple for defining an autoscaling actor pool range, in addition to an integer. The changes are applied across several processor configurations and are supported by a new validation function and comprehensive tests. My review focuses on potential behavioral changes and edge cases introduced by this new flexibility. I've identified a few areas for improvement: ensuring safer default behavior when concurrency is None, clarifying an error message, and highlighting a potentially unintended change in how integer-based concurrency is handled for vLLMEngineStage which might impact performance.
/gemini review
Code Review
This pull request extends the concurrency argument in LLM data processing configurations to accept a tuple (min, max) for autoscaling actor pools. This is a great enhancement for flexibility. The implementation introduces a new validation function _processor_concurrency and updates various processor configurations and docstrings to reflect this change. New tests have also been added to cover the validation logic.
My main feedback is regarding the handling of integer values for concurrency, which now translates to an autoscaling pool from 1 to n. This changes the previous behavior for GPU stages which used a fixed-size pool for performance reasons, and is inconsistent with Ray Data's general API. I've suggested a change to align with the expected behavior.
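For reference, this is the Ray Data convention the review alludes to for class-based map_batches UDFs: an int requests a fixed-size actor pool, while a (min, max) tuple requests an autoscaling pool. The dataset and UDF below are placeholders.

```python
# Sketch of Ray Data's map_batches convention referenced in the review;
# the dataset and UDF are placeholders.
import ray

class Identity:
    def __call__(self, batch):
        return batch

ds = ray.data.range(1000)

# int -> fixed-size actor pool of 4 workers.
ds.map_batches(Identity, concurrency=4, batch_size=100).materialize()

# (min, max) tuple -> autoscaling actor pool between 2 and 8 workers.
ds.map_batches(Identity, concurrency=(2, 8), batch_size=100).materialize()
```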
/gemini review
Code Review
This pull request updates the concurrency argument to accept a tuple for specifying a min/max range for autoscaling actor pools. The implementation is clean and well-tested. However, I've identified one significant issue: the change introduces a backward-incompatible behavior for the vLLMEngineStage when concurrency is an integer. This could lead to performance regressions due to vLLM's slow startup times. My review includes a detailed comment on this issue.
Force-pushed from e797980 to 8090d1d
/gemini review
Code Review
This pull request restores support for tuple-based concurrency arguments for autoscaling, which is a great improvement for flexibility. The added validation logic in _validate and the helper functions _to_auto_scale_cpu and _to_auto_scale_gpu are well-implemented and thoroughly tested.
My main feedback is that while the docstrings for SGLangEngineProcessorConfig have been updated to reflect this change, the underlying implementation in sglang_engine_proc.py has not been updated to handle tuple concurrency. This will lead to runtime errors. I've left a specific comment with a suggestion to address this.
/gemini review
Code Review
This pull request successfully restores the ability to specify concurrency as a tuple for autoscaling, a feature that was previously available. The changes are well-structured, introducing a new params.py module for validating and handling the concurrency parameter for both CPU and GPU scaling scenarios. This centralization of logic is a good improvement. The accompanying tests for the new module are comprehensive and cover various edge cases. The modifications in vllm and sglang processors correctly adopt the new utility functions. Docstrings and type hints are also updated to reflect these changes. Overall, this is a solid contribution.
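For readers following along, a hedged sketch of what the params.py helpers named above might look like; the function names come from the review, but the bodies are assumptions based on the behavior described in this PR, not the merged implementation.

```python
# Hedged sketch of the params.py helpers (_to_auto_scale_cpu / _to_auto_scale_gpu)
# mentioned in the review; bodies are assumptions, not the actual merged code.
from typing import Tuple, Union

Concurrency = Union[int, Tuple[int, int]]

def _to_auto_scale_cpu(concurrency: Concurrency) -> Tuple[int, int]:
    # CPU stages: an int n autoscales from 1 to n; a tuple passes through as-is.
    if isinstance(concurrency, int):
        return (1, concurrency)
    return concurrency

def _to_auto_scale_gpu(concurrency: Concurrency) -> Tuple[int, int]:
    # GPU stages: an int n keeps a fixed-size pool (n, n) because engine startup
    # is expensive; users can pass (1, n) explicitly to opt into autoscaling.
    if isinstance(concurrency, int):
        return (concurrency, concurrency)
    return concurrency
```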
Force-pushed from 7394882 to 1ec6352
Hi, @lk-chen! Could you please check this PR?
kouroshHakha
left a comment
Hey @axreldable,
Thanks for your contribution. I think the approach makes sense. I would just refactor it a bit differently, as follows:
Let's make the functionality in params.py part of the validation logic in ProcessingConfig in base.py:
class ProcessingConfig:
    def validate_concurrency(...)

    def concurrency_tuple(static=False):
        """If static=True it will return (n, n) if n is int, else it will return (1, n). If n is a tuple it will return the tuple as is."""

So instead of cpu and gpu concurrency we can do config.concurrency_tuple(static=False) and config.concurrency_tuple(static=True) respectively.
Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
Force-pushed from b3ed43d to aed5e98
Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
kouroshHakha
left a comment
Thanks.
@axreldable There seems to be some release test regression regarding throughput. Please see the logs attached. It is timing out on FAILED test_batch_vllm.py::test_async_udf_queue_capped[4]
@kouroshHakha, thank you for raising it! The error in the logs shows that test_async_udf_queue_capped is failing, but I doubt that I have changed the logic. Are there enough resources for the regression test?
Testing to see why test_async_udf_queue_capped fails @ 4.
Update @axreldable:
- I was able to run the same release test @ 1 and @ 4 successfully after checking out this PR branch.
- I was able to run the entire test script (not just the failing test @ concurrency 4) on my dev workspace:
PASSED
======================================= 7 passed in 1007.13s (0:16:47) ========================================
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=759283) INFO 09-03 16:26:47 [loggers.py:122] Engine 000: Avg prompt throughput: 2927.8 tokens/s, Avg generation throughput: 62.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 68.1% [repeated 66x across cluster]
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=759283) [vLLM] Elapsed time for batch 0b93d11f87004df2b4e32f86b0735b66 with size 3: 0.1276297929980501 [repeated 190x across cluster]
(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=759283) Shutting down vLLM engine [repeated 3x across cluster]
I agree that you don't seem to be changing the core logic of autoscaling=False and min/max size.
Update: this may not be your code change's fault - that specific test was flaky / failing for some other unrelated PRs. I will look into this, revert my commit, and (hopefully) get this merged soon :)
Hi @nrghosh,
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Hi @axreldable, I was able to repro the error in a workspace that copies the compute/env configs from CI. I added a temporary cleanup fix (while we investigate some deeper issues around memory use and cleanup for Ray/vLLM) to address the resource contention hang that we were seeing in CI, and manually validated that it unblocks the failing release test on your branch. Pending tests, I'll check in with the team and see about moving this forward. Thanks again for the contribution 😄
nrghosh
left a comment
lgtm - pending tests finally unblocked
Great, thank you, @nrghosh! @kouroshHakha, could you please take a look?
kouroshHakha
left a comment
LGTM. Thanks @nrghosh for making the release test pass and @axreldable for the contribution.
bveeramani
left a comment
Stamp
Signed-off-by: Aleksei Starikov <aleksei.starikov.ax@gmail.com> Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: zac <zac@anyscale.com>
Signed-off-by: Aleksei Starikov <aleksei.starikov.ax@gmail.com> Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Signed-off-by: Aleksei Starikov <aleksei.starikov.ax@gmail.com> Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Original PR #55867 by axreldable Original: ray-project/ray#55867
Merged from original PR #55867 Original: ray-project/ray#55867

Why are these changes needed?
This PR restores the original behavior from "[Ray data llm] Allow specifying autoscaling actors for vllm":
- Union[int, Tuple[int, int]] for concurrency
- cpu vs gpu scaling (see the sketch after the history list below):
  - cpu: (1, n) or (m, n); for an int, auto scaling from 1 to n applies
  - gpu: (n, n) or (m, n); for an int, auto scaling is not applied, instead users can pass (1, n) explicitly
- applies to the vllm and sglang processors

On top of the former functionality:
- changed the concurrency type hint to be non-Optional, as None is not supported
- added validation via @field_validator
- fixed the SGLangEngineProcessorConfig doc strings (vLLM -> SGLang)

History behind this change
- [data/llm] Allow specifying autoscaling actors for vllm
- [ray.data.llm] Add hint of how to optimize throughput
- [data.llm][Bugfix] Fix doc to only support int concurrency

Now there is a request to restore tuple support for concurrency:
- [Data] [LLM] Stage of engine support flexible compute.size
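To make the cpu vs gpu mapping above concrete, here is a hypothetical sketch expressed with Ray Data's ActorPoolStrategy; the function name and gpu_stage flag are illustrative only and not part of the actual processor code.

```python
# Hypothetical illustration of the cpu vs. gpu concurrency mapping described
# above, using Ray Data's ActorPoolStrategy; names here are for the example only.
from typing import Tuple, Union
from ray.data import ActorPoolStrategy

def compute_for(concurrency: Union[int, Tuple[int, int]], gpu_stage: bool) -> ActorPoolStrategy:
    if isinstance(concurrency, tuple):
        min_size, max_size = concurrency
        return ActorPoolStrategy(min_size=min_size, max_size=max_size)
    if gpu_stage:
        # GPU stages keep a fixed-size pool for an int, since engine startup is slow.
        return ActorPoolStrategy(size=concurrency)
    # CPU stages autoscale from 1 to n for an int.
    return ActorPoolStrategy(min_size=1, max_size=concurrency)
```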
Related issue number
Closes #55480
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.