Conversation

@nrghosh
Contributor

@nrghosh nrghosh commented Oct 17, 2025

Description

Enable users to control resources and concurrency of preprocess/postprocess stages independently from the main LLM stage by passing Dataset.map() kwargs.

  • Add preprocess_map_kwargs and postprocess_map_kwargs parameters to build_llm_processor() and all builder functions
  • Update Processor class to store and apply map kwargs to dataset.map() calls
  • Add validation for map kwargs with warnings on unknown keys
  • Add unit tests and update docstrings

Users can now provision fractional CPU resources and tune Dataset.map() parameters per stage without workarounds, which improves utilization.
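
As a rough illustration of the store-and-forward idea in the bullets above, here is a minimal sketch. It does not use Ray and is not the actual Processor implementation: `FakeDataset` and `SketchProcessor` are hypothetical stand-ins, and the only thing taken from this PR is the pair of parameter names `preprocess_map_kwargs` and `postprocess_map_kwargs`.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


class FakeDataset:
    """Hypothetical stand-in for a dataset exposing .map(fn, **kwargs)."""

    def __init__(self, rows: List[Row]):
        self.rows = rows
        # Records the kwargs received by every map() call in the pipeline.
        self.map_calls: List[Dict[str, Any]] = []

    def map(self, fn: Callable[[Row], Row], **kwargs: Any) -> "FakeDataset":
        out = FakeDataset([fn(row) for row in self.rows])
        out.map_calls = self.map_calls + [kwargs]
        return out


class SketchProcessor:
    """Hypothetical processor: preprocess -> main stage -> postprocess."""

    def __init__(
        self,
        main_stage: Callable[[Row], Row],
        preprocess: Optional[Callable[[Row], Row]] = None,
        postprocess: Optional[Callable[[Row], Row]] = None,
        preprocess_map_kwargs: Optional[Dict[str, Any]] = None,
        postprocess_map_kwargs: Optional[Dict[str, Any]] = None,
    ):
        self.main_stage = main_stage
        self.preprocess = preprocess
        self.postprocess = postprocess
        # Copy the dicts so later caller-side mutation cannot change behavior.
        self.preprocess_map_kwargs = dict(preprocess_map_kwargs or {})
        self.postprocess_map_kwargs = dict(postprocess_map_kwargs or {})

    def __call__(self, ds: FakeDataset) -> FakeDataset:
        if self.preprocess is not None:
            ds = ds.map(self.preprocess, **self.preprocess_map_kwargs)
        ds = ds.map(self.main_stage)  # main stage keeps its own settings
        if self.postprocess is not None:
            ds = ds.map(self.postprocess, **self.postprocess_map_kwargs)
        return ds


processor = SketchProcessor(
    main_stage=lambda row: {**row, "generated": row["prompt"].upper()},
    preprocess=lambda row: {"prompt": row["text"].strip()},
    postprocess=lambda row: {"answer": row["generated"]},
    preprocess_map_kwargs={"num_cpus": 0.5},
    postprocess_map_kwargs={"num_cpus": 0.25},
)
result = processor(FakeDataset([{"text": " hi "}]))
```

In this sketch each stage sees only its own kwargs, so the CPU-bound pre/post steps can be sized independently of the main stage.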

Related issues

Addresses #57812

Additional information

Example usage:

  processor = build_llm_processor(
      config,
      preprocess=preprocess_fn,
      postprocess=postprocess_fn,
      preprocess_map_kwargs={"num_cpus": 0.5},
      postprocess_map_kwargs={"num_cpus": 0.25},
  )

instead of the earlier workaround, which applies each stage manually:

  # Preprocess
  ds = ray_dataset.map_batches(
      preprocess,
      batch_size=1,
      num_cpus=0.5,
      num_gpus=0,
  )

  # Build the processor without preprocess/postprocess stages
  proc = build_llm_processor(processor_config, preprocess=None, postprocess=None)

  # vLLM stage
  generated = proc(ds)

  # Postprocess
  captioned = generated.map_batches(
      postprocess,
      num_cpus=0.25,
      num_gpus=0,
  )

nrghosh and others added 5 commits October 16, 2025 16:39
…ss/postprocess

Addresses ray-project#57812.

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
@nrghosh nrghosh added the go add ONLY when ready to merge, run all tests label Oct 17, 2025
@nrghosh nrghosh marked this pull request as ready for review October 17, 2025 02:45
@nrghosh nrghosh requested a review from a team as a code owner October 17, 2025 02:45
Comment on lines 385 to 410
    @classmethod
    def validate_map_kwargs(cls, map_kwargs: Optional[Dict[str, Any]]) -> None:
        """Validate map kwargs contain only supported Dataset.map parameters.

        Args:
            map_kwargs: Optional kwargs to pass to Dataset.map().

        Note:
            Unknown keys will trigger a warning as they'll be passed as ray_remote_args.
        """
        if map_kwargs is None:
            return

        # Supported Dataset.map parameters
        supported_keys = {
            "compute",
            "fn_args",
            "fn_kwargs",
            "fn_constructor_args",
            "fn_constructor_kwargs",
            "num_cpus",
            "num_gpus",
            "memory",
            "concurrency",
            "ray_remote_args_fn",
        }

Contributor

This list is dynamic, and I don't think it's easy to maintain. We should just leave it up to the user to read the docs and provide the right map kwargs.

Contributor Author

Good point, resolved.
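
To illustrate the maintenance concern raised in this thread, here is a small sketch of how a hard-coded allowlist can go stale. Everything in it is hypothetical: `map_v2` is a stand-in for a later version of a map function that has gained a parameter called `new_option`, and neither helper is part of Ray. The allowlist keys are copied from the snippet under review.

```python
import inspect
from typing import Any, Callable, Dict, Set

# Allowlist frozen at the time the validator was written.
STATIC_SUPPORTED_KEYS: Set[str] = {
    "compute", "fn_args", "fn_kwargs", "fn_constructor_args",
    "fn_constructor_kwargs", "num_cpus", "num_gpus", "memory",
    "concurrency", "ray_remote_args_fn",
}


def map_v2(fn, *, compute=None, fn_args=None, fn_kwargs=None,
           fn_constructor_args=None, fn_constructor_kwargs=None,
           num_cpus=None, num_gpus=None, memory=None, concurrency=None,
           ray_remote_args_fn=None, new_option=None, **ray_remote_args):
    """Hypothetical later map signature that gained `new_option`."""
    return None


def unknown_keys(map_kwargs: Dict[str, Any], supported: Set[str]) -> Set[str]:
    """Return the keys a validator would warn about."""
    return set(map_kwargs) - supported


def named_parameters(func: Callable[..., Any]) -> Set[str]:
    """Names of explicitly declared (non-variadic) parameters of func."""
    kinds = (inspect.Parameter.POSITIONAL_OR_KEYWORD,
             inspect.Parameter.KEYWORD_ONLY)
    return {p.name for p in inspect.signature(func).parameters.values()
            if p.kind in kinds}


user_kwargs = {"num_cpus": 0.5, "new_option": True}

# The frozen list flags a key that the function now accepts by name.
stale_warning = unknown_keys(user_kwargs, STATIC_SUPPORTED_KEYS)

# Compared against the function's current signature, nothing is unknown.
actual_unknown = unknown_keys(user_kwargs, named_parameters(map_v2))
```

Here `stale_warning` contains `new_option` while `actual_unknown` is empty, which is the kind of false warning a static list would produce once the underlying signature changes.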

@ray-gardener ray-gardener bot added docs An issue or change related to documentation data Ray Data-related issues labels Oct 17, 2025
@kouroshHakha
Contributor

@nrghosh tests are failing.

@nrghosh nrghosh force-pushed the nrghosh/add-pre-post-kwargs branch from dc0b178 to ca429d4 Compare October 17, 2025 20:25
- parameters change over time, don't enforce statically
- remove validation from llm.py::build_llm_processor

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
@nrghosh nrghosh force-pushed the nrghosh/add-pre-post-kwargs branch from ca429d4 to 21b95e4 Compare October 17, 2025 21:08
@kouroshHakha kouroshHakha merged commit 22c755d into ray-project:master Oct 18, 2025
5 of 6 checks passed
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
…ss/postprocess (ray-project#57826)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
…ss/postprocess (ray-project#57826)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
…ss/postprocess (#57826)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ss/postprocess (ray-project#57826)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ss/postprocess (ray-project#57826)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…ss/postprocess (ray-project#57826)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>