@bveeramani
Member

@bveeramani bveeramani commented Oct 22, 2025

Summary

Add a repartition call with target_num_rows_per_block=BATCH_SIZE to the audio transcription benchmark. This ensures blocks are appropriately sized to:

  • Prevent out-of-memory (OOM) errors
  • Ensure individual tasks don't take too long to complete

Changes

  • Added ds = ds.repartition(target_num_rows_per_block=BATCH_SIZE) after reading the parquet file in ray_data_main.py:98
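To illustrate the effect on block sizes, here is a minimal pure-Python sketch. It is not Ray Data's implementation and does not call Ray; `split_oversized_blocks` and the example block sizes are hypothetical, and it assumes the simplest possible behavior, where any block with more than the target number of rows is cut into chunks of at most that many rows. Ray's actual streaming repartition may size blocks differently depending on the version.

```python
from typing import List, Sequence

BATCH_SIZE = 64  # value given for BATCH_SIZE in this PR's commit message


def split_oversized_blocks(
    blocks: Sequence[List[int]], target_rows: int
) -> List[List[int]]:
    """Cut every block into chunks of at most `target_rows` rows.

    Hypothetical helper for illustration only; row order is preserved
    and blocks already at or below the target are passed through whole.
    """
    if target_rows <= 0:
        raise ValueError("target_rows must be positive")
    resized: List[List[int]] = []
    for block in blocks:
        for start in range(0, len(block), target_rows):
            resized.append(block[start:start + target_rows])
    return resized


# Three uneven "blocks" standing in for what a Parquet read might produce.
# The sizes (200, 90, 10) are made up for the example.
rows = list(range(300))
uneven_blocks = [rows[:200], rows[200:290], rows[290:]]

resized_blocks = split_oversized_blocks(uneven_blocks, BATCH_SIZE)
block_sizes = [len(block) for block in resized_blocks]
print(block_sizes)
```

After the split, no block holds more than BATCH_SIZE rows, so the peak memory and runtime of any single downstream task are bounded by one batch's worth of rows rather than by whatever block size the read happened to produce.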

This change adds a repartition call with target_num_rows_per_block set to
BATCH_SIZE (64) so that blocks are appropriately sized. This prevents
out-of-memory errors and keeps individual tasks from running too long.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani enabled auto-merge (squash) October 22, 2025 20:40
@bveeramani bveeramani changed the title Add repartition to audio transcription benchmark for proper block sizing [Data] Add repartition to audio transcription benchmark for proper block sizing Oct 22, 2025
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Oct 22, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a repartition operation to the audio transcription benchmark. This is a sound and common practice in Ray Data pipelines to normalize block sizes after reading from a data source. By setting target_num_rows_per_block to BATCH_SIZE, the change ensures that downstream tasks, especially the GPU-intensive ones, receive data in consistently sized chunks. This helps prevent out-of-memory errors and long-running tasks that can occur with variably sized blocks from read_parquet. The use of a streaming repartition is efficient as it avoids a full data shuffle. The change is correct, well-placed, and should improve the stability and performance of the benchmark.

@bveeramani bveeramani merged commit d9c028e into master Oct 22, 2025
6 of 7 checks passed
@bveeramani bveeramani deleted the audio-transcription-repartition-fix branch October 22, 2025 21:04
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 27, 2025
…ock sizing (ray-project#58013)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: xgui <xgui@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ock sizing (ray-project#58013)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ock sizing (ray-project#58013)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>