[Data] Add repartition to audio transcription benchmark for proper block sizing #58013

bveeramani · 2025-10-22T20:39:44Z

Summary

Add a `repartition` call with `target_num_rows_per_block=BATCH_SIZE` to the audio transcription benchmark. This ensures blocks are appropriately sized to:

- Prevent out-of-memory (OOM) errors
- Ensure individual tasks don't take too long to complete

Changes

- Added `ds = ds.repartition(target_num_rows_per_block=BATCH_SIZE)` after reading the parquet file in `ray_data_main.py:98`
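To illustrate what sizing blocks by row count means, here is a minimal pure-Python sketch of streaming re-chunking. It is not Ray Data's implementation: `rechunk_rows` and the list-of-rows block representation are hypothetical, and the exact block boundaries Ray produces may differ. The value 64 for `BATCH_SIZE` is the one stated in the commit message.

```python
from typing import Iterable, Iterator, List

BATCH_SIZE = 64  # value stated in the commit message


def rechunk_rows(blocks: Iterable[List[int]], target_rows: int) -> Iterator[List[int]]:
    """Yield blocks of at most `target_rows` rows, preserving row order.

    Hypothetical helper for illustration only. Rows are buffered as input
    blocks arrive, so no global shuffle of the data is required.
    """
    if target_rows <= 0:
        raise ValueError("target_rows must be positive")
    buffer: List[int] = []
    for block in blocks:
        buffer.extend(block)
        # Emit full-sized blocks as soon as enough rows have accumulated.
        while len(buffer) >= target_rows:
            yield buffer[:target_rows]
            buffer = buffer[target_rows:]
    if buffer:
        yield buffer  # final, possibly smaller, block


# Unevenly sized input blocks, as a file reader might produce (made-up sizes).
uneven = [list(range(0, 200)), list(range(200, 210)), list(range(210, 300))]
sizes = [len(block) for block in rechunk_rows(uneven, BATCH_SIZE)]
print(sizes)
```

In this sketch a 200-row block is split and a 10-row block is merged with its neighbours, so every output block except possibly the last holds exactly `BATCH_SIZE` rows.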

This change adds a repartition call with `target_num_rows_per_block` set to `BATCH_SIZE` (64) so that blocks are appropriately sized. This prevents out-of-memory errors and keeps individual tasks from running too long.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

gemini-code-assist

Code Review

This pull request introduces a repartition operation to the audio transcription benchmark. This is a sound and common practice in Ray Data pipelines to normalize block sizes after reading from a data source. By setting target_num_rows_per_block to BATCH_SIZE, the change ensures that downstream tasks, especially the GPU-intensive ones, receive data in consistently sized chunks. This helps prevent out-of-memory errors and long-running tasks that can occur with variably sized blocks from read_parquet. The use of a streaming repartition is efficient as it avoids a full data shuffle. The change is correct, well-placed, and should improve the stability and performance of the benchmark.
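The link between block size and task length can be sketched with a little arithmetic. Assume, purely for illustration, that one task processes one block and splits it into batches of `BATCH_SIZE` rows; the number of batches per task is then the block's row count divided by `BATCH_SIZE`, rounded up. The helper `batches_per_task` and the block sizes below are hypothetical, not measurements from the benchmark.

```python
import math

BATCH_SIZE = 64  # value stated in the commit message


def batches_per_task(rows_in_block: int, batch_size: int = BATCH_SIZE) -> int:
    """Number of batches a single task handles for one block (illustrative model)."""
    return math.ceil(rows_in_block / batch_size)


# Made-up block sizes: one already matching BATCH_SIZE, two much larger.
for rows in (64, 1_000, 50_000):
    print(f"{rows} rows -> {batches_per_task(rows)} batch(es) per task")
```

Under this simplified model, a block with `BATCH_SIZE` rows maps to a single batch per task, while a very large block keeps one task busy for many batches and holds more rows in memory at once.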

…ock sizing (ray-project#58013) Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: xgui <xgui@anyscale.com>

…ock sizing (ray-project#58013) Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

…ock sizing (ray-project#58013) Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: Aydin Abiar <aydin@anyscale.com>

bveeramani assigned omatthew98 Oct 22, 2025

bveeramani enabled auto-merge (squash) October 22, 2025 20:40

bveeramani changed the title ~~Add repartition to audio transcription benchmark for proper block sizing~~ [Data] Add repartition to audio transcription benchmark for proper block sizing Oct 22, 2025

github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Oct 22, 2025

omatthew98 approved these changes Oct 22, 2025

gemini-code-assist bot reviewed Oct 22, 2025

bveeramani merged commit d9c028e into master Oct 22, 2025
6 of 7 checks passed

bveeramani deleted the audio-transcription-repartition-fix branch October 22, 2025 21:04
