[Data] Update document embedding benchmark to use canonical Ray Data API #57977

bveeramani · 2025-10-21T23:58:48Z

Summary

This PR updates the document embedding benchmark to use the canonical Ray Data implementation pattern, following best practices for the framework.

Key Changes

Use `download()` expression instead of separate materialization

Before:

file_paths = (
    ray.data.read_parquet(INPUT_PATH)
    .filter(lambda row: row["file_name"].endswith(".pdf"))
    .take_all()
)
file_paths = [row["uploaded_pdf_path"] for row in file_paths]
ds = ray.data.read_binary_files(file_paths, include_paths=True)

After:

(
    ray.data.read_parquet(INPUT_PATH)
    .filter(lambda row: row["file_name"].endswith(".pdf"))
    .with_column("bytes", download("uploaded_pdf_path"))
    # ... remaining pipeline stages are elided in the PR description
)

This change:

- Eliminates the intermediate materialization with `take_all()`, which loads all data into memory
- Uses the `download()` expression to lazily fetch file contents as part of the pipeline
- Removes the need for a separate `read_binary_files()` call
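The difference between the two approaches can be illustrated without Ray. The sketch below is a toy model in plain Python, not Ray Data's implementation: `fetch`, `STORAGE`, and `FETCH_LOG` are hypothetical stand-ins, and the row dictionaries only mimic the `file_name` and `uploaded_pdf_path` columns. The eager version collects every matching path before fetching anything (analogous to `take_all()`), while the lazy version fetches each file only as the consumer pulls rows through the pipeline.

```python
from typing import Dict, Iterator, List

# Hypothetical in-memory "storage"; stands in for remote PDF files.
STORAGE = {
    "s3://bucket/a.pdf": b"%PDF-a",
    "s3://bucket/b.pdf": b"%PDF-b",
}

FETCH_LOG: List[str] = []  # records the order in which files are fetched


def fetch(path: str) -> bytes:
    """Hypothetical stand-in for downloading a file's contents."""
    FETCH_LOG.append(path)
    return STORAGE[path]


def rows() -> Iterator[Dict[str, str]]:
    """Yield metadata rows, mimicking the Parquet source."""
    yield {"file_name": "a.pdf", "uploaded_pdf_path": "s3://bucket/a.pdf"}
    yield {"file_name": "notes.txt", "uploaded_pdf_path": "s3://bucket/notes.txt"}
    yield {"file_name": "b.pdf", "uploaded_pdf_path": "s3://bucket/b.pdf"}


def eager_pipeline() -> List[Dict[str, object]]:
    # Materialize every matching path first (analogous to take_all()),
    # then fetch all files in a second, separate step.
    paths = [
        r["uploaded_pdf_path"] for r in rows() if r["file_name"].endswith(".pdf")
    ]
    return [{"uploaded_pdf_path": p, "bytes": fetch(p)} for p in paths]


def lazy_pipeline() -> Iterator[Dict[str, object]]:
    # Filter and fetch row by row; nothing runs until the result is iterated.
    for r in rows():
        if r["file_name"].endswith(".pdf"):
            yield {**r, "bytes": fetch(r["uploaded_pdf_path"])}
```

In this toy model, calling `lazy_pipeline()` fetches nothing until the first `next()`, and each further row triggers exactly one fetch, so only the rows currently in flight need to be held in memory.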

Method chaining for cleaner code

All operations are now chained in a single pipeline, which makes the data flow clearer and more idiomatic.

Consistent column naming

Updated references from `path` to `uploaded_pdf_path` throughout the code for consistency with the source data schema.


gemini-code-assist

Code Review

This pull request is a great improvement, refactoring the document embedding benchmark to use the idiomatic Ray Data API. The switch from take_all() and read_binary_files to the lazy download() expression is a significant enhancement for performance and memory efficiency, especially for large datasets. The code is now much cleaner and easier to follow with the single chained pipeline.

I have a suggestion to further improve the code's adherence to Ray Data best practices by using a col expression for filtering instead of a lambda function. This can make the code more declarative and potentially unlock further optimizations.

gemini-code-assist · 2025-10-22T00:00:31Z

release/nightly_tests/multimodal_inference_benchmarks/document_embedding/ray_data_main.py

 import pymupdf
 import ray
 import ray.data
+from ray.data.expressions import download


To make the code even more idiomatic and potentially more performant through expression optimization, consider importing col here. It can be used to replace the lambda in the filter operation below.

Suggested change

-from ray.data.expressions import download
+from ray.data.expressions import col, download

gemini-code-assist · 2025-10-22T00:00:32Z

release/nightly_tests/multimodal_inference_benchmarks/document_embedding/ray_data_main.py

-file_paths = (
+(
    ray.data.read_parquet(INPUT_PATH)
    .filter(lambda row: row["file_name"].endswith(".pdf"))


Using a col expression is more idiomatic in Ray Data than using a lambda for simple filtering operations. This declarative style can also allow Ray Data to perform more optimizations on the execution plan.

Suggested change

-    .filter(lambda row: row["file_name"].endswith(".pdf"))
+    .filter(col("file_name").str.endswith(".pdf"))
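One way to see why a declarative expression can be easier to optimize than a lambda: an expression is a data structure that a planner can inspect, whereas a lambda is opaque. The sketch below is a toy model in plain Python; the `Col` and `EndsWith` classes and the `columns_needed` and `matches` helpers are hypothetical and are not Ray Data's expression classes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Union

Row = Dict[str, Any]


@dataclass(frozen=True)
class EndsWith:
    """Hypothetical predicate node: column value ends with a suffix."""

    column: str
    suffix: str

    def evaluate(self, row: Row) -> bool:
        return str(row[self.column]).endswith(self.suffix)


@dataclass(frozen=True)
class Col:
    """Hypothetical column reference used to build predicate nodes."""

    name: str

    def endswith(self, suffix: str) -> EndsWith:
        return EndsWith(self.name, suffix)


Predicate = Union[EndsWith, Callable[[Row], bool]]


def columns_needed(pred: Predicate) -> Optional[frozenset]:
    """Return the columns a predicate reads, or None if it is opaque."""
    if isinstance(pred, EndsWith):
        return frozenset({pred.column})  # inspectable: the column is known
    return None  # a lambda is a black box; it may read any column


def matches(pred: Predicate, row: Row) -> bool:
    """Evaluate either kind of predicate against one row."""
    return pred.evaluate(row) if isinstance(pred, EndsWith) else pred(row)
```

Both kinds of predicate select the same rows, but only the expression reveals that it reads `file_name` alone, which is the kind of information a planner could use for optimizations such as column pruning. Whether Ray Data applies a particular optimization to this filter is not stated in the review.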

omatthew98

🔥

…API (#57977) Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

…API (ray-project#57977) Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

…API (ray-project#57977) Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: Aydin Abiar <aydin@anyscale.com>

[Benchmark] Update document embedding benchmark to use canonical Ray Data API

2f41089

This change modernizes the document embedding benchmark to follow Ray Data best practices and use the canonical API pattern.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

bveeramani changed the title ~~[Benchmark] Update document embedding benchmark to use canonical Ray Data API~~ [Data] Update document embedding benchmark to use canonical Ray Data API Oct 21, 2025

bveeramani assigned omatthew98 Oct 21, 2025

gemini-code-assist bot reviewed Oct 22, 2025

omatthew98 approved these changes Oct 22, 2025

bveeramani enabled auto-merge (squash) October 22, 2025 00:01

github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Oct 22, 2025

bveeramani merged commit b24efee into master Oct 22, 2025
6 of 7 checks passed

bveeramani deleted the benchmark-update-ray-data-canonical branch October 22, 2025 00:28
