@bveeramani bveeramani commented Oct 21, 2025

Summary

This PR updates the document embedding benchmark to use the canonical Ray Data implementation pattern, following best practices for the framework.

Key Changes

Use download() expression instead of separate materialization

Before:

file_paths = (
    ray.data.read_parquet(INPUT_PATH)
    .filter(lambda row: row["file_name"].endswith(".pdf"))
    .take_all()
)
file_paths = [row["uploaded_pdf_path"] for row in file_paths]
ds = ray.data.read_binary_files(file_paths, include_paths=True)

After:

from ray.data.expressions import download

(
    ray.data.read_parquet(INPUT_PATH)
    .filter(lambda row: row["file_name"].endswith(".pdf"))
    .with_column("bytes", download("uploaded_pdf_path"))
    # ... the remaining pipeline stages are chained here
)

This change:

  • Eliminates the intermediate materialization with take_all(), which pulls every matching row into driver memory before any file is read
  • Uses the download() expression to lazily fetch file contents as part of the pipeline
  • Removes the need for a separate read_binary_files() call
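The difference between the two shapes can be sketched in plain Python. This is only an analogy, not Ray Data code: `fetch`, `eager`, `lazy`, and `ROWS` are hypothetical names introduced for illustration, and the generator stands in for a streaming pipeline stage.

```python
from typing import Iterable, Iterator

# Hypothetical sample rows using the column names from this PR.
ROWS = [
    {"file_name": "a.pdf", "uploaded_pdf_path": "bucket/a.pdf"},
    {"file_name": "notes.txt", "uploaded_pdf_path": "bucket/notes.txt"},
    {"file_name": "b.pdf", "uploaded_pdf_path": "bucket/b.pdf"},
]


def fetch(path: str) -> bytes:
    """Hypothetical stand-in for downloading a file's contents."""
    return f"contents of {path}".encode()


def eager(rows: Iterable[dict]) -> list:
    # Materialize every matching row up front (the role take_all() played).
    matching = [row for row in rows if row["file_name"].endswith(".pdf")]
    # Build a separate list of paths, then fetch them in a second pass.
    paths = [row["uploaded_pdf_path"] for row in matching]
    return [{"uploaded_pdf_path": p, "bytes": fetch(p)} for p in paths]


def lazy(rows: Iterable[dict]) -> Iterator[dict]:
    # Filter and fetch one row at a time; no intermediate list is built.
    for row in rows:
        if row["file_name"].endswith(".pdf"):
            yield {**row, "bytes": fetch(row["uploaded_pdf_path"])}
```

Both functions produce the same bytes for the same input; the lazy version simply never holds the full list of paths at once.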

Method chaining for cleaner code

All operations are now chained in a single pipeline, making the data flow clearer and more idiomatic.

Consistent column naming

Updated references from path to uploaded_pdf_path throughout the code for consistency with the source data schema.
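As a rough illustration of what the rename means for downstream code, a per-row step would now read `uploaded_pdf_path` and `bytes` rather than `path`. The function below is a hypothetical sketch, not code from the benchmark; it assumes each row is a dict carrying those two columns.

```python
def summarize_pdf_row(row: dict) -> dict:
    """Hypothetical per-row step that reads the columns named in this PR."""
    data = row["bytes"]
    return {
        "uploaded_pdf_path": row["uploaded_pdf_path"],
        "num_bytes": len(data),
        # PDF files begin with the "%PDF-" header.
        "looks_like_pdf": data.startswith(b"%PDF-"),
    }
```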

…Data API

This change modernizes the document embedding benchmark to follow Ray Data best practices and use the canonical API pattern.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani changed the title from "[Benchmark] Update document embedding benchmark to use canonical Ray Data API" to "[Data] Update document embedding benchmark to use canonical Ray Data API" Oct 21, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a great improvement, refactoring the document embedding benchmark to use the idiomatic Ray Data API. The switch from take_all() and read_binary_files to the lazy download() expression is a significant enhancement for performance and memory efficiency, especially for large datasets. The code is now much cleaner and easier to follow with the single chained pipeline.

I have a suggestion to further improve the code's adherence to Ray Data best practices by using a col expression for filtering instead of a lambda function. This can make the code more declarative and potentially unlock further optimizations.

import pymupdf
import ray
import ray.data
from ray.data.expressions import download
medium

To make the code even more idiomatic and potentially more performant through expression optimization, consider importing col here. It can be used to replace the lambda in the filter operation below.

Suggested change
- from ray.data.expressions import download
+ from ray.data.expressions import col, download

- file_paths = (
+ (
      ray.data.read_parquet(INPUT_PATH)
      .filter(lambda row: row["file_name"].endswith(".pdf"))
medium

Using a col expression is more idiomatic in Ray Data than using a lambda for simple filtering operations. This declarative style can also allow Ray Data to perform more optimizations on the execution plan.

Suggested change
- .filter(lambda row: row["file_name"].endswith(".pdf"))
+ .filter(col("file_name").str.endswith(".pdf"))
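Why a declarative predicate can be easier to optimize than a lambda can be sketched with a toy model. Nothing here is Ray Data's expression API: `EndsWith` and `required_columns` are hypothetical names, and the "planner" is reduced to a single function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EndsWith:
    """Toy declarative predicate: the column's value ends with the suffix."""
    column: str
    suffix: str

    def evaluate(self, row: dict) -> bool:
        return row[self.column].endswith(self.suffix)


def required_columns(predicate) -> Optional[set]:
    # A planner can read the referenced column off a declarative predicate...
    if isinstance(predicate, EndsWith):
        return {predicate.column}
    # ...but an arbitrary callable is opaque, so nothing can be inferred.
    return None
```

In this toy, a planner that knows only `file_name` is needed could skip reading other columns before the filter runs, while a lambda gives it no such information.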


@omatthew98 omatthew98 left a comment

🔥

@bveeramani bveeramani enabled auto-merge (squash) October 22, 2025 00:01
@github-actions github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Oct 22, 2025
@bveeramani bveeramani merged commit b24efee into master Oct 22, 2025
6 of 7 checks passed
@bveeramani bveeramani deleted the benchmark-update-ray-data-canonical branch October 22, 2025 00:28
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
…API (#57977)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…API (ray-project#57977)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…API (ray-project#57977)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>