Skip to content

Conversation

@bveeramani
Copy link
Member

@bveeramani bveeramani commented Oct 16, 2025

Description

This PR updates the Download logical interface to handle multiple URI columns internally. It adds the necessary wiring for multi-URI downloads, but does not expose any new functionality to users yet.

Related issues

Types of change

  • Bug fix 🐛
  • New feature ✨
  • Enhancement 🚀
  • Code refactoring 🔧
  • Documentation update 📖
  • Chore 🧹
  • Style 🎨

Checklist

Does this PR introduce breaking changes?

  • Yes ⚠️
  • No

Testing:

  • Added/updated tests for my changes
  • Tested the changes manually
  • This PR is not tested ❌ (please explain why)

Code Quality:

  • Signed off every commit (git commit -s)
  • Ran pre-commit hooks (setup guide)

Documentation:

  • Updated documentation (if applicable) (contribution guide)
  • Added new APIs to doc/source/ (if applicable)

Additional context

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani requested a review from a team as a code owner October 16, 2025 01:24
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request successfully adds support for downloading from multiple URI columns in a single Download operation. The changes to the logical and physical operators are well-structured. I've identified a potential ZeroDivisionError in the partitioning logic that should be addressed to prevent runtime failures. Additionally, I've included a suggestion to improve download performance by parallelizing downloads across all URI columns, rather than processing them sequentially, which could be a valuable future enhancement.

Comment on lines +173 to +176
# Download each URI column and add it to the output block
for uri_column_name, output_bytes_column_name in zip(
uri_column_names, output_bytes_column_names
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation downloads files for each URI column sequentially. While downloads within a single column are concurrent, the processing of different URI columns happens one after another. This might not be the most efficient approach, especially if there are many URI columns.

For better performance, consider downloading files from all URI columns concurrently. This would involve collecting all URIs from all columns, downloading them in a single make_async_gen call, and then reconstructing the output columns. This would better utilize the available download workers and align with the PR's goal to 'efficiently handle multiple URI columns simultaneously'.

cursor[bot]

This comment was marked as outdated.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@ray-gardener ray-gardener bot added the data Ray Data-related issues label Oct 16, 2025
@bveeramani bveeramani enabled auto-merge (squash) October 16, 2025 01:32
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Oct 16, 2025
target_nbytes_per_partition / avg_nbytes_per_row
)
return nrows_per_partition

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: PartitionActor Fails with Multiple URIs

The PartitionActor incorrectly estimates row sizes for multiple URI columns. The _sample_sizes method uses as_completed, which reorders results and filters out failures. This causes zip to incorrectly pair file sizes from different rows and truncate lists, leading to inaccurate partitioning and potential ZeroDivisionError.

Fix in Cursor Fix in Web

@bveeramani bveeramani merged commit d2affd2 into master Oct 16, 2025
6 of 8 checks passed
@bveeramani bveeramani deleted the multiple-uris branch October 16, 2025 02:12
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
<!-- Thank you for contributing to Ray! 🚀 -->
<!-- Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->
<!-- 💡 Tip: Mark as draft if you want early feedback, or ready for
review when it's complete -->

## Description

<!-- Briefly describe what this PR accomplishes and why it's needed -->

Add support for downloading from multiple URI columns in a single
Download operation. This enhancement extends Ray Data's download
functionality to efficiently
handle multiple URI columns simultaneously, reducing the number of
operations needed when working with datasets that contain multiple file
references.
## Related issues

<!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234" -->

## Types of change

- [ ] Bug fix 🐛
- [ ] New feature ✨
- [ ] Enhancement 🚀
- [x] Code refactoring 🔧
- [ ] Documentation update 📖
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [x] No
<!-- If yes, describe what breaks and how users should migrate -->

**Testing:**
- [ ] Added/updated tests for my changes
- [x] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_

**Code Quality:**
- [ ] Signed off every commit (`git commit -s`)
- [x] Ran pre-commit hooks ([setup
guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

**Documentation:**
- [ ] Updated documentation (if applicable) ([contribution
guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)

## Additional context

<!-- Optional: Add screenshots, examples, performance impact, breaking
change details -->

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
<!-- Thank you for contributing to Ray! 🚀 -->
<!-- Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->
<!-- 💡 Tip: Mark as draft if you want early feedback, or ready for
review when it's complete -->

## Description

<!-- Briefly describe what this PR accomplishes and why it's needed -->

Add support for downloading from multiple URI columns in a single
Download operation. This enhancement extends Ray Data's download
functionality to efficiently
handle multiple URI columns simultaneously, reducing the number of
operations needed when working with datasets that contain multiple file
references.
## Related issues

<!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234" -->

## Types of change

- [ ] Bug fix 🐛
- [ ] New feature ✨
- [ ] Enhancement 🚀
- [x] Code refactoring 🔧
- [ ] Documentation update 📖
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [x] No
<!-- If yes, describe what breaks and how users should migrate -->

**Testing:**
- [ ] Added/updated tests for my changes
- [x] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_

**Code Quality:**
- [ ] Signed off every commit (`git commit -s`)
- [x] Ran pre-commit hooks ([setup
guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

**Documentation:**
- [ ] Updated documentation (if applicable) ([contribution
guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)

## Additional context

<!-- Optional: Add screenshots, examples, performance impact, breaking
change details -->

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: xgui <xgui@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
<!-- Thank you for contributing to Ray! 🚀 -->
<!-- Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->
<!-- 💡 Tip: Mark as draft if you want early feedback, or ready for
review when it's complete -->

## Description

<!-- Briefly describe what this PR accomplishes and why it's needed -->

Add support for downloading from multiple URI columns in a single
Download operation. This enhancement extends Ray Data's download
functionality to efficiently
handle multiple URI columns simultaneously, reducing the number of
operations needed when working with datasets that contain multiple file
references.
## Related issues

<!-- Link related issues: "Fixes #1234", "Closes #1234", or "Related to
#1234" -->

## Types of change

- [ ] Bug fix 🐛
- [ ] New feature ✨
- [ ] Enhancement 🚀
- [x] Code refactoring 🔧
- [ ] Documentation update 📖
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [x] No
<!-- If yes, describe what breaks and how users should migrate -->

**Testing:**
- [ ] Added/updated tests for my changes
- [x] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_

**Code Quality:**
- [ ] Signed off every commit (`git commit -s`)
- [x] Ran pre-commit hooks ([setup
guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

**Documentation:**
- [ ] Updated documentation (if applicable) ([contribution
guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)

## Additional context

<!-- Optional: Add screenshots, examples, performance impact, breaking
change details -->

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
<!-- Thank you for contributing to Ray! 🚀 -->
<!-- Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->
<!-- 💡 Tip: Mark as draft if you want early feedback, or ready for
review when it's complete -->

## Description

<!-- Briefly describe what this PR accomplishes and why it's needed -->

Add support for downloading from multiple URI columns in a single
Download operation. This enhancement extends Ray Data's download
functionality to efficiently
handle multiple URI columns simultaneously, reducing the number of
operations needed when working with datasets that contain multiple file
references.
## Related issues

<!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234" -->

## Types of change

- [ ] Bug fix 🐛
- [ ] New feature ✨
- [ ] Enhancement 🚀
- [x] Code refactoring 🔧
- [ ] Documentation update 📖
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [x] No
<!-- If yes, describe what breaks and how users should migrate -->

**Testing:**
- [ ] Added/updated tests for my changes
- [x] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_

**Code Quality:**
- [ ] Signed off every commit (`git commit -s`)
- [x] Ran pre-commit hooks ([setup
guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

**Documentation:**
- [ ] Updated documentation (if applicable) ([contribution
guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)

## Additional context

<!-- Optional: Add screenshots, examples, performance impact, breaking
change details -->

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
<!-- Thank you for contributing to Ray! 🚀 -->
<!-- Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->
<!-- 💡 Tip: Mark as draft if you want early feedback, or ready for
review when it's complete -->

## Description

<!-- Briefly describe what this PR accomplishes and why it's needed -->

Add support for downloading from multiple URI columns in a single
Download operation. This enhancement extends Ray Data's download
functionality to efficiently
handle multiple URI columns simultaneously, reducing the number of
operations needed when working with datasets that contain multiple file
references.
## Related issues

<!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234" -->

## Types of change

- [ ] Bug fix 🐛
- [ ] New feature ✨
- [ ] Enhancement 🚀
- [x] Code refactoring 🔧
- [ ] Documentation update 📖
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [x] No
<!-- If yes, describe what breaks and how users should migrate -->

**Testing:**
- [ ] Added/updated tests for my changes
- [x] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_

**Code Quality:**
- [ ] Signed off every commit (`git commit -s`)
- [x] Ran pre-commit hooks ([setup
guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

**Documentation:**
- [ ] Updated documentation (if applicable) ([contribution
guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)

## Additional context

<!-- Optional: Add screenshots, examples, performance impact, breaking
change details -->

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
<!-- Thank you for contributing to Ray! 🚀 -->
<!-- Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->
<!-- 💡 Tip: Mark as draft if you want early feedback, or ready for
review when it's complete -->

## Description

<!-- Briefly describe what this PR accomplishes and why it's needed -->

Add support for downloading from multiple URI columns in a single
Download operation. This enhancement extends Ray Data's download
functionality to efficiently
handle multiple URI columns simultaneously, reducing the number of
operations needed when working with datasets that contain multiple file
references.
## Related issues

<!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to
ray-project#1234" -->

## Types of change

- [ ] Bug fix 🐛
- [ ] New feature ✨
- [ ] Enhancement 🚀
- [x] Code refactoring 🔧
- [ ] Documentation update 📖
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [x] No
<!-- If yes, describe what breaks and how users should migrate -->

**Testing:**
- [ ] Added/updated tests for my changes
- [x] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_

**Code Quality:**
- [ ] Signed off every commit (`git commit -s`)
- [x] Ran pre-commit hooks ([setup
guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

**Documentation:**
- [ ] Updated documentation (if applicable) ([contribution
guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)

## Additional context

<!-- Optional: Add screenshots, examples, performance impact, breaking
change details -->

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants