-
Notifications
You must be signed in to change notification settings - Fork 7.1k
[Data] Add wiring for multiple download URIs #57775
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request successfully adds support for downloading from multiple URI columns in a single Download operation. The changes to the logical and physical operators are well-structured. I've identified a potential ZeroDivisionError in the partitioning logic that should be addressed to prevent runtime failures. Additionally, I've included a suggestion to improve download performance by parallelizing downloads across all URI columns, rather than processing them sequentially, which could be a valuable future enhancement.
| # Download each URI column and add it to the output block | ||
| for uri_column_name, output_bytes_column_name in zip( | ||
| uri_column_names, output_bytes_column_names | ||
| ): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation downloads files for each URI column sequentially. While downloads within a single column are concurrent, the processing of different URI columns happens one after another. This might not be the most efficient approach, especially if there are many URI columns.
For better performance, consider downloading files from all URI columns concurrently. This would involve collecting all URIs from all columns, downloading them in a single make_async_gen call, and then reconstructing the output columns. This would better utilize the available download workers and align with the PR's goal to 'efficiently handle multiple URI columns simultaneously'.
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
| target_nbytes_per_partition / avg_nbytes_per_row | ||
| ) | ||
| return nrows_per_partition | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: PartitionActor Fails with Multiple URIs
The PartitionActor incorrectly estimates row sizes for multiple URI columns. The _sample_sizes method uses as_completed, which reorders results and filters out failures. This causes zip to incorrectly pair file sizes from different rows and truncate lists, leading to inaccurate partitioning and potential ZeroDivisionError.
<!-- Thank you for contributing to Ray! 🚀 --> <!-- Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- 💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete --> ## Description <!-- Briefly describe what this PR accomplishes and why it's needed --> Add support for downloading from multiple URI columns in a single Download operation. This enhancement extends Ray Data's download functionality to efficiently handle multiple URI columns simultaneously, reducing the number of operations needed when working with datasets that contain multiple file references. ## Related issues <!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234" --> ## Types of change - [ ] Bug fix 🐛 - [ ] New feature ✨ - [ ] Enhancement 🚀 - [x] Code refactoring 🔧 - [ ] Documentation update 📖 - [ ] Chore 🧹 - [ ] Style 🎨 ## Checklist **Does this PR introduce breaking changes?** - [ ] Yes⚠️ - [x] No <!-- If yes, describe what breaks and how users should migrate --> **Testing:** - [ ] Added/updated tests for my changes - [x] Tested the changes manually - [ ] This PR is not tested ❌ _(please explain why)_ **Code Quality:** - [ ] Signed off every commit (`git commit -s`) - [x] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable) ## Additional context <!-- Optional: Add screenshots, examples, performance impact, breaking change details --> --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
<!-- Thank you for contributing to Ray! 🚀 --> <!-- Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- 💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete --> ## Description <!-- Briefly describe what this PR accomplishes and why it's needed --> Add support for downloading from multiple URI columns in a single Download operation. This enhancement extends Ray Data's download functionality to efficiently handle multiple URI columns simultaneously, reducing the number of operations needed when working with datasets that contain multiple file references. ## Related issues <!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234" --> ## Types of change - [ ] Bug fix 🐛 - [ ] New feature ✨ - [ ] Enhancement 🚀 - [x] Code refactoring 🔧 - [ ] Documentation update 📖 - [ ] Chore 🧹 - [ ] Style 🎨 ## Checklist **Does this PR introduce breaking changes?** - [ ] Yes⚠️ - [x] No <!-- If yes, describe what breaks and how users should migrate --> **Testing:** - [ ] Added/updated tests for my changes - [x] Tested the changes manually - [ ] This PR is not tested ❌ _(please explain why)_ **Code Quality:** - [ ] Signed off every commit (`git commit -s`) - [x] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable) ## Additional context <!-- Optional: Add screenshots, examples, performance impact, breaking change details --> --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: xgui <xgui@anyscale.com>
<!-- Thank you for contributing to Ray! 🚀 --> <!-- Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- 💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete --> ## Description <!-- Briefly describe what this PR accomplishes and why it's needed --> Add support for downloading from multiple URI columns in a single Download operation. This enhancement extends Ray Data's download functionality to efficiently handle multiple URI columns simultaneously, reducing the number of operations needed when working with datasets that contain multiple file references. ## Related issues <!-- Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234" --> ## Types of change - [ ] Bug fix 🐛 - [ ] New feature ✨ - [ ] Enhancement 🚀 - [x] Code refactoring 🔧 - [ ] Documentation update 📖 - [ ] Chore 🧹 - [ ] Style 🎨 ## Checklist **Does this PR introduce breaking changes?** - [ ] Yes⚠️ - [x] No <!-- If yes, describe what breaks and how users should migrate --> **Testing:** - [ ] Added/updated tests for my changes - [x] Tested the changes manually - [ ] This PR is not tested ❌ _(please explain why)_ **Code Quality:** - [ ] Signed off every commit (`git commit -s`) - [x] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable) ## Additional context <!-- Optional: Add screenshots, examples, performance impact, breaking change details --> --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
<!-- Thank you for contributing to Ray! 🚀 --> <!-- Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- 💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete --> ## Description <!-- Briefly describe what this PR accomplishes and why it's needed --> Add support for downloading from multiple URI columns in a single Download operation. This enhancement extends Ray Data's download functionality to efficiently handle multiple URI columns simultaneously, reducing the number of operations needed when working with datasets that contain multiple file references. ## Related issues <!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234" --> ## Types of change - [ ] Bug fix 🐛 - [ ] New feature ✨ - [ ] Enhancement 🚀 - [x] Code refactoring 🔧 - [ ] Documentation update 📖 - [ ] Chore 🧹 - [ ] Style 🎨 ## Checklist **Does this PR introduce breaking changes?** - [ ] Yes⚠️ - [x] No <!-- If yes, describe what breaks and how users should migrate --> **Testing:** - [ ] Added/updated tests for my changes - [x] Tested the changes manually - [ ] This PR is not tested ❌ _(please explain why)_ **Code Quality:** - [ ] Signed off every commit (`git commit -s`) - [x] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable) ## Additional context <!-- Optional: Add screenshots, examples, performance impact, breaking change details --> --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
<!-- Thank you for contributing to Ray! 🚀 --> <!-- Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- 💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete --> ## Description <!-- Briefly describe what this PR accomplishes and why it's needed --> Add support for downloading from multiple URI columns in a single Download operation. This enhancement extends Ray Data's download functionality to efficiently handle multiple URI columns simultaneously, reducing the number of operations needed when working with datasets that contain multiple file references. ## Related issues <!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234" --> ## Types of change - [ ] Bug fix 🐛 - [ ] New feature ✨ - [ ] Enhancement 🚀 - [x] Code refactoring 🔧 - [ ] Documentation update 📖 - [ ] Chore 🧹 - [ ] Style 🎨 ## Checklist **Does this PR introduce breaking changes?** - [ ] Yes⚠️ - [x] No <!-- If yes, describe what breaks and how users should migrate --> **Testing:** - [ ] Added/updated tests for my changes - [x] Tested the changes manually - [ ] This PR is not tested ❌ _(please explain why)_ **Code Quality:** - [ ] Signed off every commit (`git commit -s`) - [x] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable) ## Additional context <!-- Optional: Add screenshots, examples, performance impact, breaking change details --> --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
<!-- Thank you for contributing to Ray! 🚀 --> <!-- Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. --> <!-- 💡 Tip: Mark as draft if you want early feedback, or ready for review when it's complete --> ## Description <!-- Briefly describe what this PR accomplishes and why it's needed --> Add support for downloading from multiple URI columns in a single Download operation. This enhancement extends Ray Data's download functionality to efficiently handle multiple URI columns simultaneously, reducing the number of operations needed when working with datasets that contain multiple file references. ## Related issues <!-- Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234" --> ## Types of change - [ ] Bug fix 🐛 - [ ] New feature ✨ - [ ] Enhancement 🚀 - [x] Code refactoring 🔧 - [ ] Documentation update 📖 - [ ] Chore 🧹 - [ ] Style 🎨 ## Checklist **Does this PR introduce breaking changes?** - [ ] Yes⚠️ - [x] No <!-- If yes, describe what breaks and how users should migrate --> **Testing:** - [ ] Added/updated tests for my changes - [x] Tested the changes manually - [ ] This PR is not tested ❌ _(please explain why)_ **Code Quality:** - [ ] Signed off every commit (`git commit -s`) - [x] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable) ## Additional context <!-- Optional: Add screenshots, examples, performance impact, breaking change details --> --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: Future-Outlier <eric901201@gmail.com>
Description
This PR updates the Download logical interface to handle multiple URI columns internally. It adds the necessary wiring for multi-URI downloads, but does not expose any new functionality to users yet.
Related issues
Types of change
Checklist
Does this PR introduce breaking changes?
Testing:
Code Quality:
git commit -s)Documentation:
doc/source/(if applicable)Additional context