[Data] Ordering of blocks after map and map_batches #50890
Comments
Interestingly, it works as expected if I sort the dataset first:
But it doesn't solve the issue, because a sorting key might not be present in the data.
You can try setting preserve_order to get what you're looking for -- https://docs.ray.io/en/latest/data/api/doc/ray.data.ExecutionOptions.html#ray.data.ExecutionOptions.preserve_order |
Overall, preserving order shouldn't be the default behavior, as it has significant performance implications at scale.
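For reference, a minimal sketch of setting the flag mentioned above. It assumes the `DataContext` API described on the linked `ExecutionOptions` page; it is a configuration fragment, not the exact snippet used elsewhere in this thread.

```python
from ray.data import DataContext

# Assumption: the flag is set on the current data context before the
# dataset is executed, so that it applies to subsequent operators.
ctx = DataContext.get_current()
ctx.execution_options.preserve_order = True
```

The setting applies to the whole context rather than to a single `map` or `map_batches` call.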
Indeed setting
resolves the issue. Reducing severity to Low. Now the issue is about documentation, I believe this behavior deserves a warning block somewhere visible. What do you think about https://docs.ray.io/en/latest/data/transforming-data.html below the "Transformations are lazy" note? |
How about just adding a section about "Ordering rows" in transforming data, and describing:
I don't think it needs to be a top-level warning block, since I think many other distributed systems have this behavior.
Is setting the Also, you mentioned:
Could you elaborate on what those implications are?
Edit: In my case, I’m splitting sequential data into smaller parts using
Edit 2: I also tried handling the same task using
TIA!
Setting the flag is recommended. I think in your case you won't see much difference if your workload is small. However, if you have a lot of nodes (say, 20+) and a lot of stragglers, you can end up stalling the dataset processing if you enforce ordering.
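To illustrate the straggler effect described above, here is a toy model in plain Python. It is not Ray code: `emission_times` and the finish times are hypothetical, and the model assumes one task per block with all tasks starting at time 0. It sketches the general head-of-line blocking idea and says nothing about Ray's actual scheduler.

```python
def emission_times(finish_times, preserve_order):
    """Return (block_index, time) pairs in the order blocks reach downstream.

    finish_times[i] is when the task producing block i completes.
    """
    if not preserve_order:
        # Unordered: each block is emitted as soon as its own task finishes.
        return sorted(enumerate(finish_times), key=lambda pair: pair[1])

    # Ordered: block i can only be emitted once blocks 0..i have all finished,
    # so one slow early block delays every block behind it.
    emitted = []
    ready_at = 0.0
    for index, finished in enumerate(finish_times):
        ready_at = max(ready_at, finished)
        emitted.append((index, ready_at))
    return emitted


# Block 0 is a straggler; the others finish quickly.
times = [5.0, 1.0, 2.0, 1.5]
print(emission_times(times, preserve_order=False))
print(emission_times(times, preserve_order=True))
```

In the unordered case, the fast blocks flow downstream early and in a different order than the input. In the ordered case, nothing is emitted until the straggler finishes, which is the stall described above.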
Thanks for the reply @richardliaw. That's disappointing to hear, as my workload/data will be large. Should I conclude that |
I think it's worthwhile just trying the preserve_order option for now -- there's a reasonable chance it is not performance-impacting for your use case. |
## Why are these changes needed?

Adds a section about impact of operators on ordering of rows to docs.

## Related issue number

Closes #50890

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
   - [x] Built docs locally and verified the format and the links

---------

Signed-off-by: jakac <matej.jakimov@gmail.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Abrar Sheikh <abrar@anyscale.com>
What happened + What you expected to happen
I have a dataset stored in files. Rows are sorted within each file, and the files are sorted by name in the same folder. After applying `map` or `map_batches`, the order of blocks differs from the order before the transformation.

Expected behavior: `map` and `map_batches` don't change the ordering of rows.

Versions / Dependencies
ray==2.42.1
Reproduction script
Issue Severity
High: It blocks me from completing my task.