[Data] Ordering of blocks after map and map_batches #50890

jakac · 2025-02-25T09:34:49Z

What happened + What you expected to happen

I have a dataset in files, rows are sorted within files and files are sorted by name in the same folder. After applying map or map_batches, the order of blocks is different than before applying map or map_batches.

Expected behavior: map and map_batches don't change the ordering of rows.

Versions / Dependencies

ray==2.42.1

Reproduction script

import time

import ray

ray.init()

dataset = ray.data.from_items(
        [
            {"time_to_sleep": 3},
            {"time_to_sleep": 2},
            {"time_to_sleep": 1},
        ],
    override_num_blocks=3
)

print(dataset.take_all())
# output: [{'time_to_sleep': 3}, {'time_to_sleep': 2}, {'time_to_sleep': 1}]

def map_simple(x):
    time.sleep(x['time_to_sleep'])
    return x

print(dataset.map(map_simple).take_all())
# output: [{'time_to_sleep': 1}, {'time_to_sleep': 2}, {'time_to_sleep': 3}]

def my_map_batches(x):
    time.sleep(x['time_to_sleep'][0])
    yield {'result': [x['time_to_sleep'][0]]}

mapped = dataset.map_batches(my_map_batches)

print(mapped.take_all())
# output: [{'result': 1}, {'result': 2}, {'result': 3}]

Issue Severity

High: It blocks me from completing my task.


jakac · 2025-02-25T09:37:06Z

Interestingly, it works as expected if I sort the dataset first:

dataset = dataset.sort("time_to_sleep", descending=True)

But it doesn't solve the issue, because a suitable sort key might not be present in the data.
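One way to picture a workaround when no natural sort key exists is to attach a synthetic position to each row before the transform and sort on it afterwards. The sketch below is plain Python, not Ray Data API: the `_position` field name is hypothetical, and the out-of-order arrival is simulated with a fixed permutation. How to attach such a column in Ray Data in a way that is itself order-safe is not covered in this thread.

```python
# Rows with no field that encodes their original order.
rows = [{"name": "c"}, {"name": "a"}, {"name": "b"}]

# Tag each row with its position (hypothetical "_position" field).
tagged = [{"_position": i, **row} for i, row in enumerate(rows)]

# Stand-in for blocks finishing out of order after a parallel transform.
arrived = [tagged[2], tagged[0], tagged[1]]

# Sort on the synthetic key, then drop it again.
restored = [
    {k: v for k, v in row.items() if k != "_position"}
    for row in sorted(arrived, key=lambda r: r["_position"])
]

print(restored)
```

This still pays for an explicit sort step, so it trades the same cost as sorting on a real key.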

richardliaw · 2025-02-25T16:53:52Z

You can try setting preserve_order to get what you're looking for -- https://docs.ray.io/en/latest/data/api/doc/ray.data.ExecutionOptions.html#ray.data.ExecutionOptions.preserve_order

richardliaw · 2025-02-25T16:54:13Z

Overall this shouldn't be default behavior as it has significant implications at scale

jakac · 2025-02-26T09:22:23Z

Indeed setting

ray.data.DataContext.get_current().execution_options.preserve_order = True

resolves the issue. Reducing severity to Low.

Now the issue is about documentation. I believe this behavior deserves a warning block somewhere visible. What do you think about adding it to https://docs.ray.io/en/latest/data/transforming-data.html, below the "Transformations are lazy" note?

richardliaw · 2025-02-27T18:59:26Z

How about just adding a section about "Ordering rows" in transforming data, and describing:

Generally, order is not preserved.
You can sort the dataset to get a deterministic order, but this adds an extra sort step.
You can also set the preserve_order flag mentioned above.

I don't think it needs to be a top-level warning block, since I think many other distributed systems have this behavior.
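The three points above can be pictured with a small standard-library analogy. This is a sketch, not Ray code: a `ThreadPoolExecutor` stands in for parallel tasks, `as_completed` plays the role of the default completion-order behavior, and `pool.map` plays the role of an order-preserving mode. The function name `slow_identity` and the shortened sleep times are illustrative assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_identity(row):
    # Slower rows finish later, as in the reproduction script.
    time.sleep(row["time_to_sleep"])
    return row


rows = [{"time_to_sleep": 0.3}, {"time_to_sleep": 0.2}, {"time_to_sleep": 0.1}]

with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(slow_identity, row) for row in rows]

    # 1. Consuming results as they finish loses the input order.
    completion_order = [f.result() for f in as_completed(futures)]

    # 2. An explicit sort restores a deterministic order at extra cost.
    sorted_afterwards = sorted(
        completion_order, key=lambda r: r["time_to_sleep"], reverse=True
    )

    # 3. Consuming results in submission order keeps the input order,
    #    but the consumer waits on the slowest early task.
    submission_order = list(pool.map(slow_identity, rows))

print(completion_order)   # fastest first
print(sorted_afterwards)  # same order as rows
print(submission_order)   # same order as rows
```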

TechShivvy · 2025-02-27T21:09:33Z

Is setting the preserve_order flag the only efficient solution, or is there another function that inherently preserves order instead of map_batches? Sorting the results afterward is one option, but since it requires an explicit step, I’d rather not consider that.

Also, you mentioned:

Overall, this shouldn’t be the default behavior as it has significant implications at scale.

Could you elaborate on what those implications are?
@richardliaw

Edit: In my case, I’m splitting sequential data into smaller parts using map_batches and running predictions on them. When I visualized the results, the predicted segments were clearly shuffled. So I want to maintain the order in the predictions.

Edit 2: I also tried handling the same task using ray.remote instead of map_batches for small sets of data. I manually split the batches, ran predictions, and used ray.get, which maintained the order correctly. However, I’m not sure about the trade-offs between ray.remote and map_batches. I checked with Chrome Tracing, and both seemed to take a similar amount of time. Given that I’m working with a large dataset, I’d like to understand the pros and cons of each approach.

TIA!

richardliaw · 2025-03-04T01:14:36Z

Is setting the preserve_order flag the only efficient solution, or is there another function that inherently preserves order instead of map_batches?

Setting the flag is recommended. I think in your case you won't see much difference if your workload is small. However if you have a ton of nodes (say, 20+) and a lot of stragglers, you can end up stalling the dataset processing if you enforce ordering.
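The straggler effect can be sketched with a deliberately simplified model. This is an assumption-laden illustration, not Ray's scheduler: each block has a finish time, and `emit_times` (a hypothetical helper) computes when each output becomes available downstream with and without an ordering constraint.

```python
from itertools import accumulate


def emit_times(finish_times, preserve_order):
    """Return the time at which each output becomes available downstream."""
    if preserve_order:
        # Block i cannot be emitted before every earlier block has finished,
        # so its emit time is the running maximum of the finish times.
        return list(accumulate(finish_times, max))
    # Without the constraint, outputs are emitted as soon as they finish.
    return sorted(finish_times)


# One straggler at the front, followed by three fast blocks.
finish_times = [10, 1, 1, 1]

unordered = emit_times(finish_times, preserve_order=False)
ordered = emit_times(finish_times, preserve_order=True)

print(unordered)  # [1, 1, 1, 10]
print(ordered)    # [10, 10, 10, 10]
```

In this model the total completion time is the same either way; what changes is that downstream work sits idle until the straggler finishes. With few stragglers or a small workload the two results look almost identical, which is consistent with the suggestion to simply try the flag.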

TechShivvy · 2025-03-04T02:10:08Z

Thanks for the reply @richardliaw.

That's disappointing to hear, as my workload/data will be large. Should I conclude that ray.remote would be more suitable in my case, or is there an optimal saturation point for map_batches that can be achieved by adjusting batch size and concurrency settings? I'm basically trying to do timeseries forecasting.

richardliaw · 2025-03-04T05:34:45Z

I think it's worthwhile just trying the preserve_order option for now -- there's a reasonable chance it is not performance-impacting for your use case.

## Why are these changes needed?

Adds a section about impact of operators on ordering of rows to docs.

## Related issue number

Closes #50890

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(
  - [x] Built docs locally and verified the format and the links

---------

Signed-off-by: jakac <matej.jakimov@gmail.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Abrar Sheikh <abrar@anyscale.com>


jakac added the bug and triage labels Feb 25, 2025

jakac changed the title ~~[Data] Ordering of blocks and map_batches~~ [Data] Ordering of blocks after map and map_batches Feb 25, 2025

gvspraveen removed the triage label Feb 25, 2025

jakac mentioned this issue Feb 28, 2025

[Docs][Data] Ordering of rows #50986

Merged

9 tasks

richardliaw closed this as completed in #50986 Mar 5, 2025

richardliaw closed this as completed in a524ef0 Mar 5, 2025
