[Datasets] Only enable blocks bundling when batch_size is set #29971
Conversation
Signed-off-by: Cheng Su <scnju13@gmail.com>
python/ray/data/context.py (Outdated)
@@ -82,7 +82,7 @@
 OK_PREFIX = "✔️ "

 # Default batch size for batch transformations.
-DEFAULT_BATCH_SIZE = 4096
+DEFAULT_BATCH_SIZE = "4096"
Since the context-level default being a string that's eventually int-casted is a bit hacky, could we keep this as the `4096` int and have a `"default"` sentinel as the default arg in the `.map_batches()` signature, where we then do the check within the function?
def map_batches(
    ...,
    batch_size="default",
    ...,
):
    ...
    if batch_size == "default":
        batch_size = DEFAULT_BATCH_SIZE
    elif batch_size is not None:
        if batch_size < 1:
            raise ValueError("Batch size cannot be negative or 0")
        # Enable blocks bundling when batch_size is specified by caller.
        target_block_size = batch_size
@clarkzinzow - yeah makes sense, I will make the change. Thanks.
Could we also update the message about block bundling to reflect the new behavior? Before:
Suggested after:
I think decreasing read parallelism is probably secondary to modifying the batch size. Also, should we recommend a way to check whether the bundling happened in the message? If there is a specific line in `Dataset.stats()` to point the user to, that would be great.
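(A minimal sketch, not part of this PR, of how a user might check whether bundling happened; it assumes `ray.data.range` and `Dataset.stats()`, and the exact stats text is not reproduced here.)

```python
import ray

ds = ray.data.range(100_000, parallelism=64)               # 64 small input blocks
ds = ds.map_batches(lambda batch: batch, batch_size=4096)  # explicit batch_size -> bundling enabled
# If the map_batches stage reports fewer tasks than the 64 input blocks,
# the small blocks were bundled together to reach the requested batch size.
print(ds.stats())
```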
Signed-off-by: Cheng Su <scnju13@gmail.com>
python/ray/data/_internal/compute.py (Outdated)
f"`batch_size` is set to {target_size}, which reduces parallelism from "
f"{len(blocks)} to {len(block_bundles)}. If the performance is worse than "
"expected, this may indicate that batch size is too large or input block "
"size is too small. To reduce batch size, consider to decrease "
"`batch_size` or use the default in map_batches. To increase input block "
"size, consider to decrease `parallelism` in read."
Thanks @stephanie-wang for the suggestion. Please double-check the error message here.
Nice, really like that you added the before and after parallelism! Suggested some minor grammar changes.
python/ray/data/_internal/compute.py (Outdated)
"expected, this may indicate that batch size is too large or input block "
"size is too small. To reduce batch size, consider to decrease "
"`batch_size` or use the default in map_batches. To increase input block "
"size, consider to decrease `parallelism` in read."
"expected, this may indicate that batch size is too large or input block " | |
"size is too small. To reduce batch size, consider to decrease " | |
"`batch_size` or use the default in map_batches. To increase input block " | |
"size, consider to decrease `parallelism` in read." | |
"expected, this may indicate that the batch size is too large or the input block " | |
"size is too small. To reduce batch size, consider decreasing " | |
"`batch_size` or use the default in `map_batches`. To increase input block " | |
"size, consider decreasing `parallelism` in read." |
@stephanie-wang - updated.
@@ -323,7 +323,7 @@ def map_batches(
     self,
     fn: BatchUDF,
     *,
-    batch_size: Optional[int] = DEFAULT_BATCH_SIZE,
+    batch_size: Optional[Union[int, Literal["default"]]] = "default",
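(A minimal sketch, not from the PR, of the three `batch_size` modes the new signature implies; `ds` and `fn` are placeholders.)

```python
ds.map_batches(fn)                    # "default": batches of up to 4096 rows, blocks are NOT bundled
ds.map_batches(fn, batch_size=8192)   # explicit int: small blocks are bundled toward ~8192 rows
ds.map_batches(fn, batch_size=None)   # None: each batch is an entire input block
```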
Can you update the documentation of `batch_size` below as well?
@jianoaix - could you suggest the specific documentation to be updated? It already mentions `Defaults to 4096`, which feels quite clear for end users.
So this introduced a new default value, "default", right? We need to document what that does to the semantics.
Thought `Defaults to 4096` was clear, but updated anyway. Could you help double-check?
Same suggestions as Stephanie; after those are added, this LGTM!
Note to myself: I'll need to remember to tweak the `batch_size` section of the "Transforming Datasets" feature guide that I added in this PR: #29117
Signed-off-by: Cheng Su <scnju13@gmail.com>
…ch `Dataset` (#30960) Signed-off-by: amogkam amogkamsetty@yahoo.com In #29971, we disabled block coalescing by default and changed the default batch_size value for map_batches. However, this same logic did not get carried over to DatasetPipeline, meaning DatasetPipeline.map_batches has block coalescing on by default.
…oject#29971) Before this PR, we always enabled blocks bundling in map_batches, to bundle small blocks together for the given batch_size. This is good for batch prediction on GPU, but not good for CPU preprocessing with the default batch size (4096, which is too large). So here we decided to disable blocks bundling by default, and only enable blocks bundling when the user specifies batch_size. Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
… with default batch_size (#47433) When batch_size is not set, input blocks will not be bundled up. Add a comment explaining this. See #29971 and #47363 (comment) Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Cheng Su scnju13@gmail.com
Why are these changes needed?
Before this PR, we always enabled blocks bundling in `map_batches` to bundle small blocks together for the given `batch_size`. This is good for batch prediction on GPU, but not good for CPU preprocessing with the default batch size (4096, which is too large). So here we decided to disable blocks bundling by default, and only enable it when the user specifies `batch_size`. See this doc for the full discussion.

Related issue number
Checks
I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
I've run `scripts/format.sh` to lint the changes in this PR.