[Datasets] Only enable blocks bundling when batch_size is set #29971
Conversation
Signed-off-by: Cheng Su <scnju13@gmail.com>
python/ray/data/context.py (Outdated)
@@ -82,7 +82,7 @@
 OK_PREFIX = "✔️ "

 # Default batch size for batch transformations.
-DEFAULT_BATCH_SIZE = 4096
+DEFAULT_BATCH_SIZE = "4096"
Since the context-level default being a string that's eventually int-casted is a bit hacky, could we keep this as the `4096` int and have a `"default"` sentinel as the default arg in the `.map_batches()` signature, where we then do the check within the function?
def map_batches(
    ...,
    batch_size="default",
    ...,
):
    ...
    if batch_size == "default":
        batch_size = DEFAULT_BATCH_SIZE
    elif batch_size is not None:
        if batch_size < 1:
            raise ValueError("Batch size cannot be negative or 0")
        # Enable blocks bundling when batch_size is specified by caller.
        target_block_size = batch_size
@clarkzinzow - yeah makes sense, I will make the change. Thanks.
Could we also update the message about block bundling to reflect the new behavior? Before:
Suggested after:
I think decreasing read parallelism is probably secondary to modifying the batch size. Also, should we recommend a way to check whether the bundling happened in the message? If there is a specific line in `Dataset.stats()` to point the user to, that would be great.
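(A minimal sketch, not part of this PR, of how a user might check whether bundling happened; it assumes `ray.data.range` and `Dataset.stats()`, and the exact stats text is not reproduced here.)

```python
import ray

ds = ray.data.range(100_000, parallelism=64)               # 64 small input blocks
ds = ds.map_batches(lambda batch: batch, batch_size=4096)  # explicit batch_size -> bundling enabled
# If the map_batches stage reports fewer tasks than the 64 input blocks,
# the small blocks were bundled together to reach the requested batch size.
print(ds.stats())
```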
Signed-off-by: Cheng Su <scnju13@gmail.com>
python/ray/data/_internal/compute.py (Outdated)
f"`batch_size` is set to {target_size}, which reduces parallelism from "
f"{len(blocks)} to {len(block_bundles)}. If the performance is worse than "
"expected, this may indicate that batch size is too large or input block "
"size is too small. To reduce batch size, consider to decrease "
"`batch_size` or use the default in map_batches. To increase input block "
"size, consider to decrease `parallelism` in read."
Thanks @stephanie-wang for the suggestion. Please double-check the error message here.
Nice, really like that you added the before and after parallelism! Suggested some minor grammar changes.
python/ray/data/_internal/compute.py (Outdated)
"expected, this may indicate that batch size is too large or input block "
"size is too small. To reduce batch size, consider to decrease "
"`batch_size` or use the default in map_batches. To increase input block "
"size, consider to decrease `parallelism` in read."
"expected, this may indicate that batch size is too large or input block " | |
"size is too small. To reduce batch size, consider to decrease " | |
"`batch_size` or use the default in map_batches. To increase input block " | |
"size, consider to decrease `parallelism` in read." | |
"expected, this may indicate that the batch size is too large or the input block " | |
"size is too small. To reduce batch size, consider decreasing " | |
"`batch_size` or use the default in `map_batches`. To increase input block " | |
"size, consider decreasing `parallelism` in read." |
@stephanie-wang - updated.
@@ -323,7 +323,7 @@ def map_batches(
     self,
     fn: BatchUDF,
     *,
-    batch_size: Optional[int] = DEFAULT_BATCH_SIZE,
+    batch_size: Optional[Union[int, Literal["default"]]] = "default",
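(A minimal sketch, not from the PR, of the three `batch_size` modes the new signature implies; `ds` and `fn` are placeholders.)

```python
ds.map_batches(fn)                    # "default": batches of up to 4096 rows, blocks are NOT bundled
ds.map_batches(fn, batch_size=8192)   # explicit int: small blocks are bundled toward ~8192 rows
ds.map_batches(fn, batch_size=None)   # None: each batch is an entire input block
```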
Can you update the documentation of `batch_size` below as well?
@jianoaix - could you suggest the specific documentation to be updated? It already mentions `Defaults to 4096`, which feels quite clear for end users.
So this introduced a new default value, "default", right? We need to document what that does to the semantics.
Thought `Defaults to 4096` was clear, but updated anyway. Could you help double-check?
Same suggestions as Stephanie; after those are added, this LGTM!
Note to myself: I'll need to remember to tweak the `batch_size` section of the "Transforming Datasets" feature guide that I added in this PR: #29117
Signed-off-by: Cheng Su <scnju13@gmail.com>
…ch `Dataset` (#30960) Signed-off-by: amogkam amogkamsetty@yahoo.com In #29971, we disabled block coalescing by default and changed the default batch_size value for map_batches. However, this same logic did not get carried over to DatasetPipeline, meaning DatasetPipeline.map_batches has block coalescing on by default.
…oject#29971) Before this PR, we always enabled blocks bundling in map_batches, to bundle small blocks together for the given batch_size. This is good for batch prediction on GPU, but not good for CPU preprocessing with the default batch size (4096, which is too large). So here we decided to disable blocks bundling by default, and only enable blocks bundling when the user specifies batch_size. Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
… with default batch_size (#47433) When batch_size is not set, input blocks will not be bundled up. Add a comment explaining this. See #29971 and #47363 (comment) Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: Cheng Su scnju13@gmail.com
Why are these changes needed?
Before this PR, we always enabled blocks bundling in `map_batches` to bundle small blocks together for the given `batch_size`. This is good for batch prediction on GPU, but not good for CPU preprocessing with the default batch size (4096, which is too large). So here we decided to disable blocks bundling by default, and only enable it when the user specifies `batch_size`. See this doc for the full discussion.

Related issue number
Checks
I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
I've run `scripts/format.sh` to lint the changes in this PR.