[Datasets] Add feature guide section for configuring batch_size in .map_batches() #29117
Conversation
Datasets will also bundle multiple blocks together for a single mapper task in order to better satisfy ``batch_size``, so if ``batch_size`` is a lot larger than your Dataset blocks (e.g. if your dataset was created with too large of a ``parallelism`` and/or the ``batch_size`` is set to too large of a value), the number of parallel mapper tasks may be less than expected and you may be leaving some throughput on the table.
Is this just referring to block coalescing? If so, won't the number of mapper tasks only be limited by `batch_size` and not the block size when parallelism is large?

`num_tasks = num_rows / max(batch_size, block_size)`

In other words, it's not really that the parallelism is too large, but still that the batch size is too large.
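For concreteness, plugging assumed numbers into that formula (all values hypothetical):

```python
# Hypothetical numbers for num_tasks = num_rows / max(batch_size, block_size):
num_rows = 10_000
parallelism = 100                       # read-time parallelism
block_size = num_rows // parallelism    # -> 100 rows per block
batch_size = 400                        # requested in .map_batches()

num_tasks = num_rows // max(batch_size, block_size)
print(num_tasks)  # -> 25 map tasks, even though there are 100 blocks
```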
> If so, won't the number of mapper tasks only be limited by `batch_size` and not the block size when parallelism is large?

Hmm, the parallelism determines the block size, so that's a bit of a tautology.

> In other words, it's not really that the parallelism is too large, but still that the batch size is too large.
But the default parallelism can be set to a large value (relative to the batch size) automatically by Datasets, so if the user is relying on the default parallelism but is manually setting the batch size (e.g. because it's tailored to the batch size needed to saturate a GPU for prediction), the parallelism may be the knob that they need to tune.
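For illustration, a minimal sketch of that situation (assuming a Ray ~2.x Datasets API; the dataset, the `predict_fn` stub, and the batch size of 256 are made-up values):

```python
import ray

def predict_fn(batch):
    # Stand-in for a model call that needs ~256 rows per batch to keep a GPU busy.
    return batch

# Relying on the default read parallelism while pinning batch_size to the model:
ds = ray.data.range(100_000)                   # Datasets picks the block count automatically
ds.map_batches(predict_fn, batch_size=256)

# If the automatically chosen parallelism yields blocks much smaller than 256 rows,
# blocks get bundled and the map parallelism drops below the read parallelism.
# Explicitly lowering the read parallelism keeps each block at least one batch wide:
ds = ray.data.range(100_000, parallelism=100)  # ~1,000 rows per block >= batch_size
ds.map_batches(predict_fn, batch_size=256)
```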
@matthewdeng Does that make sense? Any ideas for how we can improve this section? Otherwise I'm planning on merging this.
Ah, reading this over I think that I was referring to the last line:

> the number of parallel mapper tasks may be less than expected

In which case lowering `parallelism` (or increasing the block size) won't increase the number of parallel tasks?
This is more about making the transformation-time parallelism respect the read-time `parallelism` argument than increasing transformation-time parallelism, so maybe this doesn't need a call out? The user expects the number of parallel mapper tasks to be what they give for read `parallelism`, but if `parallelism` is set too high, this can result in blocks that are much smaller than `batch_size`, and since we bundle blocks to try to achieve a full `batch_size` batch in our map tasks, the map parallelism can unexpectedly be a lot smaller than the read `parallelism`, which is what we're trying to call out here. All that decreasing `parallelism` would do in this case is make the read-time parallelism more closely match the transformation-time parallelism.
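To make the discrepancy concrete, here is a small sketch with assumed numbers (Ray ~2.x API; the exact block counts after bundling may vary by Datasets version):

```python
import ray

# 100,000 rows split into 1,000 blocks => ~100 rows per block.
ds = ray.data.range(100_000, parallelism=1_000)
print(ds.num_blocks())      # 1,000 read-time blocks

# batch_size is 10x the block size, so ~10 blocks are bundled per map task and the
# map parallelism ends up on the order of 100 tasks, not the 1,000 the user may expect.
mapped = ds.map_batches(lambda batch: batch, batch_size=1_000)
print(mapped.num_blocks())  # roughly 100 blocks (one per map task) after bundling
```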
I see, my confusion was around what the user would consider "expected" behavior, and (maybe due to my existing knowledge) I assumed they would be limited by batch size. I guess the other thing about this paragraph is that (unlike the following paragraph) there's no concrete action for the user to take.
I don't have a concrete suggestion to improve this, so feel free to consider this comment non-blocking.
> I see, my confusion was around what the user would consider "expected" behavior, and (maybe due to my existing knowledge) I assumed they would be limited by batch size.
The commonly expected behavior I've heard from users is that their transformation parallelism is equal to their read parallelism, so this paragraph is mostly explaining that discrepancy for why those might not match.
> I guess the other thing about this paragraph is that (unlike the following paragraph) there's no concrete action for the user to take.
The action item of decreasing the batch size and/or decreasing the parallelism is somewhat implied by the "too large" comment, but since this is trying to alert about a possibly surprising discrepancy, not sure if we need a super explicit action item. 🤔
> I don't have a concrete suggestion to improve this, so feel free to consider this comment non-blocking.
I'm going to leave this as-is for now so we have something in the docs to point to when talking about this parallelism discrepancy, but agreed that we might want to revisit this. I'm thinking that a lot of this feature guide will get reworked with the new execution planner + model, so we can revisit this at that time.
@matthewdeng Ping for another review. I updated the feature guide section after @c21's PR: #29971. cc @ray-project/ray-docs for code-owner review.
LGTM with some minor comments.
:start-after: __configuring_batch_size_begin__
:end-before: __configuring_batch_size_end__

Increasing ``batch_size`` can result in faster execution by better leveraging SIMD
If it's just for SIMD, `batch_size` actually needs to be small enough to fit in the CPU cache, right?
I should probably reword this to "better leveraging vectorized operations and hardware" or something. There's the SIMD level of vectorization and there's also the native-core level of vectorization, where the latter refers to a data batch that's processed in a tight C loop, in the native-core of a particular data processing library (e.g. NumPy, Arrow, TensorFlow, Torch).
Along with the baseline universal truth that "looping in a native language is a lot faster than looping in Python", there are also a few different levels of hardware constraints here:
- SIMD: a parallelized operation over 256 bits of the batch - 4 64-bit elements
- CPU cache line: 64 bytes - 8 64-bit elements
- CPU caches: 32 KiB L1 cache - 4096 64-bit elements, 256 KiB L2 cache - 32768 64-bit elements, 16 MiB L3 cache - 2097152 64-bit elements
Assuming that the CPU does some quite aggressive cache line prefetching into the CPU caches when it detects our sequential/strided access, I would imagine that the stream of elements into the e.g. AVX2 instructions remains pretty hot.
So increasing from a batch size of 1 to 4 could lead to better SIMD fulfillment, increasing from 4 to 8 could lead to better cache line fulfillment, and increasing from 8 to e.g. 32, 64, 128 may better line up with cache line prefetching. It's difficult to be concrete without profiling a particular workload and setup, so maybe saying something like "better leveraging vectorized operations and hardware" is a good high-level umbrella statement.
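As a rough illustration of the native-loop point (a toy micro-benchmark, not taken from this PR; speedups vary by hardware and library):

```python
import time
import numpy as np

batch = np.random.rand(1_000_000)

# Per-element Python loop: every element pays interpreter overhead.
start = time.perf_counter()
slow = [x * 2.0 + 1.0 for x in batch]
loop_s = time.perf_counter() - start

# One vectorized call: a tight C loop that can use SIMD and cache-friendly access.
start = time.perf_counter()
fast = batch * 2.0 + 1.0
vec_s = time.perf_counter() - start

print(f"Python loop: {loop_s:.4f}s, vectorized: {vec_s:.4f}s")
```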
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Users may need guidance on configuring `batch_size` for `ds.map_batches()`, since the hardcoded default of 4096 will be suboptimal for wide tables and datasets containing large images, and the semantics of parallelism vs. batch size got more complicated with the block bundling PR. This PR adds a section for configuring `batch_size` to the "Transforming Datasets" feature guide.

TODOs
Related issue number
Closes #29116

Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.