[Datasets] Add feature guide section for configuring batch_size in .map_batches() #29117
Conversation
Datasets will also bundle multiple blocks together for a single mapper task in order to better satisfy ``batch_size``, so if ``batch_size`` is a lot larger than your Dataset blocks (e.g. if your dataset was created with too large of a ``parallelism`` and/or the ``batch_size`` is set to too large of a value), the number of parallel mapper tasks may be less than expected and you may be leaving some throughput on the table.
Is this just referring to block coalescing? If so, won't the number of mapper tasks only be limited by `batch_size` and not the block size when parallelism is large?

`num_tasks = num_rows / max(batch_size, block_size)`

In other words, it's not really that the parallelism is too large, but still that the batch size is too large.
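For concreteness, plugging assumed numbers into that formula (all values hypothetical):

```python
# Hypothetical numbers for num_tasks = num_rows / max(batch_size, block_size):
num_rows = 10_000
parallelism = 100                       # read-time parallelism
block_size = num_rows // parallelism    # -> 100 rows per block
batch_size = 400                        # requested in .map_batches()

num_tasks = num_rows // max(batch_size, block_size)
print(num_tasks)  # -> 25 map tasks, even though there are 100 blocks
```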
> If so, won't the number of mapper tasks only be limited by `batch_size` and not the block size when parallelism is large?

Hmm, the parallelism determines the block size, so that's a bit of a tautology.

> In other words, it's not really that the parallelism is too large, but still that the batch size is too large.
But the default parallelism can be set to a large value (relative to the batch size) automatically by Datasets, so if the user is relying on the default parallelism but is manually setting the batch size (e.g. because it's tailored to the batch size needed to saturate a GPU for prediction), the parallelism may be the knob that they need to tune.
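For illustration, a minimal sketch of that situation (assuming a Ray ~2.x Datasets API; the dataset, the `predict_fn` stub, and the batch size of 256 are made-up values):

```python
import ray

def predict_fn(batch):
    # Stand-in for a model call that needs ~256 rows per batch to keep a GPU busy.
    return batch

# Relying on the default read parallelism while pinning batch_size to the model:
ds = ray.data.range(100_000)                   # Datasets picks the block count automatically
ds.map_batches(predict_fn, batch_size=256)

# If the automatically chosen parallelism yields blocks much smaller than 256 rows,
# blocks get bundled and the map parallelism drops below the read parallelism.
# Explicitly lowering the read parallelism keeps each block at least one batch wide:
ds = ray.data.range(100_000, parallelism=100)  # ~1,000 rows per block >= batch_size
ds.map_batches(predict_fn, batch_size=256)
```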
@matthewdeng Does that make sense? Any ideas for how we can improve this section? Otherwise I'm planning on merging this.
Ah, reading this over I think that I was referring to the last line:

> the number of parallel mapper tasks may be less than expected

In which case lowering `parallelism` (or increasing the block size) won't increase the number of parallel tasks?
This is more about making the transformation-time parallelism respect the read-time `parallelism` argument than increasing transformation-time parallelism, so maybe this doesn't need a call out? The user expects the number of parallel mapper tasks to be what they give for read `parallelism`, but if `parallelism` is set too high, this can result in blocks that are much smaller than `batch_size`, and since we bundle blocks to try to achieve a full `batch_size` batch in our map tasks, the map parallelism can unexpectedly be a lot smaller than the read `parallelism`, which is what we're trying to call out here. All that decreasing `parallelism` would do in this case is make the read-time parallelism more closely match the transformation-time parallelism.
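To make the discrepancy concrete, here is a small sketch with assumed numbers (Ray ~2.x API; the exact block counts after bundling may vary by Datasets version):

```python
import ray

# 100,000 rows split into 1,000 blocks => ~100 rows per block.
ds = ray.data.range(100_000, parallelism=1_000)
print(ds.num_blocks())      # 1,000 read-time blocks

# batch_size is 10x the block size, so ~10 blocks are bundled per map task and the
# map parallelism ends up on the order of 100 tasks, not the 1,000 the user may expect.
mapped = ds.map_batches(lambda batch: batch, batch_size=1_000)
print(mapped.num_blocks())  # roughly 100 blocks (one per map task) after bundling
```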
I see, my confusion was around what the user would consider "expected" behavior, and (maybe due to my existing knowledge) I assumed they would be limited by batch size. I guess the other thing about this paragraph is that (unlike the following paragraph) there's no concrete action for the user to take.
I don't have a concrete suggestion to improve this, so feel free to consider this comment non-blocking.
> I see, my confusion was around what the user would consider "expected" behavior, and (maybe due to my existing knowledge) I assumed they would be limited by batch size.
The commonly expected behavior I've heard from users is that their transformation parallelism is equal to their read parallelism, so this paragraph is mostly explaining that discrepancy for why those might not match.
> I guess the other thing about this paragraph is that (unlike the following paragraph) there's no concrete action for the user to take.
The action item of decreasing the batch size and/or decreasing the parallelism is somewhat implied by the "too large" comment, but since this is trying to alert about a possibly surprising discrepancy, not sure if we need a super explicit action item. 🤔
> I don't have a concrete suggestion to improve this, so feel free to consider this comment non-blocking.
I'm going to leave this as-is for now so we have something in the docs to point to when talking about this parallelism discrepancy, but agreed that we might want to revisit this. I'm thinking that a lot of this feature guide will get reworked with the new execution planner + model, so we can revisit this at that time.
@matthewdeng Ping for another review. I updated the feature guide section after @c21's PR: #29971. cc @ray-project/ray-docs for code-owner review.
LGTM with some minor comments.
:start-after: __configuring_batch_size_begin__
:end-before: __configuring_batch_size_end__

Increasing ``batch_size`` can result in faster execution by better leveraging SIMD
If it's just for SIMD, `batch_size` actually needs to be small enough to fit in the CPU cache, right?
I should probably reword this to "better leveraging vectorized operations and hardware" or something. There's the SIMD level of vectorization and there's also the native-core level of vectorization, where the latter refers to a data batch that's processed in a tight C loop, in the native-core of a particular data processing library (e.g. NumPy, Arrow, TensorFlow, Torch).
Along with the baseline universal truth that "looping in a native language is a lot faster than looping in Python", there are also a few different levels of hardware constraints here:
- SIMD: a parallelized operation over 256 bits of the batch - 4 64-bit elements
- CPU cache line: 64 bytes - 8 64-bit elements
- CPU caches: 32 KiB L1 cache - 4096 64-bit elements, 256 KiB L2 cache - 32768 64-bit elements, 16 MiB L3 cache - 2097152 64-bit elements
Assuming that the CPU does some quite aggressive cache line prefetching into the CPU caches when it detects our sequential/strided access, I would imagine that the stream of elements into the e.g. AVX2 instructions remains pretty hot.
So increasing from a batch size of 1 to 4 could lead to better SIMD fulfillment, increasing from 4 to 8 could lead to better cache line fulfillment, and increasing from 8 to e.g. 32, 64, 128 may better line up with cache line prefetching. It's difficult to be concrete without profiling a particular workload and setup, so maybe saying something like "better leveraging vectorized operations and hardware" is a good high-level umbrella statement.
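As a rough illustration of the native-loop point (a toy micro-benchmark, not taken from this PR; speedups vary by hardware and library):

```python
import time
import numpy as np

batch = np.random.rand(1_000_000)

# Per-element Python loop: every element pays interpreter overhead.
start = time.perf_counter()
slow = [x * 2.0 + 1.0 for x in batch]
loop_s = time.perf_counter() - start

# One vectorized call: a tight C loop that can use SIMD and cache-friendly access.
start = time.perf_counter()
fast = batch * 2.0 + 1.0
vec_s = time.perf_counter() - start

print(f"Python loop: {loop_s:.4f}s, vectorized: {vec_s:.4f}s")
```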
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Users may need guidance on configuring `batch_size` for `ds.map_batches()`, since the hardcoded default of 4096 will be suboptimal for wide tables and datasets containing large images, and the semantics of parallelism vs. batch size got more complicated with the block bundling PR. This PR adds a section for configuring `batch_size` to the "Transforming Datasets" feature guide.

TODOs
Related issue number
Closes #29116

Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.