
[Datasets] Add feature guide section for configuring batch_size in .map_batches() #29117

Conversation

@clarkzinzow (Contributor) commented Oct 6, 2022

Users may need guidance on configuring batch_size for ds.map_batches(), since the hardcoded default of 4096 will be suboptimal for wide tables and datasets containing large images, and the semantics of parallelism vs. batch size got more complicated with the block bundling PR. This PR adds a section for configuring batch_size to the "Transforming Datasets" feature guide.
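As context for the guide being added, here is a minimal sketch of the kind of call the new section documents. The dataset, the transform, and the chosen values are invented for illustration, and it assumes the Ray 2.x Datasets API current at the time (where simple, non-tabular datasets yield batches as Python lists):

```python
import ray

# A toy dataset split into 10 blocks; `parallelism` controls the block count.
ds = ray.data.range(10_000, parallelism=10)

# map_batches() defaults to batch_size=4096, which can be too large for wide
# tables or datasets of large images. Passing a smaller batch_size bounds how
# much data each mapper task materializes at once.
def double(batch):
    # For simple (non-tabular) datasets in Ray 2.x, `batch` is a Python list.
    return [x * 2 for x in batch]

ds = ds.map_batches(double, batch_size=256)
print(ds.take(5))
```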

TODOs

  • (Follow-up, after batch size is configurable for preprocessors) Add code example showing parallelism vs. batch size, where batch size is tweaked?

Related issue number

Closes #29116

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

doc/source/data/transforming-datasets.rst (outdated review thread, resolved)
Comment on lines 328 to 332
Datasets will also bundle multiple blocks together for a single mapper task in order
to better satisfy ``batch_size``, so if ``batch_size`` is a lot larger than your Dataset
blocks (e.g. if your dataset was created with too large of a ``parallelism`` and/or the
``batch_size`` is set to too large of a value), the number of parallel mapper tasks may
be less than expected and you may be leaving some throughput on the table.
Contributor:

Is this just referring to block coalescing? If so, won't the number of mapper tasks only be limited by batch_size and not the block size when parallelism is large?

num_tasks = num_rows / max(batch_size, block_size)

In other words, it's not really that the parallelism is too large, but still that the batch size is too large.
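As a back-of-the-envelope illustration of that formula (all numbers invented for this sketch, not taken from the PR):

```python
# Hypothetical numbers, purely for illustration.
num_rows    = 100_000
parallelism = 1_000                    # read-time parallelism -> ~100 rows per block
block_size  = num_rows // parallelism  # ~100 rows
batch_size  = 4_096                    # default map_batches() batch size

# Block bundling packs enough blocks into each mapper task to fill one batch,
# so the task count is driven by whichever of batch_size / block_size is larger.
num_tasks = num_rows // max(batch_size, block_size)
print(num_tasks)  # ~24 mapper tasks, far fewer than the 1,000 read tasks
```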

@clarkzinzow (Contributor Author) commented Oct 21, 2022:

> If so, won't the number of mapper tasks only be limited by batch_size and not the block size when parallelism is large?

Hmm, the parallelism determines the block size, so that's a bit of a tautology.

> In other words, it's not really that the parallelism is too large, but still that the batch size is too large.

But the default parallelism can be set to a large value (relative to the batch size) automatically by Datasets, so if the user is relying on the default parallelism but is manually setting the batch size (e.g. because it's tailored to the batch size needed to saturate a GPU for prediction), the parallelism may be the knob that they need to tune.

@clarkzinzow (Contributor Author):

@matthewdeng Does that make sense? Any ideas for how we can improve this section? Otherwise I'm planning on merging this.

Contributor:

Ah, reading this over, I think I was referring to the last line:

> the number of parallel mapper tasks may be less than expected

In which case lowering parallelism (or increasing block_size) won't increase the number of parallel tasks?

@clarkzinzow (Contributor Author):

This is more about making the transformation-time parallelism respect the read-time parallelism argument than about increasing transformation-time parallelism, so maybe this doesn't need a callout?

The user expects the number of parallel mapper tasks to match what they gave for read parallelism, but if parallelism is set too high, the blocks can end up much smaller than batch_size. Since we bundle blocks to try to achieve a full batch_size batch in our map tasks, the map parallelism can unexpectedly be a lot smaller than the read parallelism, which is what we're trying to call out here. All that decreasing parallelism would do in this case is make the read-time parallelism more closely match the transformation-time parallelism.
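A rough sketch of the discrepancy described above, using made-up sizes; the task count in the comments is an estimate of the bundling behavior, not a documented guarantee:

```python
import ray

# Read-time parallelism of 1,000 on a 10,000-row dataset gives ~10 rows per block.
ds = ray.data.range(10_000, parallelism=1_000)
print(ds.num_blocks())  # ~1,000 blocks

# With the default batch_size of 4096, Datasets bundles many small blocks into
# each mapper task to form full batches, so the transformation runs with
# roughly ceil(10_000 / 4_096) = 3 tasks rather than the 1,000 a user might
# expect from the read parallelism.
ds = ds.map_batches(lambda batch: batch, batch_size=4096)
```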

Contributor:

I see, my confusion was around what the user would consider "expected" behavior, and (maybe due to my existing knowledge) I assumed they would be limited by batch size. I guess the other thing about this paragraph is that (unlike the following paragraph) there's no concrete action for the user to take.

I don't have a concrete suggestion to improve this, so feel free to consider this comment non-blocking.

@clarkzinzow (Contributor Author):

> I see, my confusion was around what the user would consider "expected" behavior, and (maybe due to my existing knowledge) I assumed they would be limited by batch size.

The commonly expected behavior I've heard from users is that their transformation parallelism is equal to their read parallelism, so this paragraph is mostly explaining why those might not match.

> I guess the other thing about this paragraph is that (unlike the following paragraph) there's no concrete action for the user to take.

The action item of decreasing the batch size and/or decreasing the parallelism is somewhat implied by the "too large" comment, but since this is trying to alert users to a possibly surprising discrepancy, I'm not sure we need a super explicit action item. 🤔

> I don't have a concrete suggestion to improve this, so feel free to consider this comment non-blocking.

I'm going to leave this as-is for now so we have something in the docs to point to when talking about this parallelism discrepancy, but agreed that we might want to revisit it. I'm thinking that a lot of this feature guide will get reworked with the new execution planner + model, so we can revisit it at that time.

doc/source/data/transforming-datasets.rst (outdated review thread, resolved)
@clarkzinzow force-pushed the datasets/docs/batch-size-feature-guide branch from 55af475 to 133ded7 on October 21, 2022 22:35
@clarkzinzow force-pushed the datasets/docs/batch-size-feature-guide branch from 133ded7 to 6ebe8ac on November 14, 2022 14:35
@clarkzinzow (Contributor Author) commented:

@matthewdeng Ping for another review; I updated the feature guide section after @c21's PR: #29971

cc @ray-project/ray-docs for code-owner review

@c21 (Contributor) left a comment:

LGTM with some minor comments.

doc/source/data/transforming-datasets.rst (several outdated review threads, resolved)
:start-after: __configuring_batch_size_begin__
:end-before: __configuring_batch_size_end__

Increasing ``batch_size`` can result in faster execution by better leveraging SIMD
Contributor:

If it's just for SIMD, batch_size actually needs to be small enough to fit in the CPU cache, right?

@clarkzinzow (Contributor Author) commented Dec 20, 2022:

I should probably reword this to "better leveraging vectorized operations and hardware" or something. There's the SIMD level of vectorization, and there's also the native-core level of vectorization, where the latter refers to a data batch being processed in a tight C loop in the native core of a particular data processing library (e.g. NumPy, Arrow, TensorFlow, Torch).

Along with the baseline universal truth that "looping in a native language is a lot faster than looping in Python", there are also a few different levels of hardware constraints here:

  1. SIMD: parallelized operation over 256 bits of the batch - 4 64-bit elements
  2. CPU cache line: 64 bytes - 8 64-bit elements
  3. CPU caches: 32 KiB L1 cache - 4096 64-bit elements, 256 KiB L2 cache - 32768 64-bit elements, 16 MiB L3 cache - 2097152 64-bit elements

Assuming that the CPU does quite aggressive cache-line prefetching into the CPU caches when it detects our sequential/strided access, I would imagine that the stream of elements into, e.g., the AVX2 instructions remains pretty hot.

So increasing from a batch size of 1 to 4 could lead to better SIMD fulfillment, increasing from 4 to 8 could lead to better cache line fulfillment, and increasing from 8 to e.g. 32, 64, 128 may better line up with cache line prefetching. It's difficult to be concrete without profiling a particular workload and setup, so maybe saying something like "better leveraging vectorized operations and hardware" is a good high-level umbrella statement.
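A rough, self-contained illustration of the "native loop vs. Python loop" point using NumPy (not from the PR; actual speedups depend entirely on the workload and hardware):

```python
import time
import numpy as np

data = np.random.rand(1_000_000)

# Element-at-a-time processing: every iteration pays Python interpreter overhead.
start = time.perf_counter()
slow = [x * 2.0 + 1.0 for x in data]
print("python loop :", time.perf_counter() - start)

# Whole-batch processing: one call drops into NumPy's native core, which runs
# a tight C loop over contiguous memory and can use SIMD instructions.
start = time.perf_counter()
fast = data * 2.0 + 1.0
print("vectorized  :", time.perf_counter() - start)
```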

@ericl added the @author-action-required label ("The PR author is responsible for the next step. Remove tag to send back to the reviewer.") on Nov 17, 2022
stale bot commented Dec 17, 2022:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale bot added the stale label ("The issue is stale. It will be closed within 7 days unless there are further conversation") on Dec 17, 2022
stale bot removed the stale label on Dec 19, 2022
@clarkzinzow force-pushed the datasets/docs/batch-size-feature-guide branch from 6ebe8ac to 3f9e686 on December 20, 2022 20:30
@clarkzinzow merged commit a667019 into ray-project:master on Dec 22, 2022
AmeerHajAli pushed a commit that referenced this pull request on Jan 12, 2023: …n ``.map_batches()`` (#29117)

tamohannes pushed a commit to ju2ez/ray that referenced this pull request on Jan 25, 2023: …n ``.map_batches()`` (ray-project#29117), signed off by tmynn <hovhannes.tamoyan@gmail.com>
Labels: @author-action-required (The PR author is responsible for the next step. Remove tag to send back to the reviewer.)

Projects: None yet

Development

Successfully merging this pull request may close these issues.

[AIR - Datasets] Add feature guide for configuring batch_size for ds.map_batches()
5 participants