Expanded range shuffle #394
Merged
Conversation
snarayan21 force-pushed the expanded_range_shuffle branch from 13694b6 to d580d0f on August 29, 2023 18:26
knighton reviewed Aug 29, 2023
knighton approved these changes Aug 29, 2023
some minor nits then LFG
Description of changes:
Implements a new shuffle, the "expanded range" shuffle (py1e), in which the range over which a shard's samples can appear is expanded, and each sample is placed at a random position within that range.
Helpful slides: https://docs.google.com/presentation/d/1UHijFFgA0IPUxiOVv4aevGclSc83HKfJ6ae3PZtzc0c/edit?usp=sharing
Suppose a canonical node has 10 shards of 100 samples each, and our shuffle block size (SBS) is set to 500 samples. With py1e, each shard's samples are distributed over a window of at most 500 samples (equal to the shuffle block size). However, these windows cannot cross canonical node boundaries, because we don't want overlap between samples from different canonical nodes. So shard 1's window is clipped at the start of the node to 300 samples, shard 2's window is clipped to 400 samples, and shards 3 through 8 get the full 500-sample window. Mirroring the start of the node, shard 9's window is clipped at the end of the node to 400 samples, and shard 10's to 300 samples. The sketch below makes the window arithmetic concrete.
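Here is a minimal sketch of the per-shard window computation described above (illustrative only, not the library's actual py1e implementation; the symmetric-expansion rule is an assumption that reproduces the example's numbers):

```python
# Minimal sketch of the py1e window arithmetic described above.
# Assumption: each shard's range is expanded symmetrically by half of the
# leftover budget (shuffle_block_size - shard_size), then clipped to the
# canonical node boundaries. Not the library's actual implementation.

def expanded_windows(shard_sizes, shuffle_block_size):
    """Return one (start, end) sample window per shard within a canonical node."""
    node_size = sum(shard_sizes)
    windows = []
    begin = 0
    for size in shard_sizes:
        pad = max(shuffle_block_size - size, 0) // 2  # expansion on each side
        start = max(begin - pad, 0)                   # clip to node start
        end = min(begin + size + pad, node_size)      # clip to node end
        windows.append((start, end))
        begin += size
    return windows

# 10 shards of 100 samples each, shuffle block size 500: window sizes are
# [300, 400, 500, 500, 500, 500, 500, 500, 400, 300], matching the example.
print([end - start for start, end in expanded_windows([100] * 10, 500)])
```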
Within a canonical node, the total number of shards we need (assuming all shards are the same size) is given by SBS / (# samples per shard). This is the same for algorithms like py1b and py1br. However, when py1b and py1br cross canonical node boundaries, the predownload looks ahead into the next canonical node as training approaches the end of the current one. Because the first shuffle block of the next canonical node is fully shuffled, the predownload typically has to fetch many shards at once for that upcoming block. This causes a spike in downloading and requires a higher cache limit to store those shards without hurting throughput.
In contrast, with py1e shuffling, as you approach the end of a canonical node, the number of shards you need to proceed with training through the current canonical node approaches 0.5 * (SBS / (# samples per shard)). Similarly, at the start of the next canonical node, the number of shards needed to fulfill the first few batches also starts at 0.5 * (SBS / (# samples per shard)). This means I can maintain a small predownload that looks ahead into the next canonical node with a lower cache limit: in total I need 0.5 * (SBS / (# samples per shard)) + 0.5 * (SBS / (# samples per shard)) = SBS / (# samples per shard) shards, so the number of shards I need to store per node is constant throughout training. The arithmetic below spells this out with the example's numbers.
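As a quick back-of-the-envelope check using the illustrative numbers from the example above (these are not measured values, and the py1b peak is an inference from the boundary spike described earlier, not a benchmark):

```python
# Illustrative arithmetic only, using the example's numbers
# (SBS = 500, 100 samples per shard); not a benchmark.
sbs = 500
samples_per_shard = 100
shards_per_block = sbs / samples_per_shard  # 5 shards per shuffle block

# py1b / py1br near a canonical node boundary: roughly a full block of shards
# on each side of the boundary must be resident at once (inferred from the
# download spike described above).
py1b_peak = shards_per_block + shards_per_block              # 5 + 5 = 10

# py1e near the same boundary: windows taper toward the boundary, so each
# side needs only half a block of shards.
py1e_peak = 0.5 * shards_per_block + 0.5 * shards_per_block  # 2.5 + 2.5 = 5

print(py1b_peak, py1e_peak)  # 10.0 5.0
```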
Additionally, downloading is more balanced, since the number of shards I need to download ramps up steadily from 0.5 * SBS / (# samples per shard) at the beginning of a canonical node to SBS / (# samples per shard). This results in a smoother downloading curve than algorithms like py1b or even py1br, both of which download all the shards needed for a shuffle block within the span of a few batches. The toy simulation below illustrates the ramp.
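A toy simulation of that ramp (illustrative assumptions only: each sample's position is drawn uniformly at random within its shard's clipped window, which is not necessarily the exact placement rule py1e uses, and the 100-sample predownload horizon is arbitrary):

```python
import random

# Toy simulation of the py1e download ramp within one canonical node.
# Assumption: each sample's position is drawn uniformly within its shard's
# clipped window; this is illustrative, not the library's exact algorithm.
random.seed(0)
n_shards, shard_size, sbs = 10, 100, 500
node_size = n_shards * shard_size
pad = (sbs - shard_size) // 2  # expansion on each side of a shard's range

keyed = []
for shard in range(n_shards):
    begin = shard * shard_size
    start = max(begin - pad, 0)                     # clip to node start
    end = min(begin + shard_size + pad, node_size)  # clip to node end
    keyed += [(random.uniform(start, end), shard) for _ in range(shard_size)]
keyed.sort()
order = [shard for _, shard in keyed]

# Distinct shards needed within a 100-sample predownload horizon, sampled
# every 100 samples: the count ramps up gradually rather than spiking by a
# whole shuffle block at once.
print([len(set(order[i:i + 100])) for i in range(0, node_size, 100)])
```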
Issue #, if available:
https://mosaicml.atlassian.net/browse/STR-127
Merge Checklist:
Put an `x` without space in the boxes that apply. If you are unsure about any checklist item, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

Tests
- I have run `pre-commit` on my change. (Check out the `pre-commit` section of prerequisites.)