[DataPipe] Implement BucketBatcher with max token #283

Closed
ejguan wants to merge 1 commit

Conversation

@ejguan (Contributor) commented Mar 7, 2022

Add max_token_bucketizer

There are two options for implementing a streaming bucketizer:

  • Use the buffer as a priority queue, continuously yielding the shortest sample. Whenever a sample is popped from the buffer, a new sample is pushed into it.
  • Sort the buffer and yield batches from it, without adding new samples from the DataPipe until all batches have been yielded from the buffer.

The time complexity of the two methods should be the same, O(N log(buffer_size)). I prefer the first approach, as it reduces the potential skewness of data within the local buffer (see the sketch below).

Added in_batch_shuffle as syntactic sugar and for future deterministic data preprocessing.
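
For context, here is a minimal sketch of the first (priority-queue) option. This is not the implementation in this PR; the function name, the len_fn/buffer_size parameters, and their defaults are illustrative assumptions.

```python
import heapq


def max_token_bucketize_sketch(source, max_token_count, len_fn=len, buffer_size=16):
    """Sketch of option 1: pop the shortest element from a bounded
    priority-queue buffer, refill the buffer after every pop, and cut a
    batch whenever adding the popped element would exceed max_token_count.
    Illustrative only, not the torchdata implementation."""
    it = iter(source)
    heap = []   # entries are (length, insertion_order, element)
    order = 0   # tie-breaker so heapq never compares the elements themselves

    def push_next():
        nonlocal order
        try:
            x = next(it)
        except StopIteration:
            return False
        heapq.heappush(heap, (len_fn(x), order, x))
        order += 1
        return True

    # Pre-fill the buffer up to buffer_size.
    while len(heap) < buffer_size and push_next():
        pass

    batch, batch_tokens = [], 0
    while heap:
        length, _, x = heapq.heappop(heap)
        push_next()  # keep the buffer full while the source lasts
        if batch and batch_tokens + length > max_token_count:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(x)
        batch_tokens += length
    if batch:
        yield batch
```

With the string example discussed later in this thread, the sketch produces the same batches:

```python
>>> data = ['1', '22', '1', '4444', '333', '1', '22', '22', '333']
>>> list(max_token_bucketize_sketch(data, max_token_count=5))
[['1', '1', '1', '22'], ['22', '22'], ['333'], ['333'], ['4444']]
```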

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Mar 7, 2022
@ejguan requested review from NivekT and nateanl on March 7, 2022 22:52
@facebook-github-bot (Contributor) commented:

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ejguan requested a review from mthrok on March 8, 2022 20:25
@NivekT (Contributor) left a comment:

We also need to add this DataPipe to test_serialization.py and the documentation .rst file.

@NivekT (Contributor) left a comment:

LGTM! Thanks!

@facebook-github-bot (Contributor) commented:

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

>>> source_dp = IterableWrapper(['1', '22', '1', '4444', '333', '1', '22', '22', '333'])
>>> batch_dp = source_dp.max_token_bucketize(max_token_count=5)
>>> list(batch_dp)
[['1', '1', '1', '22'], ['22', '22'], ['333'], ['333'], ['4444']]
A reviewer (Member) commented on the example above:

I feel like this example is kind of confusing, since the len_fn is missing and len(str) is used in its place. Can you change it to use 1 for all samples (like 1, 111, 1111), or add an indication of how to define the length of the string?

@ejguan (Contributor, Author) replied:

Sure, I can do that. I was using a different number of characters to represent the length of each string.
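
For illustration, the same example with the length function passed explicitly; this assumes the len_fn argument referenced above defaults to len:

>>> source_dp = IterableWrapper(['1', '22', '1', '4444', '333', '1', '22', '22', '333'])
>>> batch_dp = source_dp.max_token_bucketize(max_token_count=5, len_fn=len)
>>> list(batch_dp)
[['1', '1', '1', '22'], ['22', '22'], ['333'], ['333'], ['4444']]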

@facebook-github-bot (Contributor) commented:

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
