[DataPipe] Implement BucketBatcher with max token #283
Conversation
Force-pushed: 3aa8207 to 43a8c81
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed: 43a8c81 to 25641ed
We also need to add this DataPipe to test_serialization.py and the documentation rst file.
Force-pushed: 25641ed to bcd3668
LGTM! Thanks!
Force-pushed: bcd3668 to 43ebc10
>>> source_dp = IterableWrapper(['1', '22', '1', '4444', '333', '1', '22', '22', '333'])
>>> batch_dp = source_dp.max_token_bucketize(max_token_count=5)
>>> list(batch_dp)
[['1', '1', '1', '22'], ['22', '22'], ['333'], ['333'], ['4444']]
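The doctest above relies on the default length function (the length of each string). As a rough illustration of the bucketing behavior being discussed, here is a minimal pure-Python sketch of the priority-queue approach described in this PR; the function name and the `buffer_size` parameter are illustrative, not the torchdata API:

```python
import heapq


def max_token_bucketize(items, max_token_count, len_fn=len, buffer_size=1000):
    """Yield batches whose total length stays within max_token_count.

    Keeps up to buffer_size items in a min-heap keyed by length, so the
    shortest items in the buffer are batched together first (option 1
    in the PR description below).
    """
    heap = []
    batch, batch_len = [], 0
    it = iter(items)

    def push_next():
        # Pull one more item from the source into the heap, if any remain.
        try:
            x = next(it)
        except StopIteration:
            return False
        heapq.heappush(heap, (len_fn(x), x))
        return True

    # Pre-fill the buffer.
    while len(heap) < buffer_size and push_next():
        pass

    while heap:
        length, x = heapq.heappop(heap)
        # Flush the current batch if adding this item would exceed the budget.
        if batch and batch_len + length > max_token_count:
            yield batch
            batch, batch_len = [], 0
        batch.append(x)
        batch_len += length
        push_next()  # refill the buffer as items are consumed
    if batch:
        yield batch
```

With the input from the doctest, this sketch reproduces the same batches: `[['1', '1', '1', '22'], ['22', '22'], ['333'], ['333'], ['4444']]`.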
I feel like this example is kind of confusing, since the len_fn is missing and len(str) is used as it. Can you change it to use 1 for all samples (like 1, 111, 1111), or add an indication of how to define the length of the string?
Sure, I can do that. I was using different numbers to represent the length of each string.
Force-pushed: 43ebc10 to ebdaf89
Add max_token_bucketizer

There are two options to implement a streaming bucketizer:

1. Use buffer as a priority queue to keep yielding the shortest token dynamically. Whenever a token is fetched from buffer, a new token is pushed into the buffer.
2. Fill up buffer, then yield batches from buffer without adding new tokens from the DataPipe until all batches have been yielded from buffer.

The time complexity of these two methods should be the same, O(N log(buffer_size)). I prefer the first approach, as it reduces the potential skewness of data within the local buffer.

Added in_batch_shuffle as syntax sugar and for future deterministic data preprocessing.
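Since the priority queue yields samples ordered by length, each batch comes out length-sorted; an in-batch shuffle can restore randomness inside each batch without changing any batch's token budget. A hedged sketch of what such an operation could look like (the name and seeding parameter are illustrative, not the torchdata implementation):

```python
import random


def in_batch_shuffle(batches, seed=None):
    """Shuffle elements within each batch while preserving batch order.

    Useful after length-based bucketing, where samples inside a batch
    arrive sorted by length; a seedable RNG keeps the preprocessing
    deterministic when reproducibility is needed.
    """
    rng = random.Random(seed)  # fixed seed => deterministic shuffles
    for batch in batches:
        shuffled = list(batch)  # copy so the input batch is untouched
        rng.shuffle(shuffled)
        yield shuffled
```

Each output batch contains exactly the same elements as its input batch, only reordered, so the max-token guarantee established by the bucketizer is preserved.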