[DataPipe] Implement BucketBatcher with max token #283

Closed
ejguan wants to merge 1 commit

Conversation

@ejguan (Contributor) commented Mar 7, 2022

Add max_token_bucketizer

There are two options for implementing a streaming bucketizer:

  • Use the buffer as a priority queue, continuously yielding the shortest sample. Whenever a sample is popped from the buffer, a new sample is pushed into it.
  • Sort the buffer and yield batches from it, without adding new samples from the DataPipe until all batches have been yielded from the buffer.

The time complexity of the two methods should be the same, O(N log(buffer_size)). I prefer the first approach, as it reduces the potential skewness of data within the local buffer (see the sketch below).

Added in_batch_shuffle as syntactic sugar and for future deterministic data preprocessing.
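
For context, here is a minimal sketch of the first (priority-queue) option. This is not the implementation in this PR; the function name, the len_fn/buffer_size parameters, and their defaults are illustrative assumptions.

```python
import heapq


def max_token_bucketize_sketch(source, max_token_count, len_fn=len, buffer_size=16):
    """Sketch of option 1: pop the shortest element from a bounded
    priority-queue buffer, refill the buffer after every pop, and cut a
    batch whenever adding the popped element would exceed max_token_count.
    Illustrative only, not the torchdata implementation."""
    it = iter(source)
    heap = []   # entries are (length, insertion_order, element)
    order = 0   # tie-breaker so heapq never compares the elements themselves

    def push_next():
        nonlocal order
        try:
            x = next(it)
        except StopIteration:
            return False
        heapq.heappush(heap, (len_fn(x), order, x))
        order += 1
        return True

    # Pre-fill the buffer up to buffer_size.
    while len(heap) < buffer_size and push_next():
        pass

    batch, batch_tokens = [], 0
    while heap:
        length, _, x = heapq.heappop(heap)
        push_next()  # keep the buffer full while the source lasts
        if batch and batch_tokens + length > max_token_count:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(x)
        batch_tokens += length
    if batch:
        yield batch
```

With the string example discussed later in this thread, the sketch produces the same batches:

```python
>>> data = ['1', '22', '1', '4444', '333', '1', '22', '22', '333']
>>> list(max_token_bucketize_sketch(data, max_token_count=5))
[['1', '1', '1', '22'], ['22', '22'], ['333'], ['333'], ['4444']]
```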

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Mar 7, 2022
@ejguan requested review from NivekT and nateanl on March 7, 2022 22:52
@facebook-github-bot (Contributor) commented:

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ejguan requested a review from mthrok on March 8, 2022 20:25
@NivekT (Contributor) left a comment:

We also need to add this DataPipe to test_serialization.py and the documentation .rst file.

@NivekT (Contributor) left a comment:

LGTM! Thanks!

@facebook-github-bot (Contributor) commented:

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

>>> source_dp = IterableWrapper(['1', '22', '1', '4444', '333', '1', '22', '22', '333'])
>>> batch_dp = source_dp.max_token_bucketize(max_token_count=5)
>>> list(batch_dp)
[['1', '1', '1', '22'], ['22', '22'], ['333'], ['333'], ['4444']]
A reviewer (Member) commented on the example above:

I feel like this example is kind of confusing, since the len_fn is missing and len(str) is used in its place. Can you change it to use 1 for all samples (like 1, 111, 1111), or add an indication of how to define the length of the string?

@ejguan (Contributor, Author) replied:

Sure, I can do that. I was using a different number of characters to represent the length of each string.
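
For illustration, the same example with the length function passed explicitly; this assumes the len_fn argument referenced above defaults to len:

>>> source_dp = IterableWrapper(['1', '22', '1', '4444', '333', '1', '22', '22', '333'])
>>> batch_dp = source_dp.max_token_bucketize(max_token_count=5, len_fn=len)
>>> list(batch_dp)
[['1', '1', '1', '22'], ['22', '22'], ['333'], ['333'], ['4444']]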

@facebook-github-bot (Contributor) commented:

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
