data/benchmarks #422
Conversation
May 24 notes: Notes from discussion with Vitaly
May 31 notes
Next thing I'd like to try out is including these datasets from
The call graph for data pipe construction is long, but it's not the data loading that's the bottleneck here, since utilization is just 7%. I need to fix the collation problems and bump up the batch size. Also, I finally figured out how to build torchtext from source, so I can use those datapipes as well: pytorch/text#1743
Gave it a quick look before our meeting!
else:
    # No further preprocessing needed; this returns a tuple of Images and labels as ints
    # Do I need to do batching and collation manually?
Do I need to do batching and collation manually?
No, you just need to pass `batch_size=...` to the `DataLoader`.
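For illustration, a minimal sketch of that approach (`dp` here is just a stand-in for the image datapipe being discussed):

```python
from torch.utils.data import DataLoader

# Sketch: let the DataLoader handle batching/collation of an IterDataPipe.
# `dp` stands in for the image datapipe above and yields (image, label) samples.
dl = DataLoader(dp, batch_size=32)
for batch in dl:
    ...  # each batch is already collated by the default collate_fn
```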
Nicolas is right that you can do that for `DataLoader`. I would think it is better to use `.batch`, since it will be necessary for `DataLoaderV2`. Then you can pass the same DataPipe to both versions of the DataLoader without more changes later.
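A rough sketch of that suggestion (assuming `dp` is an IterDataPipe and `batch_size` is defined as elsewhere in this PR):

```python
from torch.utils.data import DataLoader

# Batch (and collate) on the DataPipe itself so the same pipe can be passed to
# DataLoader (with batch_size=None, since batching is already done upstream)
# or to DataLoaderV2 later without further changes.
dp_batched = dp.batch(batch_size).collate()
dl = DataLoader(dp_batched, batch_size=None)
```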
benchmarks/datasets.py
dp = dp.map(
    lambda sample: torch.tensor(str_to_list(sample.to_categories())).to(torch.device(device)),
    input_col="label",
)

# Batch
dp = dp.batch(batch_size)
Not sure this is needed as long as you pass `batch_size=...` to the DataLoader (but not certain either).
if num_workers == 1:
    dl = DataLoader(dataset=data, batch_size=batch_size, shuffle=shuffle)

# Shuffle won't work in distributed yet
Shuffle and sharding won't work out of the box with DDP. There are some suggestions here: pytorch/text#1755, but no definitive recommended practices yet.
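For reference, the single-process pattern is to put shuffling and a sharding filter on the pipe itself; this sketch does not solve the DDP case, which is exactly what the linked issue is about:

```python
# Shuffle then shard on the DataPipe. With the classic DataLoader this splits
# work across DataLoader workers; sharding correctly across DDP ranks (and
# keeping shuffles in sync) is the open question referenced above.
dp = dp.shuffle().sharding_filter()
```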
Summary: In working on #422 I realized we needed an easier way to run benchmarks on larger datasets that may not be available in the domain libraries. This PR is a prerequisite to any sort of scaling benchmarks and is generally a useful reader for the community. https://huggingface.co/docs/datasets/how_to has about 10,000 datasets we can leverage out of the box; in particular, mc4 is one we need to prove out large-scale benchmarks for text. See the test and docstring for usage instructions.

### Changes
- Added a new `HuggingFaceHubReaderIterDataPipe` so we can load a large number of datasets for performance benchmarks
- Added a test which is skipped if the `datasets` library does not exist
- pytest passes
- Got rid of `StreamWrapper`
- Is there any documentation update I should make?

Pull Request resolved: #490
Reviewed By: NivekT, ninginthecloud
Differential Revision: D36910175
Pulled By: msaroufim
fbshipit-source-id: 3ce2d5bc0ad46b626baa87b59930a3c6f5361425
Co-authored-by: Mark Saroufim <marksaroufim@fb.com>
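For context, a rough usage sketch of the reader added in #490. The argument names here are assumptions (the exact signature is in the PR's test and docstring); the idea is that keyword arguments are forwarded to `datasets.load_dataset` with streaming enabled:

```python
from torchdata.datapipes.iter import HuggingFaceHubReader  # added in #490

# Stream a large dataset (e.g. mc4 English) without downloading it up front.
# Argument names are illustrative and may not match the final API exactly.
dp = HuggingFaceHubReader(dataset="mc4", name="en", split="train")

for i, sample in enumerate(dp):
    print(sample["text"][:80])
    if i == 2:
        break
```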
Discussion with Vitaly June 21
Overall, LGTM with a few comments. Let me know what additional features you plan to add.
nit: Need copyright headers for `.py` files
Thanks @NivekT, will address all your feedback. As far as new features to add for this PR: not much, I think; there's a bunch of cleanup I need to do.
And I think the next PR should be focused on integrating the AWS CLI into CI, where we can benchmark a distributed-systems setup per @NicolasHug's request. After that, we can see which of the partner integrations should be added to this setup as well.
Need to drop usage of `dataloader_experimental`.
if dataloaderv == 1:
    from torch.utils.data import DataLoader
elif dataloaderv == 2:
    from torch.utils.data.dataloader_experimental import DataLoader2 as DataLoader
Please avoid using this one; perhaps you need to create your own wrapper, which will use DLv2 from the torchdata repo and automatically create a MultiProcessingReadingService.
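A minimal sketch of such a wrapper, assuming `torchdata.dataloader2` is available (the helper name is made up for illustration):

```python
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService

def make_dataloader2(datapipe, num_workers: int = 0):
    # Hypothetical helper: build a DLv2 and attach a multiprocessing reading
    # service when workers are requested, mirroring DataLoader(num_workers=...).
    reading_service = (
        MultiProcessingReadingService(num_workers=num_workers) if num_workers > 0 else None
    )
    return DataLoader2(datapipe, reading_service=reading_service)
```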
This is up to @NivekT, if he wants to have this update as a follow-up or within this PR.
@NivekT do you have more context? I'm not sure I follow what change is being requested.
I think `dataloader_experimental.DataLoader2` is bound to be removed eventually. The actual `DataLoader2` is in `torchdata.dataloader2`.
I have an example in my benchmarking PR, if you're interested: https://github.com/pytorch/vision/pull/6196/files#diff-32b42103e815b96c670a0b5f0db055fe63f10fc8776ccbb6aa9b61a6940abba0R207-R211
Yep, I think using `torchdata.dataloader2` with `MultiProcessingReadingService` is preferred.
duration = int

@dataclass
@NivekT what do you think about this kind of reporting instead? There are probably still a few bugs around, but I just want to make sure I'm not overengineering things.
@msaroufim @VitalyFedyunin @NivekT following up on my earlier comments in #416 (comment), I also have a separate PR (pytorch/vision#6196) that already provides support for the cross-product of:
(It also has FFCV support, but that's less relevant for us here.) Since it's directly adapted from the torchvision recipes, it's also a bit closer to the kind of training that users would be doing in the wild. Do you think it would make sense to join our benchmarking efforts here? I'm happy to provide support if you'd like to collaborate. CC @nairbv
@NicolasHug I am in the process of going through both setups, running them on our AWS cluster, and identifying the differences. I agree that combining the efforts is the right approach. Let me dig a bit deeper first and I can schedule a meeting for all of us to chat.
@NicolasHug I think the right way to divide this up would be
I would also like to eventually do something like pulling any of the HF datasets and just benchmarking there, but I don't believe the datasets there give me sufficient information to create a toy model with the right shapes automatically. But yeah, would love to talk.
Summary: Per our offline discussion, I am separating out the cloud part of #422, which will allow us to provision AWS infra from the command line for offline benchmarks and potentially integrate into CI and release decisions. In follow-up PRs, I will make the template more interesting and revisit the logger discussion so we can build this. (This PR is the middle node.)

![Screen Shot 2022-07-25 at 10 14 40 AM](https://user-images.githubusercontent.com/3282513/180836193-f75e25c4-c2ed-4d8d-8f14-67849e232a44.png)

Here's a test that the template works:

![Screen Shot 2022-07-25 at 10 15 13 AM](https://user-images.githubusercontent.com/3282513/180836270-0d26c697-eb8f-46bb-9f76-db76761555d2.png)

Pull Request resolved: #680
Reviewed By: NivekT
Differential Revision: D38124731
Pulled By: msaroufim
fbshipit-source-id: 5b6749e291f06849a5a25b3b01fb7430c656e695
total_duration: int = 0

class MetricExporter(ABC):
Following up on our discussion from Monday, I took a look at this and I think it should be reasonably simple to plug it into the other `MetricLogger` from the torchvision recipe (it's a "metric calculator" more than a "logger", and we can remove the "log" part in favour of this one).
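To make the idea concrete, here is a rough sketch of what the dataclass-based report plus an abstract exporter might look like; the field and class names below are assumptions based on the fragments quoted above, not the actual PR code:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List

@dataclass
class BenchmarkReport:
    # Illustrative fields only; the real report in the PR may differ.
    total_duration: int = 0
    per_batch_durations: List[float] = field(default_factory=list)

class MetricExporter(ABC):
    @abstractmethod
    def export(self, report: BenchmarkReport) -> None:
        ...

class StdoutExporter(MetricExporter):
    def export(self, report: BenchmarkReport) -> None:
        print(f"total_duration={report.total_duration}s, batches={len(report.per_batch_durations)}")
```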
Fixes #416
Changes