data/benchmarks/
#416
Comments
Thanks for opening this issue @msaroufim ! On top of model training time and accuracy, I think we'll also want to monitor the time for the DataLoader to yield an entire epoch (or 5), without a training loop. Ultimately we do care about training time, but it depends a lot on the GPU (and the number of GPUs). Regarding the vision models to benchmark, I would suggest the following instead of Resnet50 and Resnet128:
(This is taken from past investigations by @datumbox, unrelated to datapipes.) I spent a lot of time porting the torchvision training references to use datapipes. I don't think they're suitable for the kind of benchmark we want to do here (they support tons of other training features, so they're too complex to be public as-is), but they could be a good start. Happy to get you started if you need.
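As a rough illustration of the epoch-timing measurement mentioned above (iterate the DataLoader with no training loop), something like the sketch below could be used; the dataset and loader settings are placeholders, not an actual benchmark configuration:

```python
import time
from torch.utils.data import DataLoader

def time_epochs(dataloader: DataLoader, num_epochs: int = 5) -> list:
    """Iterate over the DataLoader without a model, timing each epoch."""
    epoch_times = []
    for _ in range(num_epochs):
        start = time.perf_counter()
        for _batch in dataloader:
            pass  # consume the batch; no forward/backward pass
        epoch_times.append(time.perf_counter() - start)
    return epoch_times

# Hypothetical usage:
# loader = DataLoader(my_dataset, batch_size=32, num_workers=8)
# print(time_epochs(loader))
```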
FYI I just published this PR pytorch/vision#6196 which adds datapipe support to torchvision's classification training reference (without all the complex async-io stuff). DataLoaderV2 doesn't support the DistributedReadingService right now so I'm sticking to DL1, but I'll start running more intensive benchmarks on my side as well.
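To illustrate the DL1 route (this is a hedged sketch, not the code from pytorch/vision#6196): a datapipe pipeline can be consumed by the classic DataLoader directly. The path, masks, and loader parameters below are placeholders, and decoding/transforms are omitted:

```python
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import FileLister, FileOpener

def read_bytes(item):
    # Read raw bytes so samples can cross worker process boundaries.
    path, stream = item
    return path, stream.read()

dp = FileLister(root="/path/to/train", masks="*.JPEG")  # placeholder path
dp = dp.shuffle().sharding_filter()   # shuffle, then shard across workers
dp = FileOpener(dp, mode="rb")        # yields (path, stream) tuples
dp = dp.map(read_bytes)               # decoding/transforms omitted

# DL1: the classic DataLoader accepts an IterDataPipe like any IterableDataset.
loader = DataLoader(dp, batch_size=32, num_workers=4, collate_fn=list)
```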
Some basic results, which are consistent with what I had a few months ago (the full numbers were posted in a collapsed "Benchmarking" section).
I will start running more in-depth experiments, e.g. completely removing the model-training part, to see if we can identify what could cause such stark differences.
For ref: running just the model training with a pre-loaded dataset (no IO, no transforms) takes ~13 mins for both datapipes and mapstyle datasets. This is the "best" possible training time, assuming data-loading time is zero. Note: we should ignore the first epoch because these file systems are sensitive to warm-up / caching.
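For context, a "model training only" baseline like the one described can be measured roughly as sketched below: one pre-generated in-memory batch is reused on every step, so I/O and transforms contribute nothing. The model, batch size, and step count are arbitrary placeholders:

```python
import time
import torch
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.resnet50().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

# One pre-generated batch, reused on every step: no I/O, no transforms.
images = torch.randn(32, 3, 224, 224, device=device)
labels = torch.randint(0, 1000, (32,), device=device)

start = time.perf_counter()
for _ in range(100):  # placeholder step count
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
if device == "cuda":
    torch.cuda.synchronize()
print(f"pure model-training time: {time.perf_counter() - start:.1f}s")
```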
Summary: Towards #416. This is a modified and simplified version of the torchvision classification training reference that provides:
- Distributed Learning (DDP) vs 1-GPU training
- Datapipes (with DataLoader or torchdata.dataloader2) vs Iterable datasets (non-DP) vs MapStyle Datasets
- Full training procedure, or data-loading only (with or without transforms), or model training only (generating fake datasets)
- Timing of data-loading vs model training
- Any classification model from torchvision

I removed a lot of non-essential features from the original reference, but I can simplify further. Typically I would expect the `MetricLogger` to disappear, or be trimmed down to its most essential bits.

Pull Request resolved: #714
Reviewed By: NivekT
Differential Revision: D38569273
Pulled By: NicolasHug
fbshipit-source-id: 1bc4442ab826256123f8360c14dc8b3eccd73256
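The "timing of data-loading vs model training" point can be illustrated with a sketch like the one below (not the actual reference code): it splits the time spent waiting on the loader from the time spent in the training step.

```python
import time

def train_one_epoch(model, criterion, optimizer, data_loader, device):
    """Return (data_time, model_time) for one epoch; a simplified sketch."""
    data_time, model_time = 0.0, 0.0
    end = time.perf_counter()
    for images, targets in data_loader:
        data_time += time.perf_counter() - end  # time spent waiting on the loader

        step_start = time.perf_counter()
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
        model_time += time.perf_counter() - step_start  # forward/backward/step

        end = time.perf_counter()
    return data_time, model_time
```

For precise GPU timing a torch.cuda.synchronize() would be needed before each clock read; the sketch ignores that for simplicity.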
🚀 The feature
We're proposing a folder to hold all benchmark scripts, which would be easily reproducible by anyone from the core PyTorch Data team, the PyTorch domain teams, and the broader community. (Original author: @vitaly-fedyunin.)
Motivation, pitch
As pytorch/data gains more widespread adoption, there are going to be more questions about its performance, so it's important to have reusable, reproducible scripts. The dev team also needs to be able to monitor for regressions between releases and use benchmarks to inform additional performance optimizations.

The script should be runnable with clear instructions and dependencies in a README.md, and it should be possible to run the same script in CI with no changes. The script should output metrics in a human-readable markdown file.

The main metric we're going to look at is time to convergence in training against the traditional Dataset baseline, using both DataLoader v1 and the experimental DataLoader v2. The second most important metric is model accuracy, to make sure we don't degrade training performance too much (see shuffling issues).
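As one possible shape for the human-readable markdown output (the metric names and values below are made up for illustration):

```python
# Hypothetical benchmark results written as a markdown table.
results = {
    "time_to_convergence_min": 92.4,
    "top1_accuracy_pct": 75.9,
    "dataloading_time_per_epoch_min": 18.2,
}

with open("benchmark_results.md", "w") as f:
    f.write("| metric | value |\n")
    f.write("|---|---|\n")
    for name, value in results.items():
        f.write(f"| {name} | {value} |\n")
```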
The final outcome should support the cross product of all of the configurations below (one way to enumerate the grid is sketched after this list).
Datasets
Models
Storage configuration
Other Metrics
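A sketch of how the cross product could be enumerated; the entries in each list are placeholders, not the final set of datasets, models, or storage backends:

```python
from itertools import product

# Placeholder configuration axes; the real lists come from the sections above.
datasets = ["imagenet", "coco"]
models = ["resnet50", "mobilenet_v3_large"]
storage = ["local_ssd", "remote_http"]
loaders = ["mapstyle+DL1", "datapipes+DL1", "datapipes+DL2"]

for config in product(datasets, models, storage, loaders):
    print(config)  # each tuple corresponds to one benchmark run
```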
Alternatives
No response
Additional context
For each of the datasets, continuously track the following implementations:
Ideally we would put one of these large datasets in an S3 bucket, but S3 will throttle it. Instead, it's best to set up an EC2 instance with a simple HTTP server that makes the dataset available on an attached SSD disk, which will allow us to do single-node 8-GPU experiments. For multi-node we need to come up with a story for distributed storage.
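A minimal sketch of such a simple HTTP server (the mount path and port are placeholders; this is essentially what `python -m http.server --directory` does):

```python
import functools
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Serve the dataset directory on the attached SSD over plain HTTP.
handler = functools.partial(SimpleHTTPRequestHandler, directory="/mnt/ssd/datasets")
HTTPServer(("0.0.0.0", 8080), handler).serve_forever()
```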
The main metrics we need to look at: