data/benchmarks/
#416
Comments
Thanks for opening this issue @msaroufim ! On top of model training time and accuracy, I think we'll also want to monitor the time for the DataLoader to yield an entire epoch (or 5), without a training loop. Ultimately we do care about training time, but it depends a lot on the GPU (and the number of GPUs). Regarding the vision models to benchmark, I would suggest the following instead of Resnet50 and Resnet128:
(This is taken from past investigations by @datumbox, unrelated to datapipes.) I spent a lot of time porting the torchvision training references to use datapipes. I don't think they're suitable for the kind of benchmark we want to do here (they support tons of other training features, so they're too complex to be public as-is), but they could be a good start. Happy to get you started if you need.
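As a rough illustration of the epoch-timing measurement mentioned above (iterate the DataLoader with no training loop), something like the sketch below could be used; the dataset and loader settings are placeholders, not an actual benchmark configuration:

```python
import time
from torch.utils.data import DataLoader

def time_epochs(dataloader: DataLoader, num_epochs: int = 5) -> list:
    """Iterate over the DataLoader without a model, timing each epoch."""
    epoch_times = []
    for _ in range(num_epochs):
        start = time.perf_counter()
        for _batch in dataloader:
            pass  # consume the batch; no forward/backward pass
        epoch_times.append(time.perf_counter() - start)
    return epoch_times

# Hypothetical usage:
# loader = DataLoader(my_dataset, batch_size=32, num_workers=8)
# print(time_epochs(loader))
```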
FYI I just published this PR pytorch/vision#6196 which adds datapipe support to torchvision's classification training reference (without all the complex async-io stuff). DataLoaderV2 doesn't support the DistributedReadingService right now so I'm sticking to DL1, but I'll start running more intensive benchmarks on my side as well.
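To illustrate the DL1 route (this is a hedged sketch, not the code from pytorch/vision#6196): a datapipe pipeline can be consumed by the classic DataLoader directly. The path, masks, and loader parameters below are placeholders, and decoding/transforms are omitted:

```python
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import FileLister, FileOpener

def read_bytes(item):
    # Read raw bytes so samples can cross worker process boundaries.
    path, stream = item
    return path, stream.read()

dp = FileLister(root="/path/to/train", masks="*.JPEG")  # placeholder path
dp = dp.shuffle().sharding_filter()   # shuffle, then shard across workers
dp = FileOpener(dp, mode="rb")        # yields (path, stream) tuples
dp = dp.map(read_bytes)               # decoding/transforms omitted

# DL1: the classic DataLoader accepts an IterDataPipe like any IterableDataset.
loader = DataLoader(dp, batch_size=32, num_workers=4, collate_fn=list)
```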
Some basic results, which are consistent with what I had a few months ago (the full numbers were posted in a collapsed "Benchmarking" section).
I will start running more in-depth experiments, e.g. completely removing the model-training part, to see if we can identify what could cause such stark differences.
For ref: running just the model training with a pre-loaded dataset (no IO, no transforms) takes ~13 mins for both datapipes and mapstyle datasets. This is the "best" possible training time, assuming data-loading time is zero. Note: we should ignore the first epoch because these file systems are sensitive to warm-up / caching.
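For context, a "model training only" baseline like the one described can be measured roughly as sketched below: one pre-generated in-memory batch is reused on every step, so I/O and transforms contribute nothing. The model, batch size, and step count are arbitrary placeholders:

```python
import time
import torch
import torchvision

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.resnet50().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

# One pre-generated batch, reused on every step: no I/O, no transforms.
images = torch.randn(32, 3, 224, 224, device=device)
labels = torch.randint(0, 1000, (32,), device=device)

start = time.perf_counter()
for _ in range(100):  # placeholder step count
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
if device == "cuda":
    torch.cuda.synchronize()
print(f"pure model-training time: {time.perf_counter() - start:.1f}s")
```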
Summary: Towards #416. This is a modified and simplified version of the torchvision classification training reference that provides:
- Distributed Learning (DDP) vs 1-GPU training
- Datapipes (with DataLoader or torchdata.dataloader2) vs Iterable datasets (non-DP) vs MapStyle Datasets
- Full training procedure, or data-loading only (with or without transforms), or model training only (generating fake datasets)
- Timing of data-loading vs model training
- Any classification model from torchvision

I removed a lot of non-essential features from the original reference, but I can simplify further. Typically I would expect the `MetricLogger` to disappear, or be trimmed down to its most essential bits.

Pull Request resolved: #714
Reviewed By: NivekT
Differential Revision: D38569273
Pulled By: NicolasHug
fbshipit-source-id: 1bc4442ab826256123f8360c14dc8b3eccd73256
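The "timing of data-loading vs model training" point can be illustrated with a sketch like the one below (not the actual reference code): it splits the time spent waiting on the loader from the time spent in the training step.

```python
import time

def train_one_epoch(model, criterion, optimizer, data_loader, device):
    """Return (data_time, model_time) for one epoch; a simplified sketch."""
    data_time, model_time = 0.0, 0.0
    end = time.perf_counter()
    for images, targets in data_loader:
        data_time += time.perf_counter() - end  # time spent waiting on the loader

        step_start = time.perf_counter()
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()
        model_time += time.perf_counter() - step_start  # forward/backward/step

        end = time.perf_counter()
    return data_time, model_time
```

For precise GPU timing a torch.cuda.synchronize() would be needed before each clock read; the sketch ignores that for simplicity.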
🚀 The feature
We're proposing a folder to hold all benchmark scripts, which would be easily reproducible by anyone from the core PyTorch Data team, the PyTorch domain teams, and the broader community. (Original author: @vitaly-fedyunin.)
Motivation, pitch
As pytorch/data gains more widespread adoption, there are going to be more questions about its performance, so it's important to have reusable, reproducible scripts. The dev team also needs to be able to monitor for regressions between releases and use benchmarks to inform additional performance optimizations.

The script should be runnable with clear instructions and dependencies in a README.md, and it should be possible to run the same script in CI with no changes. The script should output metrics in a human-readable markdown file.

The main metric we're going to look at is time to convergence in training against the traditional Dataset baseline, using both DataLoader v1 and the experimental DataLoader v2. The second most important metric is model accuracy, to make sure we don't degrade training performance too much (see shuffling issues).
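As one possible shape for the human-readable markdown output (the metric names and values below are made up for illustration):

```python
# Hypothetical benchmark results written as a markdown table.
results = {
    "time_to_convergence_min": 92.4,
    "top1_accuracy_pct": 75.9,
    "dataloading_time_per_epoch_min": 18.2,
}

with open("benchmark_results.md", "w") as f:
    f.write("| metric | value |\n")
    f.write("|---|---|\n")
    for name, value in results.items():
        f.write(f"| {name} | {value} |\n")
```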
The final outcome should support the cross product of all of the configurations below (one way to enumerate the grid is sketched after this list).
Datasets
Models
Storage configuration
Other Metrics
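A sketch of how the cross product could be enumerated; the entries in each list are placeholders, not the final set of datasets, models, or storage backends:

```python
from itertools import product

# Placeholder configuration axes; the real lists come from the sections above.
datasets = ["imagenet", "coco"]
models = ["resnet50", "mobilenet_v3_large"]
storage = ["local_ssd", "remote_http"]
loaders = ["mapstyle+DL1", "datapipes+DL1", "datapipes+DL2"]

for config in product(datasets, models, storage, loaders):
    print(config)  # each tuple corresponds to one benchmark run
```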
Alternatives
No response
Additional context
For each of the datasets, continuously track the following implementations:
Ideally we would put one of these large datasets in an S3 bucket, but S3 will throttle it. Instead, it's best to set up an EC2 instance with a simple HTTP server that makes the dataset available on an attached SSD disk, which will allow us to do single-node 8-GPU experiments. For multi-node we need to come up with a story for distributed storage.
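A minimal sketch of such a simple HTTP server (the mount path and port are placeholders; this is essentially what `python -m http.server --directory` does):

```python
import functools
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Serve the dataset directory on the attached SSD over plain HTTP.
handler = functools.partial(SimpleHTTPRequestHandler, directory="/mnt/ssd/datasets")
HTTPServer(("0.0.0.0", 8080), handler).serve_forever()
```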
The main metrics we need to look at: