
Add EfficientNet example using automatic augmentations with DALI #4678

Merged: 11 commits merged into NVIDIA:main on Mar 14, 2023

Conversation

@klecki (Contributor) commented Feb 28, 2023

Category: New feature, Other

Description:

This example ports the EfficientNet sample from the DeepLearningExamples repository.

The example is limited to the efficientnet-b0 variant for simplicity. The DALI pipeline is updated to use the fn API and the new automatic augmentations, with options to select either AutoAugment or TrivialAugment.

main.py is adjusted so that the defaults are immediately suitable for EfficientNet training (previously they were the defaults for RN50 training), and launch.py is no longer needed. The original example was started via launch.py, which looked up the default values for a specific network in a .yml config and passed them to main.py. This way we can use main.py directly, without the layers of intermediate scripts.

The benchmarks from the readme are used to implement the L3 test.

The automatic augmentations come from #4648. This PR can already be reviewed, as the API is basically a one-line invocation within the pipeline definition, as sketched below.
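For readers unfamiliar with that API, here is a minimal sketch of what such an invocation looks like inside a pipeline definition. The surrounding reader, decoder, and sizes are illustrative, not the exact code of this example; only the `auto_augment` call reflects the new API:

```python
from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.auto_aug import auto_augment

# Automatic augmentations rely on DALI conditionals, hence the flag below.
@pipeline_def(enable_conditionals=True)
def training_pipe(data_dir):
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True)
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    images = fn.resize(images, size=[224, 224])
    # The one-line invocation: apply the ImageNet AutoAugment policy.
    images = auto_augment.auto_augment_image_net(images)
    return images, labels
```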

Additional information:

Affected modules and functionalities:

Docs/examples PR with L3 test.

Key points relevant for the review:

❗ Please check if the defaults in main.py match the ones in https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Classification/ConvNets/configs.yml for DGX-1V, efficientnet-b0.

How to review this PR

I suggest checking out the code and running a diff tool such as meld to compare these directories (see also the sketch after the list):

  • docs/examples/use_cases/pytorch/efficientnet/ from this PR
  • PyTorch/Classification/ConvNets/ from DeepLearningExamples

That way you can see that most of this PR is just files copied over.
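If you prefer the standard library to a GUI diff tool, something like this gives a quick overview (the DeepLearningExamples path is wherever you cloned it):

```python
import filecmp

# Recursively report identical, differing, and unique files between the two trees.
cmp = filecmp.dircmp(
    "docs/examples/use_cases/pytorch/efficientnet/",          # this PR
    "DeepLearningExamples/PyTorch/Classification/ConvNets/",  # upstream clone
)
cmp.report_full_closure()
```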

Individual commits description:

I tried my best to split the PR into commits that are easier to review. These are the steps (and the corresponding commits):

  1. Copy the contents of DeepLearningExamples PyTorch/Classification/ConvNets/.
  2. Remove as many files as possible from the resulting directory.
  3. Remove the usage of variants other than efficientnet-b0.
  4. Propagate the defaults from https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Classification/ConvNets/configs.yml to main.py.
  5. Add the new DALI pipeline, readme, and new arguments <- most changes here:
    • the new pipeline is defined in dali.py,
    • the old one is removed from dataloaders.py,
    • main.py gets some argument changes,
    • readme.rst is introduced.
  6. Add the new L3 test + some small fixes for the pipeline.

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: 3194

@github-advanced-security (bot) left a comment:

CodeQL found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

@klecki modified the milestone: Automatic augmentations support (Mar 6, 2023)
@klecki added the "automatic augmentations" label (Mar 6, 2023)
@klecki force-pushed the efficientnet-example branch 3 times, most recently from d072966 to 3a7a54a (March 6, 2023 16:06)
@klecki marked this pull request as ready for review (March 6, 2023 16:18)
@klecki force-pushed the efficientnet-example branch 2 times, most recently from b33039a to 341591d (March 6, 2023 16:27)
@klecki (Contributor, Author) commented Mar 6, 2023

To do:

  • apply in the test:

    [PASS] bench_report_synthetic.json above threshold: 11213.31192785647 >= 10800
    [PASS] bench_report_dali.json above threshold: 9440.822217370116 >= 6000
    [PASS] bench_report_dali_aa.json above threshold: 9535.54048843674 >= 9000
    [PASS] bench_report_dali_ta.json above threshold: 9485.628250563821 >= 9000
    [PASS] bench_report_pytorch.json above threshold: 7501.932983753756 >= 7200
    [PASS] bench_report_pytorch_aa.json above threshold: 7329.357160196307 >= 7200

  • reduce the number of epochs, possibly disable some variants.
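For context, the check those lines come from is just a per-report threshold comparison. A hedged sketch of that logic; the JSON field name and report layout are assumptions, not the actual format used by the L3 test:

```python
import json
import sys

# Thresholds mirror a few of the PASS lines above; paths and fields are illustrative.
THRESHOLDS = {
    "bench_report_dali.json": 6000,
    "bench_report_dali_aa.json": 9000,
    "bench_report_pytorch.json": 7200,
}

def check(path: str, threshold: float) -> bool:
    with open(path) as f:
        throughput = json.load(f)["avg_throughput"]  # assumed field name
    status = "PASS" if throughput >= threshold else "FAIL"
    print(f"[{status}] {path} above threshold: {throughput} >= {threshold}")
    return throughput >= threshold

if __name__ == "__main__":
    sys.exit(0 if all(check(p, t) for p, t in THRESHOLDS.items()) else 1)
```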

@jantonguirao (Contributor) commented:

Can you make this PR have a first commit with a copy of what's in DeepLearningExamples, so that I can see what you actually changed?

@klecki (Contributor, Author) commented Mar 7, 2023:

> Can you make this PR have a first commit with a copy of what's in DeepLearningExamples, so that I can see what you actually changed?

It is already done this way; see the PR description for details about the individual commits.

```python
    else:
        output = images

    output = fn.crop_mirror_normalize(output, dtype=types.FLOAT, output_layout=types.NCHW,
```
Contributor:
Suggested change:

```diff
- output = fn.crop_mirror_normalize(output, dtype=types.FLOAT, output_layout=types.NCHW,
+ output = fn.crop_mirror_normalize(output, dtype=types.FLOAT, output_layout="CHW",
```

types.NCHW was deprecated years ago.

Contributor Author:

I just realized that we should have NHWC as a parameter here; I wonder how it even got a good training result.

Contributor Author:

So we are doing an unnecessary double transposition; I wonder if, and by how much, we could get faster without it.

Contributor Author:

We can get around the transposition. I am not sure it gives any benefit; the benchmarks give me a bit more samples/s, but that might just be noise.

Either way, I now use "CHW" and "HWC" and produce the memory in the target layout for the NHWC case.
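A sketch of what that resolution looks like; this is not the PR's exact code. `memory_format` is a hypothetical parameter, and the mean/std values are the standard ImageNet ones, assumed here:

```python
import nvidia.dali.fn as fn
import nvidia.dali.types as types

def normalize(images, memory_format="nchw"):
    # Pick the layout up front so the data is produced directly in the layout
    # the model consumes, instead of transposing twice after the fact.
    output_layout = "CHW" if memory_format == "nchw" else "HWC"
    return fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout=output_layout,
        crop=(224, 224),
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
```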

```python
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)

    images = fn.resize(images, resize_shorter=image_size, interp_type=interpolation,
                       antialias=False)
```
Contributor:

Do you really want to disable antialiasing? Just asking.

Contributor Author:

I took it from the original pipeline.

@jantonguirao (Contributor) left a review:

LGTM, minor comments only.

@szalpal (Member) left a review:

I put some comments in the readme. I still have the .sh file left to review.


This example shows how DALI's implementation of automatic augmentations - most notably `AutoAugment <https://arxiv.org/abs/1805.09501>`_ and `TrivialAugment <https://arxiv.org/abs/2103.10158>`_ - can be used in training. It shows the training of EfficientNet, an image classification model first described in `EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks <https://arxiv.org/abs/1905.11946>`_.

The code is based on `NVIDIA Deep Learning Examples <https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/ConvNets/efficientnet>`_ - it has been extended with a DALI pipeline supporting automatic augmentations, which can be found :fileref:`here <docs/examples/use_cases/pytorch/efficientnet/image_classification/dali.py>`.
Member:

The fileref thingy does not render properly.

Contributor Author:

It does in Sphinx; I took it from the SSD example. I will try to please both Sphinx and GitHub, but I don't know if that is possible.

* ``--data-backend`` parameter was changed to accept ``dali``, ``pytorch``, or ``synthetic``. It is set to ``dali`` by default.
* ``--dali-device`` was added to control the placement of some of the DALI operators.
* ``--augmentation`` was replaced with ``--automatic-augmentation``, now supporting ``disabled``, ``autoaugment``, and ``trivialaugment`` values.
* ``--workers`` defaults were halved to accommodate DALI. The value is automatically doubled when the ``pytorch`` data loader is used.
Member:

It would be nice to explain why the workers needed to be halved to accommodate DALI.

Contributor Author:

I added a bit about fitting both loaders with a good default, but I am not sure I really want to dive deep into how it works.
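The mechanics boil down to one default that suits DALI (which, presumably, needs fewer CPU workers because it runs its own processing threads), doubled back when the pytorch loader is selected. A hedged sketch; the names are illustrative, not the PR's code:

```python
def effective_workers(requested: int, data_backend: str) -> int:
    # DALI needs fewer CPU workers; the pytorch loader gets the value
    # doubled back to the previous default.
    return requested * 2 if data_backend == "pytorch" else requested

print(effective_workers(5, "dali"))     # 5
print(effective_workers(5, "pytorch"))  # 10
```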

(Review thread on docs/examples/use_cases/pytorch/efficientnet/readme.rst: outdated, resolved.)

* For inference:

  * Scale to the target image size + 32
Member:

I believe it is not really clear what +32 means in this context. Could you explain this?

Contributor Author:

It's just 224 + 32; the definition was taken from the original model, as was (partially) this description. I will reword it a bit.
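In other words, the validation preprocessing resizes the shorter edge to image_size + 32 (224 + 32 = 256) and then takes a central image_size crop, the usual ImageNet-style evaluation recipe. A sketch of that step, assuming it sits inside the DALI pipeline definition:

```python
import nvidia.dali.fn as fn

def val_preprocess(images, image_size=224):
    # Resize the shorter edge to image_size + 32, then center-crop to image_size.
    images = fn.resize(images, resize_shorter=image_size + 32)
    return fn.crop(images, crop=(image_size, image_size))
```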

(Two more review threads on docs/examples/use_cases/pytorch/efficientnet/readme.rst: outdated, resolved.)
Comment on lines 51 to 52
```python
# TODO(klecki): Move it back again
import torchvision.datasets as datasets
import torchvision.transforms as transforms
```
Member:

I don't know if this TODO is a leftover or something that should stay. However, if it should stay, I believe it should better describe what needs to be done ;)

Contributor Author:

It doesn't matter much, but it breaks in my local setup as torchvision clashes with my local build of DALI, so I need to import DALI first.
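The workaround the TODO refers to, sketched; the point is only the import order:

```python
# Import DALI first so a locally built DALI does not clash with the
# libraries that torchvision pulls in.
import nvidia.dali  # noqa: F401

import torchvision.datasets as datasets
import torchvision.transforms as transforms
```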

```python
    # of Pipeline definition, this `if` statement relies on a static scalar parameter, so it is
    # evaluated exactly once during build - we either include automatic augmentations or not.
    if automatic_augmentation == "autoaugment":
        shapes = fn.peek_image_shape(jpegs)
```
@stiepan (Member) commented Mar 14, 2023:

If the image sizes are uniform (and they are, thanks to the resize), we can skip the shapes, use just the absolute version, and set max_translation_abs=250 or 224.

Contributor Author:

I think it is better to show the more flexible version in this example if it doesn't cause perf issues. Let me check.
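The two variants under discussion, sketched as a fragment of the pipeline definition. The parameter is named `max_translate_abs` in the DALI auto_aug API, which is assumed here to be what the comment calls max_translation_abs:

```python
from nvidia.dali import fn
from nvidia.dali.auto_aug import auto_augment

def apply_auto_augment(jpegs, images, flexible=True):
    if flexible:
        # Per-image shapes: translation magnitudes scale with each image.
        shapes = fn.peek_image_shape(jpegs)
        return auto_augment.auto_augment_image_net(images, shape=shapes)
    # Uniform sizes (guaranteed by the earlier resize): a fixed absolute limit.
    return auto_augment.auto_augment_image_net(images, max_translate_abs=250)
```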

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Adjust some configuration options to accommodate it.
Remove the obsolete pipeline

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
The test runs 1 less epoch than the readme to make it a bit shorter.

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@dali-automaton (Collaborator): CI MESSAGE: [7580814]: BUILD STARTED

Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@dali-automaton (Collaborator): CI MESSAGE: [7582436]: BUILD STARTED

@dali-automaton (Collaborator): CI MESSAGE: [7580814]: BUILD FAILED

@dali-automaton (Collaborator): CI MESSAGE: [7582436]: BUILD PASSED

@klecki merged commit 6715606 into NVIDIA:main on Mar 14, 2023
aderylo pushed a commit to zpp-dali-2022/DALI that referenced this pull request Mar 17, 2023
Add EfficientNet example using automatic augmentations with DALI (NVIDIA#4678)


Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
@JanuszL mentioned this pull request Sep 6, 2023
Labels: automatic augmentations - Automatic augmentations (AutoAugment, RandAugment, TrivialAugment and more) support in DALI.

6 participants