Improve training performance when using augmentations / crop and minor model improvements #1050

Merged
merged 22 commits into autorope:main from cache_augmented_images on Nov 5, 2022

Conversation

DocGarbanzo (Contributor)

Improve training and simplify model interfaces

Caching

  • The current setup allows caching images in memory when they are read from disk. However, this feature only covers the original image, not the transformations or augmentations applied at a later stage in the training pipeline. Those run on the fly and cause a significant performance hit during training, because the preprocessing slows down the tf.data generator.
  • Here we extend the caching functionality of the TubRecord class to accept the image processor and cache the processed image instead of the raw image, which recovers the performance lost as described above. Image caching within the TubRecord remains switched on by default, as before; it can now be switched off by setting CACHE_IMAGES to False in the config file.
  • Transformations / augmentations are performed on uint8 images, and in training these are the cached objects. The normalisation to [0, 1] float64 data still happens on the fly, as it is fast, and caching the 8-times smaller uint8 images keeps memory consumption much lower. A minimal sketch of the caching idea follows this list.
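
The sketch below assumes a TubRecord-like class with an image() accessor; the class name, method signature and config access are illustrative, not the exact donkeycar API.

import numpy as np

class CachingRecordSketch:
    """Illustrative only: cache the processed uint8 image, not the raw one."""

    def __init__(self, config, underlying):
        self.config = config            # config object exposing CACHE_IMAGES
        self.underlying = underlying    # record dict, e.g. {'cam/image_array': 'img_0.jpg', ...}
        self._cached_image = None

    def image(self, processor=None):
        """Return the image; apply the processor (crop / augmentation) once
        and cache the uint8 result when CACHE_IMAGES is enabled."""
        if self._cached_image is not None:
            return self._cached_image
        img = self._load_from_disk()        # raw uint8 HxWxC array
        if processor is not None:
            img = processor(img)            # still uint8 after crop / augmentation
        if getattr(self.config, 'CACHE_IMAGES', True):
            self._cached_image = img        # cache the processed image, not the raw one
        return img

    def _load_from_disk(self):
        # placeholder: a real record would read the jpeg referenced in `underlying`
        return np.zeros((120, 160, 3), dtype=np.uint8)

Normalisation to [0, 1] floats then happens downstream in the tf.data pipeline, so only the small uint8 array stays in memory. Switching the cache off is a one-line config change: CACHE_IMAGES = False.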

KerasPilot

  • In addition, we simplify the model (KerasPilot) interfaces: instead of distinguishing between x_transform_and_process and x_translate, both are merged into a single x_transform call, and the same is applied to y_transform and y_translate. The original idea of returning numpy arrays from the transform functions and converting them into dictionaries in the translate functions turned out to offer no advantage, because pipeline transformations can be performed on the dictionaries just as well as on numpy data (see the sketch below).
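
Below is a minimal sketch of the merged interface, assuming a KerasPilot-like class; the record keys and the tensor names 'img_in', 'n_outputs0' and 'n_outputs1' are illustrative assumptions, not the exact donkeycar definitions.

import numpy as np

class PilotInterfaceSketch:
    """Illustrative only: x_transform returns the input dict directly,
    replacing the former x_transform_and_process / x_translate pair."""

    def x_transform(self, record, image_processor):
        # image() hands back the cached, already-processed uint8 image
        img = record.image(processor=image_processor)
        return {'img_in': img}              # a dict - no separate translate step

    def y_transform(self, record):
        angle = record.underlying['user/angle']
        throttle = record.underlying['user/throttle']
        return {'n_outputs0': np.array([angle]),
                'n_outputs1': np.array([throttle])}

Because both calls already return dictionaries, subsequent pipeline steps can be mapped over those dictionaries directly, which is why the extra translate step no longer carries any advantage.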

Other

  • Unrelated to the above, we found an issue with local pytorch tests and changed the metrics code to support newer versions of pytorch.

…ing so only the augmented images are getting cached - and they are uint8.
…, y_transform, y_translate:

* Combined the transform / translate functions into a single transform function, because the pipeline can operate directly on the dictionaries that translate used to return
* Enable switching off image caching by re-introducing CACHE_IMAGES as an optional setting in the config
* Updated test_train, as the fixture scope was a bit messed up w.r.t. the usage of the tub data and config file. Also replaced namedtuple with dataclass, which is a bit more modern (illustrated below).
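
As a minimal illustration of the namedtuple-to-dataclass change (the class and field names are hypothetical, not the actual test_train fixture):

from dataclasses import dataclass

# before: an untyped namedtuple
# TrainSetup = namedtuple('TrainSetup', ['tub_path', 'config_path', 'model_type'])

# after: a dataclass with type hints and defaults
@dataclass
class TrainSetup:
    tub_path: str
    config_path: str
    model_type: str = 'linear'
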
@DocGarbanzo requested a review from Ezward October 16, 2022 21:23
@DocGarbanzo self-assigned this Oct 18, 2022
@DocGarbanzo added the "Neural network" label (Anything for the neural network, the architecture, training, inferencing) Oct 18, 2022
@Ezward (Contributor) commented Oct 24, 2022

This branch improved my Epoch times from 5 minutes to 2 minutes. Total training time on my RTX-2060 went from 2.5 hours to 45 minutes for 21 epochs.

Epoch 21/100
117/117 [==============================] - ETA: 0s - loss: 0.0451 - n_outputs0_loss: 0.0356 - n_outputs1_loss: 0.0095
Epoch 00021: val_loss did not improve from 0.03901
117/117 [==============================] - 117s 1s/step - loss: 0.0451 - n_outputs0_loss: 0.0356 - n_outputs1_loss: 0.0095 - val_loss: 0.0414 - val_n_outputs0_loss: 0.0324 - val_n_outputs1_loss: 0.0089

Started at Sun, Oct 23, 2022 5:48:41 PM
Finished at Sun, Oct 23, 2022 6:34:34 PM

I have not tried this on a car yet. However, if I try to run tubplot I get an error:

export data_1715='C:/Users/emurm/projects/autorope/donkey_datasets/circuit_launch_20210716/murmurpi4_circuit_launch_20210716_1715/data'
donkey tubplot --tub="${data_1715}" --type=linear --model=models/foo.h5
...
INFO:donkeycar.parts.keras:Created KerasLinear with interpreter: KerasInterpreter
INFO:donkeycar.parts.keras:Loading model models/foo.h5
INFO:donkeycar.parts.interpreter:Loading model models/foo.h5
Using catalog C:\Users\emurm\projects\autorope\donkey_datasets\circuit_launch_20210716\murmurpi4_circuit_launch_20210716_1715\data\catalog_17.catalog
INFO:donkeycar.pipeline.types:Loading tubs from paths ['C:/Users/emurm/projects/autorope/donkey_datasets/circuit_launch_20210716/murmurpi4_circuit_launch_20210716_1715/data']
Inferencing
Traceback (most recent call last):
  File "C:\Users\emurm\Miniconda3\envs\donkey\Scripts\donkey-script.py", line 33, in <module>
    sys.exit(load_entry_point('donkeycar', 'console_scripts', 'donkey')())
  File "d:\projects\docgarbanzo\donkeycar\donkeycar\management\base.py", line 619, in execute_from_command_line
    c.run(args[2:])
  File "d:\projects\docgarbanzo\donkeycar\donkeycar\management\base.py", line 516, in run
    self.plot_predictions(cfg, args.tub, args.model, args.limit, args.type)
  File "d:\projects\docgarbanzo\donkeycar\donkeycar\management\base.py", line 468, in plot_predictions
    output_names = model.output_shapes()(1).keys()
TypeError: 'tuple' object is not callable

@DocGarbanzo (Contributor, Author)

Thanks @Ezward - there was indeed an error in donkey tubplot. I have corrected this and also added a test for donkey tubplot to prevent failure in future changes.
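
For context, the traceback above shows a tuple being called instead of indexed; a plausible shape of the one-line fix (an assumption inferred from the error message, not the actual commit) is:

# output_shapes() returns a pair of dicts, so the second element must be
# indexed with [] rather than called with ()
output_names = model.output_shapes()[1].keys()   # was: model.output_shapes()(1).keys()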

…o cache_augmented_images

# Conflicts:
#	donkeycar/management/base.py
…call to open the matplotlib window. Use this argument in the test.
@Ezward (Contributor) left a comment

This looks great. Really good speed improvement. Thanks.

@DocGarbanzo merged commit a40e217 into autorope:main Nov 5, 2022
@DocGarbanzo deleted the cache_augmented_images branch November 29, 2022 21:32