Improve training performance when using augmentations / crop and minor model improvements #1050

Merged
merged 22 commits into autorope:main from cache_augmented_images on Nov 5, 2022

Conversation

DocGarbanzo (Contributor)

Improve training and simplify model interfaces

Caching

  • The current setup allows caching images in memory when they are read from disk. However, this feature only covers the original image, not the transformations or augmentations applied at a later stage in the training pipeline. Those run on the fly and cause a significant performance hit during training, because the preprocessing slows down the tf.data generator.
  • Here we extend the caching functionality of the TubRecord class to accept the image processor and cache the processed image instead of the raw image, which recovers the performance lost as described above. Image caching within the TubRecord remains switched on by default, as before; it can now be switched off by setting CACHE_IMAGES to False in the config file.
  • Transformations / augmentations are performed on uint8 images, and in training these are the cached objects. The normalisation to [0, 1] float64 data still happens on the fly, as it is fast, and caching the 8-times smaller uint8 images keeps memory consumption much lower. A minimal sketch of the caching idea follows this list.
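
The sketch below assumes a TubRecord-like class with an image() accessor; the class name, method signature and config access are illustrative, not the exact donkeycar API.

import numpy as np

class CachingRecordSketch:
    """Illustrative only: cache the processed uint8 image, not the raw one."""

    def __init__(self, config, underlying):
        self.config = config            # config object exposing CACHE_IMAGES
        self.underlying = underlying    # record dict, e.g. {'cam/image_array': 'img_0.jpg', ...}
        self._cached_image = None

    def image(self, processor=None):
        """Return the image; apply the processor (crop / augmentation) once
        and cache the uint8 result when CACHE_IMAGES is enabled."""
        if self._cached_image is not None:
            return self._cached_image
        img = self._load_from_disk()        # raw uint8 HxWxC array
        if processor is not None:
            img = processor(img)            # still uint8 after crop / augmentation
        if getattr(self.config, 'CACHE_IMAGES', True):
            self._cached_image = img        # cache the processed image, not the raw one
        return img

    def _load_from_disk(self):
        # placeholder: a real record would read the jpeg referenced in `underlying`
        return np.zeros((120, 160, 3), dtype=np.uint8)

Normalisation to [0, 1] floats then happens downstream in the tf.data pipeline, so only the small uint8 array stays in memory. Switching the cache off is a one-line config change: CACHE_IMAGES = False.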

KerasPilot

  • In addition, we simplify the model (KerasPilot) interfaces: instead of distinguishing between x_transform_and_process and x_translate, both are merged into a single x_transform call, and the same is applied to y_transform and y_translate. The original idea of returning numpy arrays from the transform functions and converting them into dictionaries in the translate functions turned out to offer no advantage, because pipeline transformations can be performed on the dictionaries just as well as on numpy data (see the sketch below).
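
Below is a minimal sketch of the merged interface, assuming a KerasPilot-like class; the record keys and the tensor names 'img_in', 'n_outputs0' and 'n_outputs1' are illustrative assumptions, not the exact donkeycar definitions.

import numpy as np

class PilotInterfaceSketch:
    """Illustrative only: x_transform returns the input dict directly,
    replacing the former x_transform_and_process / x_translate pair."""

    def x_transform(self, record, image_processor):
        # image() hands back the cached, already-processed uint8 image
        img = record.image(processor=image_processor)
        return {'img_in': img}              # a dict - no separate translate step

    def y_transform(self, record):
        angle = record.underlying['user/angle']
        throttle = record.underlying['user/throttle']
        return {'n_outputs0': np.array([angle]),
                'n_outputs1': np.array([throttle])}

Because both calls already return dictionaries, subsequent pipeline steps can be mapped over those dictionaries directly, which is why the extra translate step no longer carries any advantage.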

Other

  • Unrelated to the above, we found an issue with local pytorch tests and changed the metrics code to support newer versions of pytorch.

…ing so only the augmented images are getting cached - and they are uint8.
…, y_transform, y_translate:

* Combined the transform / translate functions into a single transform function, because the pipeline can operate directly on the dictionaries that translate used to return
* Enable switching off image caching by re-introducing CACHE_IMAGES as an optional setting in the config
* Updated test_train, as the fixture scope was a bit messed up w.r.t. the usage of the tub data and config file. Also replaced namedtuple with dataclass, which is a bit more modern (illustrated below).
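
As a minimal illustration of the namedtuple-to-dataclass change (the class and field names are hypothetical, not the actual test_train fixture):

from dataclasses import dataclass

# before: an untyped namedtuple
# TrainSetup = namedtuple('TrainSetup', ['tub_path', 'config_path', 'model_type'])

# after: a dataclass with type hints and defaults
@dataclass
class TrainSetup:
    tub_path: str
    config_path: str
    model_type: str = 'linear'
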
@DocGarbanzo requested a review from Ezward October 16, 2022 21:23
@DocGarbanzo self-assigned this Oct 18, 2022
@DocGarbanzo added the "Neural network" label (Anything for the neural network, the architecture, training, inferencing) Oct 18, 2022
@Ezward (Contributor) commented Oct 24, 2022

This branch improved my Epoch times from 5 minutes to 2 minutes. Total training time on my RTX-2060 went from 2.5 hours to 45 minutes for 21 epochs.

Epoch 21/100
117/117 [==============================] - ETA: 0s - loss: 0.0451 - n_outputs0_loss: 0.0356 - n_outputs1_loss: 0.0095
Epoch 00021: val_loss did not improve from 0.03901
117/117 [==============================] - 117s 1s/step - loss: 0.0451 - n_outputs0_loss: 0.0356 - n_outputs1_loss: 0.0095 - val_loss: 0.0414 - val_n_outputs0_loss: 0.0324 - val_n_outputs1_loss: 0.0089

Started at Sun, Oct 23, 2022 5:48:41 PM
Finished at Sun, Oct 23, 2022 6:34:34 PM

I have not tried this on a car yet. However, if I try to run tubplot I get an error:

export data_1715='C:/Users/emurm/projects/autorope/donkey_datasets/circuit_launch_20210716/murmurpi4_circuit_launch_20210716_1715/data'
donkey tubplot --tub="${data_1715}" --type=linear --model=models/foo.h5
...
INFO:donkeycar.parts.keras:Created KerasLinear with interpreter: KerasInterpreter
INFO:donkeycar.parts.keras:Loading model models/foo.h5
INFO:donkeycar.parts.interpreter:Loading model models/foo.h5
Using catalog C:\Users\emurm\projects\autorope\donkey_datasets\circuit_launch_20210716\murmurpi4_circuit_launch_20210716_1715\data\catalog_17.catalog
INFO:donkeycar.pipeline.types:Loading tubs from paths ['C:/Users/emurm/projects/autorope/donkey_datasets/circuit_launch_20210716/murmurpi4_circuit_launch_20210716_1715/data']
Inferencing
Traceback (most recent call last):
  File "C:\Users\emurm\Miniconda3\envs\donkey\Scripts\donkey-script.py", line 33, in <module>
    sys.exit(load_entry_point('donkeycar', 'console_scripts', 'donkey')())
  File "d:\projects\docgarbanzo\donkeycar\donkeycar\management\base.py", line 619, in execute_from_command_line
    c.run(args[2:])
  File "d:\projects\docgarbanzo\donkeycar\donkeycar\management\base.py", line 516, in run
    self.plot_predictions(cfg, args.tub, args.model, args.limit, args.type)
  File "d:\projects\docgarbanzo\donkeycar\donkeycar\management\base.py", line 468, in plot_predictions
    output_names = model.output_shapes()(1).keys()
TypeError: 'tuple' object is not callable

@DocGarbanzo (Contributor, Author)

Thanks @Ezward - there was indeed an error in donkey tubplot. I have corrected this and also added a test for donkey tubplot to prevent failure in future changes.
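
For context, the traceback above shows a tuple being called instead of indexed; a plausible shape of the one-line fix (an assumption inferred from the error message, not the actual commit) is:

# output_shapes() returns a pair of dicts, so the second element must be
# indexed with [] rather than called with ()
output_names = model.output_shapes()[1].keys()   # was: model.output_shapes()(1).keys()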

…o cache_augmented_images

# Conflicts:
#	donkeycar/management/base.py
…call to open the matplotlib window. Use this argument in the test.
@Ezward (Contributor) left a comment

This looks great. Really good speed improvement. Thanks.

@DocGarbanzo merged commit a40e217 into autorope:main Nov 5, 2022
@DocGarbanzo deleted the cache_augmented_images branch November 29, 2022 21:32