
[prototype] Speed improvement for normalize op #6821

Merged: 7 commits merged into pytorch:main on Oct 24, 2022

Conversation

datumbox (Contributor) commented on Oct 24, 2022

This PR:

  • Avoids the (std == 0).any() idiom, which forced a synchronization between CPU and GPU (50% improvement on CUDA)
  • Minimizes memory writes by avoiding a clone of the input (25% improvement on CPU); both changes are sketched below
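
A minimal sketch of both optimizations, assuming a standalone helper (the name normalize_v2 is illustrative, not the actual PR code):

import torch
from typing import List


def normalize_v2(image: torch.Tensor, mean: List[float], std: List[float]) -> torch.Tensor:
    # Validate std on the Python list itself. The old idiom,
    # (torch.as_tensor(std) == 0).any(), materializes a tensor and forces
    # a device-to-host synchronization when the image lives on a GPU.
    if not all(std):
        raise ValueError("std contains a zero, leading to division by zero.")

    mean_t = torch.as_tensor(mean, dtype=image.dtype, device=image.device).view(-1, 1, 1)
    std_t = torch.as_tensor(std, dtype=image.dtype, device=image.device).view(-1, 1, 1)

    # Out-of-place sub/div allocates the output once; the old path cloned
    # the input first and then overwrote it with in-place ops.
    return image.sub(mean_t).div(std_t)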

The combined result is:

Float mean/std:
[------------- Normalize cpu torch.float32 -------------]
                     |  normalize stable  |  normalize v2
1 threads: ----------------------------------------------
      (3, 400, 400)  |        364         |      264     
6 threads: ----------------------------------------------
      (3, 400, 400)  |        497         |      351     

Times are in microseconds (us).

[------------- Normalize cuda torch.float32 ------------]
                     |  normalize stable  |  normalize v2
1 threads: ----------------------------------------------
      (3, 400, 400)  |        118         |      55.6    
6 threads: ----------------------------------------------
      (3, 400, 400)  |        118         |      55.6    

Times are in microseconds (us).


List mean/std:
[------------- Normalize cpu torch.float32 -------------]
                     |  normalize stable  |  normalize v2
1 threads: ----------------------------------------------
      (3, 400, 400)  |        378         |      271     
6 threads: ----------------------------------------------
      (3, 400, 400)  |        513         |      360     

Times are in microseconds (us).

[------------- Normalize cuda torch.float32 ------------]
                     |  normalize stable  |  normalize v2
1 threads: ----------------------------------------------
      (3, 400, 400)  |        116         |      61.6    
6 threads: ----------------------------------------------
      (3, 400, 400)  |        116         |      61.6    

Times are in microseconds (us).

Measured with a modified version of the benchmark script from here.

cc @vfdev-5 @bjuncek @pmeier

pmeier (Collaborator) left a comment:

I looked at this earlier and saw no possible optimizations. It seems you have better eyes 😛

LGTM, if CI is green.

Review thread on torchvision/prototype/transforms/functional/_misc.py (outdated):
f"Expected tensor to be a tensor image of size (..., C, H, W). Got tensor.size() = {image.size()}"
)

if (isinstance(std, (tuple, list)) and not all(std)) or std == 0:
pmeier (Collaborator):

Do we need the first part of the check? What input would fail isinstance(std, (tuple, list))? Do we actually allow scalars here? Otherwise, this should be sufficient:

Suggested change:
-if (isinstance(std, (tuple, list)) and not all(std)) or std == 0:
+if not all(std):

datumbox (Author):
We actually allow scalars. It's not visible in the JIT-script type annotations, but if you pass mean=0.5, std=0.5 it works. So I'm keeping this for BC and will provide separate benchmarks.
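
For illustration, the scalar call path in eager mode (the import path below is an assumption based on the prototype namespace at the time and may differ):

import torch
from torchvision.prototype.transforms.functional import normalize_image_tensor  # assumed path

img = torch.rand(3, 400, 400)
# The TorchScript annotations only advertise List[float], but in eager
# mode scalar mean/std broadcast fine:
out = normalize_image_tensor(img, mean=0.5, std=0.5)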

pmeier (Collaborator):

Ugh 🙄 We need to update the tests since they currently don't check scalars:

_NORMALIZE_MEANS_STDS = [
    ((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0]),
]


def sample_inputs_normalize_image_tensor():
    for image_loader, (mean, std) in itertools.product(
        make_image_loaders(sizes=["random"], color_spaces=[features.ColorSpace.RGB], dtypes=[torch.float32]),
        _NORMALIZE_MEANS_STDS,
    ):
        yield ArgsKwargs(image_loader, mean=mean, std=std)

Will send a PR.
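
A hypothetical extension of the parametrization above to cover the scalar path could look like:

_NORMALIZE_MEANS_STDS = [
    ((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0]),
    (0.5, 0.5),  # hypothetical scalar case, exercising the BC path
]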

datumbox (Author):

Good catch. I also had to rewrite the check because JIT couldn't verify the types when the assertions were combined in one line. This version passes. I've updated the benchmarks and we are still good.
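
A sketch of a check that TorchScript can follow; splitting the branches lets the scripter narrow the type of std in each one (an illustration, not necessarily the merged code):

# Each branch handles one statically-known type of `std`.
if isinstance(std, (tuple, list)):
    divzero = not all(std)
elif isinstance(std, (int, float)):
    divzero = std == 0
else:
    divzero = False
if divzero:
    raise ValueError("std evaluated to zero, leading to division by zero.")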

Review thread on torchvision/prototype/transforms/functional/_misc.py, comment on lines +33 to +36 (outdated):
if mean.ndim == 1:
    mean = mean.view(-1, 1, 1)
if std.ndim == 1:
    std = std.view(-1, 1, 1)
pmeier (Collaborator):
I was also looking into this earlier, and one thing I asked myself is: when would this branch not trigger? The tensor should always have one dimension unless we allow scalars. See above for that.

datumbox (Author):
I think this is purely for broadcasting in case someone passes lists, not scalars, e.g. [0.5, 0.5, 0.5]. Without the reshape, the following sub/div fails; see the sketch below.
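
A quick demonstration of why the reshape is needed for 1-D mean/std:

import torch

img = torch.rand(3, 400, 400)
std = torch.tensor([0.229, 0.224, 0.225])  # shape (3,)

# img / std would broadcast (3, 400, 400) against (3,): the trailing
# dimensions 400 and 3 don't match, so it raises a RuntimeError.
out = img / std.view(-1, 1, 1)  # (3, 1, 1) broadcasts over H and W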

@datumbox datumbox merged commit 788ad12 into pytorch:main Oct 24, 2022
@datumbox datumbox deleted the prototype/normalize branch October 24, 2022 14:01
facebook-github-bot pushed a commit that referenced this pull request Oct 27, 2022
Summary:
* Avoid GPU-CPU sync on Normalize

* Further optimizations.

* Apply code review changes.

* Fixing JIT.

* linter fix

Reviewed By: YosuaMichael

Differential Revision: D40722904

fbshipit-source-id: e452d89a42b34be852e3125d25756b3f598e50f4