[prototype] Speed up Augment Transform Classes #6835

Conversation
if isinstance(value, (int, float)):
    self.value = [value]
elif isinstance(value, str):
    self.value = None
elif isinstance(value, tuple):
    self.value = list(value)
else:
    self.value = value
Moving `value` handling from run-time to the constructor. This is more of a code-quality change than a performance one.
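A minimal sketch of the idea (hypothetical class, not the actual prototype code): the isinstance checks run once in __init__, and the per-call path only reads the pre-normalized attribute.

import torch


class ErasingSketch:
    # Hypothetical minimal class; the real transform is torchvision's prototype RandomErasing.
    def __init__(self, value=0.0):
        # Normalize `value` once at construction instead of on every forward call.
        if isinstance(value, (int, float)):
            self.value = [float(value)]
        elif isinstance(value, str):
            self.value = None  # e.g. "random": sample the fill at call time
        elif isinstance(value, tuple):
            self.value = list(value)
        else:
            self.value = value

    def erase(self, img, i, j, h, w):
        # Hot path: no isinstance checks, just read the pre-normalized attribute.
        if self.value is None:
            patch = torch.empty(img.shape[0], h, w, dtype=img.dtype).normal_()
        else:
            patch = img.new_tensor(self.value).view(-1, 1, 1).expand(-1, h, w)
        img[..., i : i + h, j : j + w] = patch
        return img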
    raise ValueError("Need a batch of one hot labels")
-   output = inpt.clone()
-   output = output.roll(1, 0).mul_(1.0 - lam).add_(output.mul_(lam))
+   output = inpt.roll(1, 0).mul_(1.0 - lam).add_(inpt.mul(lam))
As we saw in previous optimizations (for example #6821), removing the clone in favour of moving the copy to the first multiplication has speed benefits. Below I benchmark the following two alternatives for various inputs (images and labels) to showcase this:

def withclone(inpt, lam=0.5):
    output = inpt.clone()
    return output.roll(1, 0).mul_(1.0 - lam).add_(output.mul_(lam))

def withoutclone(inpt, lam=0.5):
    return inpt.roll(1, 0).mul_(1.0 - lam).add_(inpt.mul(lam))

Benchmarking the two methods in isolation:
[-------------- Mixing cpu torch.float32 --------------]
| withclone | withoutclone
1 threads: ---------------------------------------------
(16, 3, 400, 400) | 17200 | 15100
(16, 1000) | 22 | 21
6 threads: ---------------------------------------------
(16, 3, 400, 400) | 17700 | 15400
(16, 1000) | 23 | 20
Times are in microseconds (us).
[------------- Mixing cuda torch.float32 --------------]
| withclone | withoutclone
1 threads: ---------------------------------------------
(16, 3, 400, 400) | 275.3 | 226
(16, 1000) | 32.8 | 26
6 threads: ---------------------------------------------
(16, 3, 400, 400) | 275.9 | 226
(16, 1000) | 32.6 | 26
Times are in microseconds (us).
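For reference, a minimal sketch of how numbers like the above can be collected with torch.utils.benchmark (the shapes and thread counts mirror the tables; min_run_time is an assumption, this is not the exact script used for them):

import torch
from torch.utils import benchmark


def withclone(inpt, lam=0.5):
    output = inpt.clone()
    return output.roll(1, 0).mul_(1.0 - lam).add_(output.mul_(lam))


def withoutclone(inpt, lam=0.5):
    return inpt.roll(1, 0).mul_(1.0 - lam).add_(inpt.mul(lam))


results = []
for shape in [(16, 3, 400, 400), (16, 1000)]:  # batch of images / batch of labels
    inpt = torch.rand(shape)
    for num_threads in (1, 6):
        for fn in (withclone, withoutclone):
            results.append(
                benchmark.Timer(
                    stmt="fn(inpt)",
                    globals={"fn": fn, "inpt": inpt},
                    num_threads=num_threads,
                    sub_label=str(shape),
                    description=fn.__name__,
                ).blocked_autorange(min_run_time=5)
            )
benchmark.Compare(results).print()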
Because it took me a while to figure out why the clone is needed in the first place: without it, `inpt.roll(1, 0).mul_(1.0 - lam).add_(inpt.mul_(lam))` would modify `inpt` in place due to `inpt.mul_(lam)`. This is the in-place operation @datumbox is talking about above. `inpt.roll` always returns a copy, so we can use in-place operations afterwards.
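A standalone sketch (not part of the PR) that demonstrates the aliasing:

import torch

inpt = torch.ones(4, 3)

rolled = inpt.roll(1, 0)  # roll always materializes a new tensor ...
assert rolled.data_ptr() != inpt.data_ptr()  # ... so in-place ops on it are safe

inpt.mul_(0.5)  # in-place: the caller's tensor is mutated
assert inpt[0, 0].item() == 0.5  # which is why the out-of-place inpt.mul(lam) is used instead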
+   inverse_paste_alpha_mask = ~paste_alpha_mask
    # Copy-paste images:
-   image = (image * (~paste_alpha_mask)) + (paste_image * paste_alpha_mask)
+   image = (image * inverse_paste_alpha_mask).add_(paste_image * paste_alpha_mask)
Simply cache `~paste_alpha_mask` to avoid repeated computation (see the code below this point) and use in-place addition on the intermediate results. I don't have benchmarks for this because I thought it was quite uncontroversial.
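Sketched out with made-up shapes (a standalone illustration; the masks line is the later use the comment refers to):

import torch

image = torch.rand(3, 64, 64)
paste_image = torch.rand(3, 64, 64)
paste_alpha_mask = torch.rand(64, 64) > 0.5
masks = torch.rand(4, 64, 64) > 0.5

# Before: the bitwise inversion is computed for every use.
out_a = (image * (~paste_alpha_mask)) + (paste_image * paste_alpha_mask)
masks_a = masks * (~paste_alpha_mask)

# After: invert once, reuse, and accumulate in-place on the intermediate result.
inverse_paste_alpha_mask = ~paste_alpha_mask
out_b = (image * inverse_paste_alpha_mask).add_(paste_image * paste_alpha_mask)
masks_b = masks * inverse_paste_alpha_mask

assert torch.equal(out_a, out_b) and torch.equal(masks_a, masks_b)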
Nit: I would prefer not to mix the regular operators like `*` and `+` with the fluent interface like `.add_` and `.mul_`. Since the regular operators don't support in-place operations on the right-hand side of a statement, I would prefer using the fluent style whenever in-place operations are present:

image = image.mul(inverse_paste_alpha_mask).add_(paste_image.mul(paste_alpha_mask))
Non-blocking: we could perform in-place operations on `paste_image` in the case where we needed to resize it above. Otherwise it is one of the inputs, and thus we need out-of-place operations, as done here.
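A hedged sketch of that idea (the resize step and the needs_resize flag are illustrative assumptions, using the stable torchvision.transforms.functional namespace; the same pattern shows up as size1_ne_size2 in the proposal further down the thread):

import torch
from torchvision.transforms import functional as F

image = torch.rand(3, 64, 64)
paste_image = torch.rand(3, 32, 32)  # caller's input
paste_alpha_mask = torch.rand(64, 64) > 0.5

needs_resize = paste_image.shape[-2:] != image.shape[-2:]
if needs_resize:
    # resize returns a fresh tensor, so mutating it cannot leak back to the caller
    paste_image = F.resize(paste_image, list(image.shape[-2:]))
    tmp = paste_image.mul_(paste_alpha_mask)
else:
    # paste_image still aliases the caller's input: stay out-of-place
    tmp = paste_image.mul(paste_alpha_mask)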
LGTM if CI is green. Thanks Vasilis!
    if blending:
        paste_alpha_mask = F.gaussian_blur(paste_alpha_mask.unsqueeze(0), kernel_size=[5, 5], sigma=[2.0])

+   inverse_paste_alpha_mask = ~paste_alpha_mask
I just thought of another "evil hack" that avoids the out-of-place inversion here. There are two places where `paste_alpha_mask` or its inverse is used:

image = (image * inverse_paste_alpha_mask).add_(paste_image * paste_alpha_mask)
masks = masks * inverse_paste_alpha_mask

By reordering the first computation to

image = (paste_image * paste_alpha_mask).add_(image * inverse_paste_alpha_mask)

we no longer need `paste_alpha_mask` after its first use, so we can invert it in place:

image = (paste_image * paste_alpha_mask).add_(image * paste_alpha_mask.logical_not_())

That turns the second statement into

masks = masks * paste_alpha_mask

Although that looks wrong, it is actually correct, since `paste_alpha_mask` at this point contains its inverse.

Going for this needs at least an extensive comment, since it is not obvious. Not sure if it is worth it.
@pmeier I'm not 100% sure, but I sense that this will end up having the same number of operations as before. You indeed do an in-place operation, but at the end you still need to invert the bool and write the result back to memory. I think what you propose will lead to a smaller memory footprint, though, so it's worth trying. Do you want to bring it as an optimization in a PR after this one? You could refactor it a bit to split this into multiple lines and avoid mixing the non-in-place op with the in-place op on `paste_alpha_mask`, to make things clearer to the reader (that's debatable, maybe it's clear already?). You could measure only those two lines in the new PR and showcase whether the proposal is faster. WDYT?
You are right:

def current(*, image, masks, paste_image, paste_alpha_mask, size1_ne_size2):
    inverse_paste_alpha_mask = paste_alpha_mask.logical_not()
    # Copy-paste images:
    image = image.mul(inverse_paste_alpha_mask).add_(paste_image.mul(paste_alpha_mask))
    # Copy-paste masks:
    masks = masks * inverse_paste_alpha_mask


def proposal(*, image, masks, paste_image, paste_alpha_mask, size1_ne_size2):
    if size1_ne_size2:
        tmp = paste_image.mul_(paste_alpha_mask)
    else:
        tmp = paste_image.mul(paste_alpha_mask)
    inverse_paste_alpha_mask = paste_alpha_mask.logical_not_()
    # Copy-paste images:
    image = tmp.add_(image.mul(inverse_paste_alpha_mask))
    # Copy-paste masks:
    masks = masks.mul(inverse_paste_alpha_mask)
Benchmark script

import itertools

import torch
from torch.utils import benchmark

shapes = [
    (3, 512, 512),  # single image
    (5, 3, 512, 512),  # single video
]
dtypes = [torch.uint8, torch.float32]
devices = ["cpu", "cuda"]


def current(*, image, masks, paste_image, paste_alpha_mask, size1_ne_size2):
    inverse_paste_alpha_mask = paste_alpha_mask.logical_not()
    # Copy-paste images:
    image = image.mul(inverse_paste_alpha_mask).add_(paste_image.mul(paste_alpha_mask))
    # Copy-paste masks:
    masks = masks * inverse_paste_alpha_mask


def proposal(*, image, masks, paste_image, paste_alpha_mask, size1_ne_size2):
    if size1_ne_size2:
        tmp = paste_image.mul_(paste_alpha_mask)
    else:
        tmp = paste_image.mul(paste_alpha_mask)
    inverse_paste_alpha_mask = paste_alpha_mask.logical_not_()
    # Copy-paste images:
    image = tmp.add_(image.mul(inverse_paste_alpha_mask))
    # Copy-paste masks:
    masks = masks.mul(inverse_paste_alpha_mask)


timers = [
    benchmark.Timer(
        stmt="fn(image=image, masks=masks, paste_image=paste_image, paste_alpha_mask=paste_alpha_mask, size1_ne_size2=size1_ne_size2)",
        globals=dict(
            fn=fn,
            image=torch.testing.make_tensor(
                shape,
                dtype=dtype,
                device=device,
                low=0,
                high=1 if dtype.is_floating_point else torch.iinfo(dtype).max,
            ),
            masks=torch.testing.make_tensor(
                (4, *shape),
                dtype=torch.uint8,
                device=device,
                low=0,
                high=2,
            ),
            paste_image=torch.testing.make_tensor(
                shape,
                dtype=dtype,
                device=device,
                low=0,
                high=1 if dtype.is_floating_point else torch.iinfo(dtype).max,
            ),
            paste_alpha_mask=torch.testing.make_tensor(
                shape,
                dtype=torch.bool,
                device=device,
                low=0,
                high=2,
            ),
            size1_ne_size2=size1_ne_size2,
        ),
        sub_label=f"{shape!s:16} / {str(dtype).replace('torch.', ''):7} / {device:4} / size1_ne_size2={size1_ne_size2}",
        description=fn.__name__,
    )
    for fn, shape, dtype, device, size1_ne_size2 in itertools.product(
        [
            current,
            proposal,
        ],
        shapes,
        dtypes,
        devices,
        [False, True],
    )
]

measurements = [timer.blocked_autorange(min_run_time=10) for timer in timers]
comparison = benchmark.Compare(measurements)
comparison.print()
| current | proposal
1 threads: ----------------------------------------------------------------------------
(3, 512, 512) / uint8 / cpu / size1_ne_size2=False | 1964.5 | 2155.9
(3, 512, 512) / uint8 / cpu / size1_ne_size2=True | 1956.6 | 2169.5
(3, 512, 512) / uint8 / cuda / size1_ne_size2=False | 142.8 | 140.8
(3, 512, 512) / uint8 / cuda / size1_ne_size2=True | 143.4 | 139.6
(3, 512, 512) / float32 / cpu / size1_ne_size2=False | 2053.3 | 2270.9
(3, 512, 512) / float32 / cpu / size1_ne_size2=True | 2052.9 | 2217.7
(3, 512, 512) / float32 / cuda / size1_ne_size2=False | 196.9 | 197.0
(3, 512, 512) / float32 / cuda / size1_ne_size2=True | 196.9 | 196.7
(5, 3, 512, 512) / uint8 / cpu / size1_ne_size2=False | 10575.0 | 10932.7
(5, 3, 512, 512) / uint8 / cpu / size1_ne_size2=True | 10541.8 | 10887.1
(5, 3, 512, 512) / uint8 / cuda / size1_ne_size2=False | 695.4 | 696.4
(5, 3, 512, 512) / uint8 / cuda / size1_ne_size2=True | 695.2 | 697.6
(5, 3, 512, 512) / float32 / cpu / size1_ne_size2=False | 14170.3 | 14315.7
(5, 3, 512, 512) / float32 / cpu / size1_ne_size2=True | 14110.5 | 13735.6
(5, 3, 512, 512) / float32 / cuda / size1_ne_size2=False | 946.1 | 949.2
(5, 3, 512, 512) / float32 / cuda / size1_ne_size2=True | 948.9 | 951.2
Times are in microseconds (us).
- It makes no difference on CUDA.
- There is a slowdown on CPU.
- Only for the combination (5, 3, 512, 512) / float32 / cpu / size1_ne_size2=True is there a speed-up, of ~400µs. It is likely that the speed-up comes from the size1_ne_size2=True condition that I suggested in [prototype] Speed up Augment Transform Classes #6835 (comment).

I would leave it as is.
Summary:
- Moving value estimation of `RandomErasing` from runtime to constructor.
- Speed up mixing on MixUp/Cutmix and small optimization on SimpleCopyPaste.
- Apply nits.

Reviewed By: YosuaMichael

Differential Revision: D40722897

fbshipit-source-id: feb2a56ec9d0a80991e5a2acaff4846d82168639
This PR does a few optimizations on the `_augment.py` Transform classes.

cc @vfdev-5 @bjuncek @pmeier