
another round of perf improvements for equalize #6776

Merged: 8 commits into pytorch:main from the equalize branch, Oct 21, 2022

Conversation

@pmeier (Collaborator) commented Oct 14, 2022

This patch was mainly authored by @lezcano, with me only making minor adjustments for JIT.


With this change the kernel "natively" supports arbitrary batch sizes. Furthermore, CUDA execution is quite a bit faster, while CPU execution is unchanged within measurement tolerance:

[------------------------ equalize_image_tensor -------------------------]
                                                        |   main  |    PR 
1 threads: ---------------------------------------------------------------
      cpu  | (256, 256)          | noncontiguous=False  |    269  |    265
      cpu  | (256, 256)          | noncontiguous=True   |    270  |    270
      cpu  | (1, 256, 256)       | noncontiguous=False  |    270  |    270
      cpu  | (1, 256, 256)       | noncontiguous=True   |    270  |    275
      cpu  | (3, 256, 256)       | noncontiguous=False  |    636  |    630
      cpu  | (3, 256, 256)       | noncontiguous=True   |    646  |    645
      cpu  | (4, 3, 256, 256)    | noncontiguous=False  |   2270  |   2220
      cpu  | (4, 3, 256, 256)    | noncontiguous=True   |   2300  |   2340
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=False  |  12000  |  13700
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=True   |  12000  |  11800
      cuda | (256, 256)          | noncontiguous=False  |    230  |    169
      cuda | (256, 256)          | noncontiguous=True   |    240  |    200
      cuda | (1, 256, 256)       | noncontiguous=False  |    230  |    152
      cuda | (1, 256, 256)       | noncontiguous=True   |    230  |    153
      cuda | (3, 256, 256)       | noncontiguous=False  |    310  |    149
      cuda | (3, 256, 256)       | noncontiguous=True   |    310  |    150
      cuda | (4, 3, 256, 256)    | noncontiguous=False  |   1000  |    280
      cuda | (4, 3, 256, 256)    | noncontiguous=True   |    950  |    290
      cuda | (5, 4, 3, 256, 256) | noncontiguous=False  |   4000  |   1000
      cuda | (5, 4, 3, 256, 256) | noncontiguous=True   |   4000  |   1000
2 threads: ---------------------------------------------------------------
      cpu  | (256, 256)          | noncontiguous=False  |    279  |    266
      cpu  | (256, 256)          | noncontiguous=True   |    260  |    250
      cpu  | (1, 256, 256)       | noncontiguous=False  |    280  |    273
      cpu  | (1, 256, 256)       | noncontiguous=True   |    261  |    250
      cpu  | (3, 256, 256)       | noncontiguous=False  |    460  |    520
      cpu  | (3, 256, 256)       | noncontiguous=True   |    450  |    500
      cpu  | (4, 3, 256, 256)    | noncontiguous=False  |   1200  |   1200
      cpu  | (4, 3, 256, 256)    | noncontiguous=True   |   1300  |   1250
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=False  |   5900  |   5800
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=True   |   6000  |   5900
4 threads: ---------------------------------------------------------------
      cpu  | (256, 256)          | noncontiguous=False  |    280  |    210
      cpu  | (256, 256)          | noncontiguous=True   |    260  |    213
      cpu  | (1, 256, 256)       | noncontiguous=False  |    271  |    212
      cpu  | (1, 256, 256)       | noncontiguous=True   |    261  |    210
      cpu  | (3, 256, 256)       | noncontiguous=False  |    358  |    340
      cpu  | (3, 256, 256)       | noncontiguous=True   |    337  |    303
      cpu  | (4, 3, 256, 256)    | noncontiguous=False  |    680  |    682
      cpu  | (4, 3, 256, 256)    | noncontiguous=True   |    700  |    687
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=False  |   3000  |   3100
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=True   |   3100  |   3100

Times are in microseconds (us).
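
For intuition, here is a minimal sketch of how histogram equalization can be batched in pure PyTorch, replacing a per-channel Python loop with scatter_add_, cumsum, and gather. This is an illustration only, not the kernel from this PR; the function name and the rounding details are assumptions.

import torch

def equalize_batched_sketch(image: torch.Tensor) -> torch.Tensor:
    # Batched histogram equalization for uint8 images of shape (..., H, W).
    # Hypothetical sketch, not the PR's kernel.
    shape = image.shape
    flat = image.reshape(-1, shape[-2] * shape[-1]).to(torch.int64)  # (B, N)

    # One 256-bin histogram per row, without a Python loop over channels.
    hist = torch.zeros(flat.shape[0], 256, device=image.device)
    hist.scatter_add_(1, flat, torch.ones_like(flat, dtype=hist.dtype))

    # Classic equalize: spread the cumulative histogram over [0, 255].
    chist = hist.cumsum(dim=1)
    step = (chist[:, -1:] - hist[:, -1:]) / 255.0  # (B, 1)

    # Per-row lookup table; clamp guards against division by zero for rows
    # with a degenerate (constant-image) histogram.
    lut = ((chist - hist) / step.clamp(min=1e-8) + 0.5).floor_().clamp_(0, 255)

    # Constant rows (step == 0) pass through unchanged, as in the classic
    # algorithm.
    out = torch.where(step == 0, flat.to(lut.dtype), lut.gather(1, flat))
    return out.to(torch.uint8).reshape(shape)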
Benchmark script:

import itertools

import torch
from torch.utils import benchmark

from torchvision.prototype.transforms import functional as F

description = "PR"  # set to "main" or "PR" for the branch being benchmarked

devices = ["cpu", "cuda"]
shapes = [
    (256, 256),  # grayscale with missing channel dimension
    (1, 256, 256),  # grayscale
    (3, 256, 256),  # RGB
    (4, 3, 256, 256),  # batch of RGB images or RGB video
    (5, 4, 3, 256, 256),  # batch of RGB videos
]


timers = [
    benchmark.Timer(
        stmt="equalize_image_tensor(input)",
        globals=dict(
            equalize_image_tensor=F.equalize_image_tensor,
            input=torch.testing.make_tensor(
                shape, dtype=torch.uint8, device=device, low=0, high=256, noncontiguous=noncontiguous
            ),
        ),
        label="equalize_image_tensor",
        sub_label=f"{device!s:4} | {shape!s:19} | noncontiguous={noncontiguous}",
        description=description,
        num_threads=num_threads,
    )
    for device, shape, noncontiguous in itertools.product(devices, shapes, [False, True])
    for num_threads in ([1, 2, 4] if device == "cpu" else [1])
]

measurements = [timer.blocked_autorange(min_run_time=5) for timer in timers]
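
Presumably the table above was produced by running this script once per branch (with description set accordingly) and merging the collected measurements; measurements_main and measurements_pr below are hypothetical names for the two result lists.

# Assumed workflow: combine the `measurements` lists collected on main and
# on this PR (e.g. via pickle) and render them side by side.
compare = benchmark.Compare(measurements_main + measurements_pr)
compare.print()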

Co-authored-by: lezcano <lezcano-93@hotmail.com>
@datumbox (Contributor) left a comment


Thanks for the great improvements @lezcano and @pmeier. How thoroughly have we tested the new implementation to confirm it returns the same results as before? The code changed quite substantially from the original approach. If you are confident, I'm happy to merge; I'm just being mindful that our tests might have minor gaps at the moment, so we should be extra careful.

@pmeier (Collaborator, Author) commented Oct 20, 2022

I'm currently doing a deep dive into the implementation, adding comprehensive comments and the like. I'll ping here once I'm done and convinced everything is correct.

pmeier marked this pull request as draft on October 20, 2022 12:37
@lezcano (Contributor) commented Oct 20, 2022

I didn't run comprehensive tests, but the results do agree on matrices and batches of matrices. @vfdev-5 already reviewed it, but it's always good to have a second pair of eyes on it :)

@datumbox (Contributor) commented

@pmeier Sounds great. If possible, please compare the outputs of the two implementations for a few thousand random inputs. I'll give it another look on my side after you are done too, just to be safe.

@lezcano Understood. We should do a few more checks on our side, but I don't have any concerns at the moment. Thanks a lot for your work on this one.
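
A check along the requested lines could look roughly like this (hypothetical sketch; equalize_old and equalize_new stand in for the main-branch and PR kernels):

import torch

def compare_implementations(equalize_old, equalize_new, num_trials=5000):
    # Hypothetical harness: run both kernels on random uint8 inputs of
    # varying shapes and assert bit-exact agreement.
    shapes = [(256, 256), (1, 256, 256), (3, 256, 256), (4, 3, 256, 256)]
    for trial in range(num_trials):
        torch.manual_seed(trial)
        shape = shapes[trial % len(shapes)]
        image = torch.randint(0, 256, shape, dtype=torch.uint8)
        torch.testing.assert_close(equalize_new(image), equalize_old(image), rtol=0, atol=0)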

pmeier marked this pull request as ready for review on October 21, 2022 07:07
@pmeier (Collaborator, Author) commented Oct 21, 2022

@datumbox I've improved the reference test quite a bit so that its failures carry more information, and the new kernel passes all tests. I've also run @vfdev-5's benchmark with it a couple of times and nothing came up there, so I think we are good.
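
For context, a reference comparison in this spirit might look like the following minimal sketch, assuming the uint8 tensor kernel is meant to match PIL.ImageOps.equalize bit-exactly (an assumption, not something this PR states):

import torch
from PIL import ImageOps
from torchvision.prototype.transforms import functional as F
from torchvision.transforms.functional import pil_to_tensor, to_pil_image

# Hypothetical reference check: compare the tensor kernel against PIL on a
# random RGB image.
image = torch.randint(0, 256, (3, 256, 256), dtype=torch.uint8)
expected = pil_to_tensor(ImageOps.equalize(to_pil_image(image)))
torch.testing.assert_close(F.equalize_image_tensor(image), expected)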

Re-running the benchmark after the recent changes gives:

[------------------------ equalize_image_tensor -------------------------]
                                                        |   main  |    PR 
1 threads: ---------------------------------------------------------------
      cpu  | (256, 256)          | noncontiguous=False  |    250  |    248
      cpu  | (256, 256)          | noncontiguous=True   |    260  |    270
      cpu  | (1, 256, 256)       | noncontiguous=False  |    249  |    270
      cpu  | (1, 256, 256)       | noncontiguous=True   |    260  |    280
      cpu  | (3, 256, 256)       | noncontiguous=False  |    605  |    625
      cpu  | (3, 256, 256)       | noncontiguous=True   |    613  |    640
      cpu  | (4, 3, 256, 256)    | noncontiguous=False  |   2270  |   2220
      cpu  | (4, 3, 256, 256)    | noncontiguous=True   |   2270  |   2330
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=False  |  13000  |  11400
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=True   |  13000  |  12000
      cuda | (256, 256)          | noncontiguous=False  |    220  |    159
      cuda | (256, 256)          | noncontiguous=True   |    219  |    160
      cuda | (1, 256, 256)       | noncontiguous=False  |    220  |    162
      cuda | (1, 256, 256)       | noncontiguous=True   |    218  |    160
      cuda | (3, 256, 256)       | noncontiguous=False  |    298  |    150
      cuda | (3, 256, 256)       | noncontiguous=True   |    302  |    160
      cuda | (4, 3, 256, 256)    | noncontiguous=False  |    901  |    271
      cuda | (4, 3, 256, 256)    | noncontiguous=True   |    918  |    280
      cuda | (5, 4, 3, 256, 256) | noncontiguous=False  |   3960  |    999
      cuda | (5, 4, 3, 256, 256) | noncontiguous=True   |   3950  |   1020
2 threads: ---------------------------------------------------------------
      cpu  | (256, 256)          | noncontiguous=False  |    260  |    262
      cpu  | (256, 256)          | noncontiguous=True   |    240  |    250
      cpu  | (1, 256, 256)       | noncontiguous=False  |    259  |    260
      cpu  | (1, 256, 256)       | noncontiguous=True   |    260  |    270
      cpu  | (3, 256, 256)       | noncontiguous=False  |    510  |    457
      cpu  | (3, 256, 256)       | noncontiguous=True   |    507  |    450
      cpu  | (4, 3, 256, 256)    | noncontiguous=False  |   1220  |   1220
      cpu  | (4, 3, 256, 256)    | noncontiguous=True   |   1260  |   1240
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=False  |   8250  |   5710
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=True   |   7000  |   5900
4 threads: ---------------------------------------------------------------
      cpu  | (256, 256)          | noncontiguous=False  |    270  |    263
      cpu  | (256, 256)          | noncontiguous=True   |    248  |    208
      cpu  | (1, 256, 256)       | noncontiguous=False  |    265  |    227
      cpu  | (1, 256, 256)       | noncontiguous=True   |    255  |    210
      cpu  | (3, 256, 256)       | noncontiguous=False  |    370  |    307
      cpu  | (3, 256, 256)       | noncontiguous=True   |    327  |    290
      cpu  | (4, 3, 256, 256)    | noncontiguous=False  |    670  |    668
      cpu  | (4, 3, 256, 256)    | noncontiguous=True   |    700  |    684
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=False  |   5100  |   3000
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=True   |   4100  |   3100

Times are in microseconds (us).

pmeier requested review from vfdev-5 and datumbox on October 21, 2022 07:25
@lezcano (Contributor) left a comment


I left a few comments in case they are helpful, but overall it LGTM.

Three review comments on torchvision/prototype/transforms/functional/_color.py (all outdated and resolved).
@datumbox (Contributor) left a comment


LGTM. Let's address the review comments; otherwise we are good to go.

pmeier merged commit c041798 into pytorch:main on Oct 21, 2022
pmeier deleted the equalize branch on October 21, 2022 11:15
facebook-github-bot pushed a commit that referenced this pull request Oct 21, 2022
Summary:
* perf improvements for equalize

* improve reference tests

* add extensive comments and minor fixes to the kernel

* improve comments

Reviewed By: YosuaMichael

Differential Revision: D40588160

fbshipit-source-id: ffe05fa6aa188a3d2dfe98f4367cb1d81abe1e47

Co-authored-by: lezcano <lezcano-93@hotmail.com>