
another round of perf improvements for equalize #6776

Merged: 8 commits into pytorch:main from the equalize branch, Oct 21, 2022

Conversation

@pmeier (Collaborator) commented Oct 14, 2022

This patch was mainly authored by @lezcano, with me only making minor adjustments for JIT.


With this change the kernel "natively" supports arbitrary batch sizes. Furthermore, CUDA execution is quite a bit faster, while CPU execution is unchanged within measurement tolerance:

[------------------------ equalize_image_tensor -------------------------]
                                                        |   main  |    PR 
1 threads: ---------------------------------------------------------------
      cpu  | (256, 256)          | noncontiguous=False  |    269  |    265
      cpu  | (256, 256)          | noncontiguous=True   |    270  |    270
      cpu  | (1, 256, 256)       | noncontiguous=False  |    270  |    270
      cpu  | (1, 256, 256)       | noncontiguous=True   |    270  |    275
      cpu  | (3, 256, 256)       | noncontiguous=False  |    636  |    630
      cpu  | (3, 256, 256)       | noncontiguous=True   |    646  |    645
      cpu  | (4, 3, 256, 256)    | noncontiguous=False  |   2270  |   2220
      cpu  | (4, 3, 256, 256)    | noncontiguous=True   |   2300  |   2340
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=False  |  12000  |  13700
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=True   |  12000  |  11800
      cuda | (256, 256)          | noncontiguous=False  |    230  |    169
      cuda | (256, 256)          | noncontiguous=True   |    240  |    200
      cuda | (1, 256, 256)       | noncontiguous=False  |    230  |    152
      cuda | (1, 256, 256)       | noncontiguous=True   |    230  |    153
      cuda | (3, 256, 256)       | noncontiguous=False  |    310  |    149
      cuda | (3, 256, 256)       | noncontiguous=True   |    310  |    150
      cuda | (4, 3, 256, 256)    | noncontiguous=False  |   1000  |    280
      cuda | (4, 3, 256, 256)    | noncontiguous=True   |    950  |    290
      cuda | (5, 4, 3, 256, 256) | noncontiguous=False  |   4000  |   1000
      cuda | (5, 4, 3, 256, 256) | noncontiguous=True   |   4000  |   1000
2 threads: ---------------------------------------------------------------
      cpu  | (256, 256)          | noncontiguous=False  |    279  |    266
      cpu  | (256, 256)          | noncontiguous=True   |    260  |    250
      cpu  | (1, 256, 256)       | noncontiguous=False  |    280  |    273
      cpu  | (1, 256, 256)       | noncontiguous=True   |    261  |    250
      cpu  | (3, 256, 256)       | noncontiguous=False  |    460  |    520
      cpu  | (3, 256, 256)       | noncontiguous=True   |    450  |    500
      cpu  | (4, 3, 256, 256)    | noncontiguous=False  |   1200  |   1200
      cpu  | (4, 3, 256, 256)    | noncontiguous=True   |   1300  |   1250
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=False  |   5900  |   5800
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=True   |   6000  |   5900
4 threads: ---------------------------------------------------------------
      cpu  | (256, 256)          | noncontiguous=False  |    280  |    210
      cpu  | (256, 256)          | noncontiguous=True   |    260  |    213
      cpu  | (1, 256, 256)       | noncontiguous=False  |    271  |    212
      cpu  | (1, 256, 256)       | noncontiguous=True   |    261  |    210
      cpu  | (3, 256, 256)       | noncontiguous=False  |    358  |    340
      cpu  | (3, 256, 256)       | noncontiguous=True   |    337  |    303
      cpu  | (4, 3, 256, 256)    | noncontiguous=False  |    680  |    682
      cpu  | (4, 3, 256, 256)    | noncontiguous=True   |    700  |    687
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=False  |   3000  |   3100
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=True   |   3100  |   3100

Times are in microseconds (us).
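
For intuition, here is a minimal sketch of how histogram equalization can be batched in pure PyTorch, replacing a per-channel Python loop with scatter_add_, cumsum, and gather. This is an illustration only, not the kernel from this PR; the function name and the rounding details are assumptions.

import torch

def equalize_batched_sketch(image: torch.Tensor) -> torch.Tensor:
    # Batched histogram equalization for uint8 images of shape (..., H, W).
    # Hypothetical sketch, not the PR's kernel.
    shape = image.shape
    flat = image.reshape(-1, shape[-2] * shape[-1]).to(torch.int64)  # (B, N)

    # One 256-bin histogram per row, without a Python loop over channels.
    hist = torch.zeros(flat.shape[0], 256, device=image.device)
    hist.scatter_add_(1, flat, torch.ones_like(flat, dtype=hist.dtype))

    # Classic equalize: spread the cumulative histogram over [0, 255].
    chist = hist.cumsum(dim=1)
    step = (chist[:, -1:] - hist[:, -1:]) / 255.0  # (B, 1)

    # Per-row lookup table; clamp guards against division by zero for rows
    # with a degenerate (constant-image) histogram.
    lut = ((chist - hist) / step.clamp(min=1e-8) + 0.5).floor_().clamp_(0, 255)

    # Constant rows (step == 0) pass through unchanged, as in the classic
    # algorithm.
    out = torch.where(step == 0, flat.to(lut.dtype), lut.gather(1, flat))
    return out.to(torch.uint8).reshape(shape)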
Benchmark script:

import itertools

import torch
from torch.utils import benchmark

from torchvision.prototype.transforms import functional as F

description = "PR"  # set to "main" or "PR" for the branch being benchmarked

devices = ["cpu", "cuda"]
shapes = [
    (256, 256),  # grayscale with missing channel dimension
    (1, 256, 256),  # grayscale
    (3, 256, 256),  # RGB
    (4, 3, 256, 256),  # batch of RGB images or RGB video
    (5, 4, 3, 256, 256),  # batch of RGB videos
]


timers = [
    benchmark.Timer(
        stmt="equalize_image_tensor(input)",
        globals=dict(
            equalize_image_tensor=F.equalize_image_tensor,
            input=torch.testing.make_tensor(
                shape, dtype=torch.uint8, device=device, low=0, high=256, noncontiguous=noncontiguous
            ),
        ),
        label="equalize_image_tensor",
        sub_label=f"{device!s:4} | {shape!s:19} | noncontiguous={noncontiguous}",
        description=description,
        num_threads=num_threads,
    )
    for device, shape, noncontiguous in itertools.product(devices, shapes, [False, True])
    for num_threads in ([1, 2, 4] if device == "cpu" else [1])
]

measurements = [timer.blocked_autorange(min_run_time=5) for timer in timers]
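
Presumably the table above was produced by running this script once per branch (with description set accordingly) and merging the collected measurements; measurements_main and measurements_pr below are hypothetical names for the two result lists.

# Assumed workflow: combine the `measurements` lists collected on main and
# on this PR (e.g. via pickle) and render them side by side.
compare = benchmark.Compare(measurements_main + measurements_pr)
compare.print()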

Co-authored-by: lezcano <lezcano-93@hotmail.com>
@datumbox (Contributor) left a comment


Thanks for the great improvements @lezcano and @pmeier. How thoroughly have we tested the new implementation to confirm it returns the same results as before? The code changed quite substantially from the original approach. If you are confident, I'm happy to merge; I'm just being mindful that our tests might have minor gaps at the moment, so we should be extra careful.

@pmeier (Collaborator, Author) commented Oct 20, 2022

I'm currently doing a deep dive into the implementation, adding comprehensive comments and the like. I'll ping here once I'm done and convinced everything is correct.

pmeier marked this pull request as draft on October 20, 2022 12:37
@lezcano (Contributor) commented Oct 20, 2022

I didn't run comprehensive tests, but the results do agree on matrices and batches of matrices. @vfdev-5 already reviewed it, but it's always good to have a second pair of eyes on it :)

@datumbox (Contributor) commented

@pmeier Sounds great. If possible, please compare the outputs of the two implementations for a few thousand random inputs. I'll give it another look on my side after you are done too, just to be safe.

@lezcano Understood. We should do a few more checks on our side, but I don't have any concerns at the moment. Thanks a lot for your work on this one.
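
A check along the requested lines could look roughly like this (hypothetical sketch; equalize_old and equalize_new stand in for the main-branch and PR kernels):

import torch

def compare_implementations(equalize_old, equalize_new, num_trials=5000):
    # Hypothetical harness: run both kernels on random uint8 inputs of
    # varying shapes and assert bit-exact agreement.
    shapes = [(256, 256), (1, 256, 256), (3, 256, 256), (4, 3, 256, 256)]
    for trial in range(num_trials):
        torch.manual_seed(trial)
        shape = shapes[trial % len(shapes)]
        image = torch.randint(0, 256, shape, dtype=torch.uint8)
        torch.testing.assert_close(equalize_new(image), equalize_old(image), rtol=0, atol=0)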

pmeier marked this pull request as ready for review on October 21, 2022 07:07
@pmeier (Collaborator, Author) commented Oct 21, 2022

@datumbox I've improved the reference test quite a bit so that its failures carry more information, and the new kernel passes all tests. I've also run @vfdev-5's benchmark with it a couple of times and nothing came up there, so I think we are good.
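
For context, a reference comparison in this spirit might look like the following minimal sketch, assuming the uint8 tensor kernel is meant to match PIL.ImageOps.equalize bit-exactly (an assumption, not something this PR states):

import torch
from PIL import ImageOps
from torchvision.prototype.transforms import functional as F
from torchvision.transforms.functional import pil_to_tensor, to_pil_image

# Hypothetical reference check: compare the tensor kernel against PIL on a
# random RGB image.
image = torch.randint(0, 256, (3, 256, 256), dtype=torch.uint8)
expected = pil_to_tensor(ImageOps.equalize(to_pil_image(image)))
torch.testing.assert_close(F.equalize_image_tensor(image), expected)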

Re-running the benchmark after the recent changes gives:

[------------------------ equalize_image_tensor -------------------------]
                                                        |   main  |    PR 
1 threads: ---------------------------------------------------------------
      cpu  | (256, 256)          | noncontiguous=False  |    250  |    248
      cpu  | (256, 256)          | noncontiguous=True   |    260  |    270
      cpu  | (1, 256, 256)       | noncontiguous=False  |    249  |    270
      cpu  | (1, 256, 256)       | noncontiguous=True   |    260  |    280
      cpu  | (3, 256, 256)       | noncontiguous=False  |    605  |    625
      cpu  | (3, 256, 256)       | noncontiguous=True   |    613  |    640
      cpu  | (4, 3, 256, 256)    | noncontiguous=False  |   2270  |   2220
      cpu  | (4, 3, 256, 256)    | noncontiguous=True   |   2270  |   2330
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=False  |  13000  |  11400
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=True   |  13000  |  12000
      cuda | (256, 256)          | noncontiguous=False  |    220  |    159
      cuda | (256, 256)          | noncontiguous=True   |    219  |    160
      cuda | (1, 256, 256)       | noncontiguous=False  |    220  |    162
      cuda | (1, 256, 256)       | noncontiguous=True   |    218  |    160
      cuda | (3, 256, 256)       | noncontiguous=False  |    298  |    150
      cuda | (3, 256, 256)       | noncontiguous=True   |    302  |    160
      cuda | (4, 3, 256, 256)    | noncontiguous=False  |    901  |    271
      cuda | (4, 3, 256, 256)    | noncontiguous=True   |    918  |    280
      cuda | (5, 4, 3, 256, 256) | noncontiguous=False  |   3960  |    999
      cuda | (5, 4, 3, 256, 256) | noncontiguous=True   |   3950  |   1020
2 threads: ---------------------------------------------------------------
      cpu  | (256, 256)          | noncontiguous=False  |    260  |    262
      cpu  | (256, 256)          | noncontiguous=True   |    240  |    250
      cpu  | (1, 256, 256)       | noncontiguous=False  |    259  |    260
      cpu  | (1, 256, 256)       | noncontiguous=True   |    260  |    270
      cpu  | (3, 256, 256)       | noncontiguous=False  |    510  |    457
      cpu  | (3, 256, 256)       | noncontiguous=True   |    507  |    450
      cpu  | (4, 3, 256, 256)    | noncontiguous=False  |   1220  |   1220
      cpu  | (4, 3, 256, 256)    | noncontiguous=True   |   1260  |   1240
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=False  |   8250  |   5710
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=True   |   7000  |   5900
4 threads: ---------------------------------------------------------------
      cpu  | (256, 256)          | noncontiguous=False  |    270  |    263
      cpu  | (256, 256)          | noncontiguous=True   |    248  |    208
      cpu  | (1, 256, 256)       | noncontiguous=False  |    265  |    227
      cpu  | (1, 256, 256)       | noncontiguous=True   |    255  |    210
      cpu  | (3, 256, 256)       | noncontiguous=False  |    370  |    307
      cpu  | (3, 256, 256)       | noncontiguous=True   |    327  |    290
      cpu  | (4, 3, 256, 256)    | noncontiguous=False  |    670  |    668
      cpu  | (4, 3, 256, 256)    | noncontiguous=True   |    700  |    684
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=False  |   5100  |   3000
      cpu  | (5, 4, 3, 256, 256) | noncontiguous=True   |   4100  |   3100

Times are in microseconds (us).

pmeier requested review from vfdev-5 and datumbox on October 21, 2022 07:25
@lezcano (Contributor) left a comment


I left a few comments in case they are helpful, but overall it LGTM.

Three review comments on torchvision/prototype/transforms/functional/_color.py (all outdated and resolved).
@datumbox (Contributor) left a comment


LGTM. Let's address the review comments; otherwise we are good to go.

pmeier merged commit c041798 into pytorch:main on Oct 21, 2022
pmeier deleted the equalize branch on October 21, 2022 11:15
facebook-github-bot pushed a commit that referenced this pull request Oct 21, 2022
Summary:
* perf improvements for equalize

* improve reference tests

* add extensive comments and minor fixes to the kernel

* improve comments

Reviewed By: YosuaMichael

Differential Revision: D40588160

fbshipit-source-id: ffe05fa6aa188a3d2dfe98f4367cb1d81abe1e47

Co-authored-by: lezcano <lezcano-93@hotmail.com>