
use bitshifts for int to int in convert_dtype #6978

Merged 2 commits into pytorch:main on Nov 28, 2022

Conversation

pmeier (Collaborator) commented Nov 25, 2022

[--- convert_dtype int -> int (down) @ torchvision==0.15.0a0+d5da6e4 ---]
                                        |        v1       |       v2     
1 threads: --------------------------------------------------------------
      (3, 512, 512) / int32 / cpu       |   1074 (+-  5)  |  100 (+-  0) 
      (3, 512, 512) / int32 / cuda      |    42 (+-  1)   |   41 (+-  0) 
      (16, 3, 512, 512) / int32 / cpu   |  21319 (+-291)  |  7362 (+-211)
      (16, 3, 512, 512) / int32 / cuda  |   644 (+-  1)   |  644 (+-  2) 
6 threads: --------------------------------------------------------------
      (3, 512, 512) / int32 / cpu       |   208 (+-  1)   |   33 (+-  0) 
      (16, 3, 512, 512) / int32 / cpu   |   5545 (+- 75)  |  4820 (+-112)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +37.5% (improvement)

[---- convert_dtype int -> int (up) @ torchvision==0.15.0a0+d5da6e4 ----]
                                        |        v1       |       v2     
1 threads: --------------------------------------------------------------
      (3, 512, 512) / uint8 / cpu       |   123 (+-  1)   |   90 (+-  1) 
      (3, 512, 512) / uint8 / cuda      |    46 (+-  1)   |   43 (+-  1) 
      (16, 3, 512, 512) / uint8 / cpu   |  10211 (+-169)  |  5897 (+-125)
      (16, 3, 512, 512) / uint8 / cuda  |   659 (+-  8)   |  658 (+- 19) 
6 threads: --------------------------------------------------------------
      (3, 512, 512) / uint8 / cpu       |    43 (+-  0)   |   30 (+-  0) 
      (16, 3, 512, 512) / uint8 / cpu   |   8715 (+-303)  |  2152 (+-107)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +19.7% (improvement)

Last time we measured this was in #6795 (review). In that PR the label main is equivalent to v1 above, since at that time we just aliased the v1 kernel. These are the relevant lines for comparison:

                             |  main  |   PR
1 threads: ---------------------------------
        int to   int (down)  |  1100  |  402
        int to   int (up)    |   127  |   92

Times are in microseconds (us).

For converting down, i.e. int32 to uint8, this PR doesn't change anything. Still, this branch benefits from the bitshift vectorization:

  • previously: 1100 us -> 402 us: 63.5%
  • now: 1074 us -> 100 us: 90.7%

For converting up, i.e. uint8 to int32, we replace an inplace multiplication with an inplace bitwise_left_shift_. However, the result is underwhelming:

  • previously: 127 us -> 92 us: 27.6%
  • now: 123 us -> 90 us: 26.8%

Meaning, this PR itself does nothing for performance. The cleanup is good, but I expected some gains here.
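
For context, here is roughly what the two int-to-int branches look like after this PR. This is a minimal sketch: the _num_value_bits helper and the function name are illustrative stand-ins, not the exact torchvision code.

    import torch

    def _num_value_bits(dtype: torch.dtype) -> int:
        # Value bits: total bits minus the sign bit for signed dtypes,
        # e.g. 8 for uint8 and 31 for int32.
        return torch.iinfo(dtype).bits - (1 if dtype.is_signed else 0)

    def convert_int_to_int(image: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
        in_bits = _num_value_bits(image.dtype)
        out_bits = _num_value_bits(dtype)
        if in_bits > out_bits:
            # Downscale, e.g. int32 -> uint8: drop the low-order bits, then
            # narrow. This branch already used a bitshift before this PR.
            return image.bitwise_right_shift(in_bits - out_bits).to(dtype)
        else:
            # Upscale, e.g. uint8 -> int32: widen first, then shift up. This PR
            # swaps an inplace mul_ by 2**(out_bits - in_bits) for an inplace
            # bitwise_left_shift_.
            return image.to(dtype).bitwise_left_shift_(out_bits - in_bits)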

@alexsamardzic Did I misunderstand your PRs? Shouldn't bitwise_left_shift_ be quite a bit faster than mul_ on an int32 tensor with Python scalars as inputs?
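
For reference, a self-contained comparison of the two inplace ops could look like this (shape and shift amount are illustrative, not the exact harness used for the numbers above):

    import torch
    from torch.utils import benchmark

    x = torch.randint(0, 256, (16, 3, 512, 512), dtype=torch.int32)

    for stmt in ("x.clone().mul_(2 ** 23)", "x.clone().bitwise_left_shift_(23)"):
        timer = benchmark.Timer(stmt=stmt, globals={"x": x})
        # blocked_autorange picks the number of measurement runs automatically.
        print(timer.blocked_autorange(min_run_time=1))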

cc @vfdev-5 @datumbox @bjuncek

datumbox (Contributor) commented

@pmeier Trying to understand the history around this. In your original comment (old PR), you indicated that this approach is not faster due to the lack of vectorization in core. But now you find it about 26% faster, which is not as fast as you expected. Is my understanding correct? Was there work around the specific op in core that justifies the difference between the previous attempt and now?

pmeier (Collaborator, Author) commented Nov 28, 2022

I agree this is somewhat confusing, since I needed to mix two benchmarks above. Let me try to untangle it for you:

> In your original comment (old PR), you indicated that this approach is not faster due to the lack of vectorization in core.

That was true at the time of writing.

> But now you find it about 26% faster, which is not as fast as you expected. Is my understanding correct?

I find it 26% faster than v1. This is the same as for the multiplication that we currently have in v2. Meaning,

  • v2 (main) vs. v1: +27%
  • v2 (this PR) vs. v1: +26%
  • v2 (this PR) vs v2 (main): approx 0%

> Was there work around the specific op in core that justifies the difference between the previous attempt and now?

Yes, the ops were vectorized in core by @alexsamardzic in pytorch/pytorch#88607 and pytorch/pytorch#89284. We are discussing offline as we speak whether the benchmark results in this PR are correct; I will update here once we reach a conclusion.

Due to these changes in core, you see the speedup in the other branch (int-to-int downscale), which uses bitshifts but wasn't touched by this PR at all.

datumbox (Contributor) left a comment

@pmeier Thanks for the explanation. The change looks good in terms of simplifying the code. Let's see what @alexsamardzic has to say about the speed results.

alexsamardzic commented Nov 28, 2022

The shift operators were not vectorized previously. Recently merged PRs pytorch/pytorch#88607, pytorch/pytorch#88990, and pytorch/pytorch#89284 implement vectorization of the shift operators for all integer datatypes, and this is where the performance improvement in the above benchmarks comes from. On the other hand, the multiplication operator was already vectorized for some datatypes, including int32. An operator being vectorized means that it's executed through a single (in most cases) assembly instruction that operates on several tensor elements in parallel. The corresponding assembly instructions for shifts and multiplications are typically close in the number of processor cycles required (it depends on the particular processor), so the results of the last benchmark are as expected in that regard.
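
Since the two ops are element-wise equivalent for integer tensors whenever the multiplier is a power of two, any performance difference is purely a kernel question. A quick sanity check of that equivalence (illustrative values):

    import torch

    x = torch.randint(0, 256, (1000,), dtype=torch.int32)
    # Multiplying by 2**k and left-shifting by k give identical results for
    # integer tensors; only the kernels executing them differ.
    assert torch.equal(x * 2 ** 23, x << 23)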

pmeier marked this pull request as ready for review November 28, 2022 09:08
pmeier merged commit 346f6dd into pytorch:main Nov 28, 2022
pmeier deleted the convert-dtype branch November 28, 2022 09:25
facebook-github-bot pushed a commit that referenced this pull request Nov 28, 2022
Reviewed By: jdsgomes

Differential Revision: D41548197

fbshipit-source-id: 34b1f8638e3832db45d4742b1c89ab18022886f0