
use bitshifts for int to int in convert_dtype #6978

Merged 2 commits into pytorch:main on Nov 28, 2022

Conversation

pmeier (Collaborator) commented Nov 25, 2022

[--- convert_dtype int -> int (down) @ torchvision==0.15.0a0+d5da6e4 ---]
                                        |        v1       |       v2     
1 threads: --------------------------------------------------------------
      (3, 512, 512) / int32 / cpu       |   1074 (+-  5)  |  100 (+-  0) 
      (3, 512, 512) / int32 / cuda      |    42 (+-  1)   |   41 (+-  0) 
      (16, 3, 512, 512) / int32 / cpu   |  21319 (+-291)  |  7362 (+-211)
      (16, 3, 512, 512) / int32 / cuda  |   644 (+-  1)   |  644 (+-  2) 
6 threads: --------------------------------------------------------------
      (3, 512, 512) / int32 / cpu       |   208 (+-  1)   |   33 (+-  0) 
      (16, 3, 512, 512) / int32 / cpu   |   5545 (+- 75)  |  4820 (+-112)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +37.5% (improvement)

[---- convert_dtype int -> int (up) @ torchvision==0.15.0a0+d5da6e4 ----]
                                        |        v1       |       v2     
1 threads: --------------------------------------------------------------
      (3, 512, 512) / uint8 / cpu       |   123 (+-  1)   |   90 (+-  1) 
      (3, 512, 512) / uint8 / cuda      |    46 (+-  1)   |   43 (+-  1) 
      (16, 3, 512, 512) / uint8 / cpu   |  10211 (+-169)  |  5897 (+-125)
      (16, 3, 512, 512) / uint8 / cuda  |   659 (+-  8)   |  658 (+- 19) 
6 threads: --------------------------------------------------------------
      (3, 512, 512) / uint8 / cpu       |    43 (+-  0)   |   30 (+-  0) 
      (16, 3, 512, 512) / uint8 / cpu   |   8715 (+-303)  |  2152 (+-107)

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: +19.7% (improvement)

Last time we measured this was in #6795 (review). In that PR the label main is equivalent to v1 above, since at that time we just aliased the v1 kernel. These are the relevant lines for comparison:

                             |  main  |   PR
1 threads: ---------------------------------
        int to   int (down)  |  1100  |  402
        int to   int (up)    |   127  |   92

Times are in microseconds (us).

For converting down, i.e. int32 to uint8, this PR doesn't change anything. Still, this branch benefits from the bitshift vectorization:

  • previously: 1100 us -> 402 us: 63.5%
  • now: 1074 us -> 100 us: 90.7%

For converting up, i.e. uint8 to int32, we replace an inplace multiplication with an inplace bitwise_left_shift_. However, the result is underwhelming:

  • previously: 127 us -> 92 us: 27.6%
  • now: 123 us -> 90 us: 26.8%

Meaning, this PR itself does nothing for performance. The cleanup is good, but I expected some gains here.
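
For context, here is roughly what the two int-to-int branches look like after this PR. This is a minimal sketch: the _num_value_bits helper and the function name are illustrative stand-ins, not the exact torchvision code.

    import torch

    def _num_value_bits(dtype: torch.dtype) -> int:
        # Value bits: total bits minus the sign bit for signed dtypes,
        # e.g. 8 for uint8 and 31 for int32.
        return torch.iinfo(dtype).bits - (1 if dtype.is_signed else 0)

    def convert_int_to_int(image: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
        in_bits = _num_value_bits(image.dtype)
        out_bits = _num_value_bits(dtype)
        if in_bits > out_bits:
            # Downscale, e.g. int32 -> uint8: drop the low-order bits, then
            # narrow. This branch already used a bitshift before this PR.
            return image.bitwise_right_shift(in_bits - out_bits).to(dtype)
        else:
            # Upscale, e.g. uint8 -> int32: widen first, then shift up. This PR
            # swaps an inplace mul_ by 2**(out_bits - in_bits) for an inplace
            # bitwise_left_shift_.
            return image.to(dtype).bitwise_left_shift_(out_bits - in_bits)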

@alexsamardzic Did I misunderstand your PRs? Shouldn't bitwise_left_shift_ be quite a bit faster than mul_ on an int32 tensor with Python scalars as inputs?
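
For reference, a self-contained comparison of the two inplace ops could look like this (shape and shift amount are illustrative, not the exact harness used for the numbers above):

    import torch
    from torch.utils import benchmark

    x = torch.randint(0, 256, (16, 3, 512, 512), dtype=torch.int32)

    for stmt in ("x.clone().mul_(2 ** 23)", "x.clone().bitwise_left_shift_(23)"):
        timer = benchmark.Timer(stmt=stmt, globals={"x": x})
        # blocked_autorange picks the number of measurement runs automatically.
        print(timer.blocked_autorange(min_run_time=1))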

cc @vfdev-5 @datumbox @bjuncek

datumbox (Contributor) commented

@pmeier Trying to understand the history around this. In your original comment (old PR), you indicated that this approach is not faster due to the lack of vectorization in core. But now you find it about 26% faster, which is not as fast as you expected. Is my understanding correct? Was there work around the specific op in core that justifies the difference between the previous attempt and now?

pmeier (Collaborator, Author) commented Nov 28, 2022

I agree this is somewhat confusing, since I needed to mix two benchmarks above. Let me try to untangle it for you:

> In your original comment (old PR), you indicated that this approach is not faster due to the lack of vectorization in core.

That was true at the time of writing.

> But now you find it about 26% faster, which is not as fast as you expected. Is my understanding correct?

I find it 26% faster than v1. This is the same as for the multiplication that we currently have in v2. Meaning,

  • v2 (main) vs. v1: +27%
  • v2 (this PR) vs. v1: +26%
  • v2 (this PR) vs v2 (main): approx 0%

> Was there work around the specific op in core that justifies the difference between the previous attempt and now?

Yes, the ops were vectorized in core by @alexsamardzic in pytorch/pytorch#88607 and pytorch/pytorch#89284. We are discussing offline as we speak whether the benchmark results in this PR are correct; I will update here once we reach a conclusion.

Due to these changes in core, you see the speedup in the other branch (int-to-int downscale), which uses bitshifts but wasn't touched by this PR at all.

datumbox (Contributor) left a comment

@pmeier Thanks for the explanation. The change looks good in terms of simplifying the code. Let's see what @alexsamardzic has to say about the speed results.

alexsamardzic commented Nov 28, 2022

The shift operators were not vectorized previously. Recently merged PRs pytorch/pytorch#88607, pytorch/pytorch#88990, and pytorch/pytorch#89284 implement vectorization of the shift operators for all integer datatypes, and this is where the performance improvement in the above benchmarks comes from. On the other hand, the multiplication operator was already vectorized for some datatypes, including int32. An operator being vectorized means that it's executed through a single (in most cases) assembly instruction that operates on several tensor elements in parallel. The corresponding assembly instructions for shifts and multiplications are typically close in the number of processor cycles required (it depends on the particular processor), so the results of the last benchmark are as expected in that regard.
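
Since the two ops are element-wise equivalent for integer tensors whenever the multiplier is a power of two, any performance difference is purely a kernel question. A quick sanity check of that equivalence (illustrative values):

    import torch

    x = torch.randint(0, 256, (1000,), dtype=torch.int32)
    # Multiplying by 2**k and left-shifting by k give identical results for
    # integer tensors; only the kernels executing them differ.
    assert torch.equal(x * 2 ** 23, x << 23)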

pmeier marked this pull request as ready for review November 28, 2022 09:08
pmeier merged commit 346f6dd into pytorch:main Nov 28, 2022
pmeier deleted the convert-dtype branch November 28, 2022 09:25
facebook-github-bot pushed a commit that referenced this pull request Nov 28, 2022
Reviewed By: jdsgomes

Differential Revision: D41548197

fbshipit-source-id: 34b1f8638e3832db45d4742b1c89ab18022886f0