use bitshifts for int to int in convert_dtype #6978
Conversation
@pmeier Trying to understand the history around this. In your original comment (old PR), you indicated that this approach is not faster due to the lack of vectorization in core. But now you find it about 26% faster, which is not as fast as you expected. Is my understanding correct? Was there work on the specific ops in core that explains the difference between the previous attempt and now?
I agree this is somewhat confusing, since I needed to mix two benchmarks above. Let me try to untangle it for you:
1. That was true at the time of writing it.
2. I find it 26% faster than v1. This is the same as for the multiplication that we currently have in v2, i.e. the bitshift gives no speedup over the multiplication we already use.
3. Yes, the ops were vectorized in core by @alexsamardzic in pytorch/pytorch#88607 and pytorch/pytorch#89284. We are discussing offline as we speak whether my benchmark results in this PR are correct or not. I will update here once we reach a conclusion. Due to these changes in core, you see this speedup in the other branch (int-to-int downscale) that uses bitshifts but wasn't touched by this PR at all (see the sketch below).
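To make that downscale branch concrete, here is a minimal sketch of the idea; the helper name and shapes are illustrative, not the actual torchvision kernel:

```python
import torch

def downscale_int32_to_uint8(image: torch.Tensor) -> torch.Tensor:
    # int32 carries 31 value bits, uint8 carries 8, so we drop 23 bits.
    # For the non-negative values an image holds, a right shift by 23 is
    # equivalent to a floor division by 2**23, but maps to the (now
    # vectorized) shift kernel.
    return image.bitwise_right_shift(23).to(torch.uint8)

image = torch.randint(0, torch.iinfo(torch.int32).max, (3, 256, 256), dtype=torch.int32)
print(downscale_int32_to_uint8(image).dtype)  # torch.uint8
```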
@pmeier Thanks for the explanation. The change looks good in terms of simplifying the code. Let's see what @alexsamardzic has to say concerning the results on speed.
The shift operators were not vectorized previously. Recently merged PRs pytorch/pytorch#88607, pytorch/pytorch#88990 and pytorch/pytorch#89284 implement vectorization of the shift operators for all integer datatypes, and this is where the performance improvement in the above benchmarks comes from. On the other hand, the multiplication operator was already vectorized for some datatypes, including int32. An operator being vectorized means that it is executed through a single (in most cases) assembly instruction that operates on several tensor elements in parallel. The corresponding assembly instructions for shifts and multiplications are typically close in the number of processor cycles required (it depends on the particular processor), and the results of the last benchmark are as expected in that regard.
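To sanity-check this locally, a rough micro-benchmark along these lines can be used; this is a sketch, not the harness that produced the numbers above:

```python
import torch
from torch.utils import benchmark

# Compare an inplace multiplication by 2**23 against an inplace left shift by 23
# on an int32 tensor (the uint8 -> int32 upcast factor discussed in this PR).
# Zeros are used so repeated inplace runs cannot overflow.
setup = "t = torch.zeros(3, 256, 256, dtype=torch.int32)"
for stmt in ("t.mul_(2 ** 23)", "t.bitwise_left_shift_(23)"):
    timer = benchmark.Timer(stmt=stmt, setup=setup, globals={"torch": torch})
    print(timer.blocked_autorange(min_run_time=1))
```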
Reviewed By: jdsgomes Differential Revision: D41548197 fbshipit-source-id: 34b1f8638e3832db45d4742b1c89ab18022886f0
Last time we measured this was in #6795 (review). In this PR the label `main` is equivalent to `v1` above, since at that time we just aliased the v1 kernel. These are the relevant lines for comparison:

For converting down, i.e. `int32` to `uint8`, this PR doesn't change anything. Still, this branch benefits from the bitshift vectorization:

For converting up, i.e. `uint8` to `int32`, we replace an inplace multiplication with an inplace `bitwise_left_shift_`. However, the result is underwhelming:

Meaning, this PR does nothing for performance. The cleanup is good, but I expected some gains here.
@alexsamardzic Did I misunderstand your PRs? Shouldn't `bitwise_left_shift_` be quite a bit faster than `mul_` on an `int32` tensor and with Python scalars as inputs?

cc @vfdev-5 @datumbox @bjuncek