-
-
Notifications
You must be signed in to change notification settings - Fork 851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorize TrimTransparentPixels in GifEncoderCore #2500
Vectorize TrimTransparentPixels in GifEncoderCore #2500
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some notes for review.
Vector256<byte> vec = Vector256.LoadUnsafe(ref rowPtr, (nuint)x); | ||
Vector256<byte> notEquals = ~Vector256.Equals(vec, trimmableVec256); | ||
|
||
if (notEquals != Vector256<byte>.Zero) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment I don't have any idea on how to make this branchless.
isTransparentRow
could be tracked in a vector, but left
and right
not, as there's a mismatch of vector-types, namely byte
and int
.
A quite complicated approach would be to use VectorXYZ<byte>
and track the left and right -- but just before these can overflow merge it back to the scalar left
, right
and start over. But I guess the book-keeping is more work, so I'm not sure if this is actually faster. For sure the code gets painful.
} | ||
} | ||
#endif | ||
for (; x < rowLength; ++x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The remainder could be handled vectorized too, by shifting the mask of the most significant bits around by the count of elements left in the final vector.
I tried this somewhere else, the cost for that book-keeping isn't negligible, so didn't do this here and now (maybe I'll try this later).
Hm, I remembered that movemask isn't the fastest, and ptest (TestZ in .NET-terms) is faster but current benchmarks didn't prove this, also Intel's instruction table didn't show any benefit in terms of latency or throughput. Thus simplified that check.
Oof! Look at those numbers!! Thanks so much for looking at this. I'm going have a good dig through it to wrap my head round what you have done. |
😃 PS: I'm back on Tuesday, so maybe slow to respond in the meantime. |
This is fantastic stuff. I figured theoretically after reading some of the source for Tip of the cap to you sir. |
Thanks for the kind words ❤️ |
Prerequisites
Description
See #2455 (comment)
A simple benchmark -- just for the inner loop -- yields:
This is measured with .NET 7, but the codegen for .NET 6 is very similar.
benchmark code