SIMD uint8x16 to 4 of uint32x4 with transpose? #348
Comments
So you want one uint8x16 turned into four uint32x4 like this: input (uint8x16): rgba,rgba,rgba,rgba (r, g, b, and a are each 8 bits) output 0 (uint32x4): r,r,r,r (each r is 32 bits) and vice-versa. Is my understanding correct?
Correct.
Maybe something like this will do the trick:
Note: I assume that the 'r's are in input lanes 0, 4, 8, and 12, the 'g's are in input lanes 1, 5, 9, and 13, and so on.
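For reference, the lane mapping described in the note can be modeled in plain JavaScript with typed arrays. This is only a scalar sketch of the expected result, not SIMD.js code and not the snippet from this comment; `deinterleaveBytes` is a hypothetical helper name.

```javascript
// Scalar model of the lane mapping: 16 byte lanes in, four 4-lane vectors out.
// `deinterleaveBytes` is a hypothetical name, not part of any SIMD.js API.
function deinterleaveBytes(input) {
  if (input.length !== 16) {
    throw new RangeError("expected 16 byte lanes");
  }
  // out[c] holds channel c (0 = r, 1 = g, 2 = b, 3 = a) for all four pixels.
  const out = [
    new Uint32Array(4),
    new Uint32Array(4),
    new Uint32Array(4),
    new Uint32Array(4),
  ];
  for (let pixel = 0; pixel < 4; pixel++) {
    for (let channel = 0; channel < 4; channel++) {
      // r lives in lanes 0, 4, 8, 12; g in lanes 1, 5, 9, 13; and so on.
      out[channel][pixel] = input[4 * pixel + channel];
    }
  }
  return out;
}

const pixels = Uint8Array.from([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]);
const [r, g, b, a] = deinterleaveBytes(pixels);
console.log(Array.from(r)); // [1, 5, 9, 13]
console.log(Array.from(a)); // [4, 8, 12, 16]
```

A model like this is mainly useful as a test oracle when checking a vectorized version.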
How about the following?
On native SSE code, that would be expected to run in four cycles (of throughput cost) per iteration, since the |
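A minimal sketch of a shift-and-mask split, assuming the 16 bytes are viewed as four 32-bit lanes with each lane holding one pixel and r in the lowest byte. This is a plain-JavaScript scalar model for illustration, not the code from this comment, and `splitChannels` is a hypothetical name. Each loop body corresponds to one whole-vector shift or AND in a SIMD version.

```javascript
// Scalar model of splitting packed pixels with shifts and masks only.
// Assumption: packed[i] = r | g << 8 | b << 16 | a << 24 for pixel i.
// `splitChannels` is a hypothetical name used for illustration.
function splitChannels(packed) {
  const r = new Uint32Array(4);
  const g = new Uint32Array(4);
  const b = new Uint32Array(4);
  const a = new Uint32Array(4);
  for (let i = 0; i < 4; i++) {
    const p = packed[i];
    r[i] = p & 0xff;          // low byte, no shift needed
    g[i] = (p >>> 8) & 0xff;  // logical right shift, then mask
    b[i] = (p >>> 16) & 0xff;
    a[i] = p >>> 24;          // top byte, no mask needed after a logical shift
  }
  return { r, g, b, a };
}

const packed = Uint32Array.from([0x04030201, 0x08070605, 0x0c0b0a09, 0x100f0e0d]);
const channels = splitChannels(packed);
console.log(Array.from(channels.g)); // [2, 6, 10, 14]
```

The appeal of this pattern is that it needs no byte-level swizzle or shuffle, only operations that apply uniformly to every 32-bit lane.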
Why don't you create a gitter? Also, how do I convert back to uint8x16, the way it was before?
The reverse operation could be done like this:
Not sure what you mean by gitter?
gitter.im is a chat room for repos. @acterhd, sometimes maintainers don't want the increased cost of having to check yet another place for support.
In @PeterJensen's reverse operation, the swizzles assume that the inputs are in uint8 range, and all those swizzles of lane I doubt that the above code patterns with proposed swizzles or shuffles will have good performance, since they use the kind of swizzle and shuffle patterns that do not exist in native SSE or NEON as a fast operation. Assuming that the lane
is better written as
which could map to the PSLLD instruction, which costs 1 clock cycle of throughput.
Thanks @juj, much better! The code can be simplified a bit more (fewer conversions) if the input values (srcx) are kept as Uint32x4 values. The complete function now looks like this:
So a total of three shift and three OR operations.
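The three-shift, three-OR shape can be sketched as a scalar model in plain JavaScript. This is an illustration under stated assumptions, not the function from this comment: `packChannels` is a hypothetical name, each input lane is assumed to already be in the 0..255 range, and r is assumed to occupy the lowest byte of each packed lane.

```javascript
// Scalar model of the reverse operation: four 4-lane channel vectors in,
// four packed 32-bit pixels out. Each of the three shift expressions below
// stands for one whole-vector shift, and each `|` for one whole-vector OR.
// `packChannels` is a hypothetical name; inputs are assumed to be in 0..255.
function packChannels(r, g, b, a) {
  const out = new Uint32Array(4);
  for (let i = 0; i < 4; i++) {
    // The Uint32Array store reinterprets the int32 result as unsigned.
    out[i] = r[i] | (g[i] << 8) | (b[i] << 16) | (a[i] << 24);
  }
  return out;
}

const packedPixels = packChannels(
  Uint32Array.from([1, 5, 9, 13]),
  Uint32Array.from([2, 6, 10, 14]),
  Uint32Array.from([3, 7, 11, 15]),
  Uint32Array.from([4, 8, 12, 16])
);
console.log(packedPixels[0].toString(16)); // "4030201"

// On a little-endian host, a byte view of the same buffer reads r,g,b,a per pixel.
const asBytes = new Uint8Array(packedPixels.buffer);
```

Because no lane is masked before the OR, out-of-range inputs would bleed into neighboring channels, which is why the 0..255 assumption matters.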
Hello. I have a question. How can I split a uint8x16 into four uint32x4 values with a transpose? That is, going from the layout rgba rgba rgba rgba to rrrr gggg bbbb aaaa. And backward, from four uint32x4 to a single uint8x16, again with a transpose?