Save memory by skipping the shuffle map from Radix4 and Radix3 #81
@HEnquist may find this interesting
let x0 = 4 * x;
let x1 = 4 * x + 1;
let x2 = 4 * x + 2;
let x3 = 4 * x + 3;

let x_rev = [
It could make sense to make a "reverse_bits_of_four" function that reverses four numbers in the same loop. I'm guessing that would make it good for the auto-vectorizer.
Nope, not a good idea. That actually runs a tiny bit slower for some odd reason.
Update on that. I can't measure any difference so it's not slower. And also not faster. Let's not bother.
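For readers following the thread, here is a minimal sketch of what a four-at-a-time reversal could look like. The function names `reverse_base4_digits` and `reverse_base4_digits_of_four` are hypothetical and are not taken from the PR; the sketch assumes base-4 digits (two bits each), as in Radix4, and only illustrates the idea of reversing four indices in the same loop.

```rust
/// Reverse the order of the lowest `digits` base-4 digits of `value`.
/// Hypothetical helper for illustration, not the crate's actual function.
fn reverse_base4_digits(mut value: usize, digits: u32) -> usize {
    let mut result = 0;
    for _ in 0..digits {
        // Move the lowest base-4 digit of `value` onto the low end of `result`.
        result = (result << 2) | (value & 0b11);
        value >>= 2;
    }
    result
}

/// Reverse four indices in lockstep, so each loop iteration performs the
/// same shift/mask work on all four lanes. Whether the compiler actually
/// vectorizes this is not guaranteed.
fn reverse_base4_digits_of_four(mut values: [usize; 4], digits: u32) -> [usize; 4] {
    let mut results = [0usize; 4];
    for _ in 0..digits {
        for lane in 0..4 {
            results[lane] = (results[lane] << 2) | (values[lane] & 0b11);
            values[lane] >>= 2;
        }
    }
    results
}
```

Both functions compute the same values; the second only changes the loop structure, which is consistent with the observation above that no speed difference could be measured.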
This is very interesting! I didn't consider this approach since I just assumed it would be slower. How does the speed compare to the map version?
I'm not sure what you mean by map version.
Oh just the previous version, before this change.
Ah. The speed difference is within the noise range of the benchmarker. So there may be a difference, but it's too small to see.
I can confirm that it compiles and passes the tests just fine on an aarch64 machine.
I was looking into how to make the bit reversal in Radix4 and Radix3 more friendly to SIMD. I was working under the assumption that the bit reversals were too expensive to do in the outer loop of `bitreversed_transpose()`, but during my experiments, I stumbled across something that made me challenge that assumption.
I discovered that there was little or no performance difference between
As a result, this PR changes Radix4 and Radix3 to the last bullet point, completely eliminating the shuffle map. This makes Radix4 and Radix3 simpler, and creates a much more obvious path for SIMD-ification of the bit reversal algorithm. That said, after my experiments here, I'm not too confident that SIMD bit reversal will make much of a difference.
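To make the trade-off concrete, here is a minimal sketch contrasting a precomputed shuffle map with recomputing the reversed index in the outer loop. It is a simplified illustration, not the PR's code: the names `rev4`, `transpose_with_map`, and `transpose_on_the_fly` are hypothetical, the element type is fixed to `u32`, and the width is assumed to be a power of four.

```rust
/// Reverse the lowest `digits` base-4 digits of `value` (hypothetical helper).
fn rev4(mut value: usize, digits: u32) -> usize {
    let mut result = 0;
    for _ in 0..digits {
        result = (result << 2) | (value & 0b11);
        value >>= 2;
    }
    result
}

/// Map-based variant: the reversed column indices are computed ahead of time
/// and kept in memory as `map`, one entry per column.
fn transpose_with_map(height: usize, input: &[u32], output: &mut [u32], map: &[usize]) {
    let width = input.len() / height;
    for x in 0..width {
        for y in 0..height {
            output[y + map[x] * height] = input[x + y * width];
        }
    }
}

/// Map-free variant: the reversed index is recomputed once per column in the
/// outer loop, so no lookup table has to be stored.
fn transpose_on_the_fly(height: usize, input: &[u32], output: &mut [u32]) {
    let width = input.len() / height;
    // Assumes `width` is a power of four, so it has log4(width) base-4 digits.
    let digits = width.trailing_zeros() / 2;
    for x in 0..width {
        let x_rev = rev4(x, digits);
        for y in 0..height {
            output[y + x_rev * height] = input[x + y * width];
        }
    }
}
```

In this sketch the reversal runs once per column rather than once per element, so its cost is amortized over the inner loop. That is one plausible reason a benchmark might not be able to tell the two variants apart, though the sketch itself does not measure anything.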