Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save memory by skipping the shuffle map from Radix4 and Radix3 #81

Merged
merged 5 commits into from
Oct 20, 2021

Conversation

ejmahler
Copy link
Owner

I was looking into how to make the bit reversal in Radix4 and Radix3 more friendly to SIMD. I was working under the assumption that the bit reversals were too expensive to do in the outer loop of bitreversed_transpose(), but during my experiments, i stumbled across something that made me challenge that assumption.

I discovered that there was little or no performance difference between

  • Using the shuffle map as-is
  • Unrolling one step of the shuffle map, so that it only stores some of the values, with the rest being reconstructed via simple arithmetic
  • Entirely eliminating the shuffle map, and computing one bit reversal per outer loop, with the rest of the bit reversals being reconstructed
  • Just computing all the bit reversals in the outer loop, with no fancy reconstruction.

As a result, this PR changes Radix 4 and Radix 3 to the last bullet point, completely eliminating the shuffle map. This makes radix4 and radix3 simpler, and creates a much more obvious path for SIMD-ification of the bit reversal algorithm. Although after my experiments here, I'm not too confident that SIMD bit reversal will make much of a difference.

@ejmahler
Copy link
Owner Author

@HEnquist may find this interesting

let x0 = 4 * x;
let x1 = 4 * x + 1;
let x2 = 4 * x + 2;
let x3 = 4 * x + 3;

let x_rev = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could make sense to make a "reverse_bits_of_four" function that reverses four numbers in the same loop. I'm guessing that would make it good for the auto-vectorizer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, not a good idea. That actually runs a tiny bit slower for some odd reason.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update on that. I can't measure any difference so it's not slower. And also not faster. Let's not bother.

@HEnquist
Copy link
Contributor

This is very interesting! I didn't consider this approach since I just assumed it would be slower. How does the speed compare to the map version?

@ejmahler
Copy link
Owner Author

I'm not sure what you mean by map version.

@HEnquist
Copy link
Contributor

I'm not sure what you mean by map version.

Oh just the previous version, before this change.

@ejmahler
Copy link
Owner Author

Ah. The speed difference is within the noise range of the benchmarker. So there may be a difference, but it's too small to see.

@HEnquist
Copy link
Contributor

I can confirm that it compiles and passes the tests just fine on an aarch64 machine.

@ejmahler ejmahler merged commit b0d9bd0 into master Oct 20, 2021
@ejmahler ejmahler deleted the radix4-noshuffle branch October 20, 2021 03:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants