fix: reset the cross product w component to 0 in neon #213

Closed

@yono-main yono-main commented Mar 12, 2024

Would it be better to set the w component to 0 before returning?
The return values on the ARM platform now differ from what they were before.

@CLAassistant

CLAassistant commented Mar 12, 2024

CLA assistant check
All committers have signed the CLA.

@nfrechette
Owner

Hello and thank you for your interest in RTM!

In RTM, when a container is wider than what it contains, extra SIMD lanes are ignored. For example, a 3x4 matrix is composed of 4x vector4 rows where the last SIMD lane is left undefined as it is implicitly [0, 0, 0, 1] (as a column). Similarly, for 3D vector functions (like dot product, cross product, etc.), the unused W lane of the inputs is ignored and the output W lane is undefined. This should be consistent across the library unless explicitly specified in the documentation (function header comment). This is an executive decision that I made as the author. While it would be entirely valid for all undefined output lanes to be explicitly set to zero as you have here, I have made the decision to leave the value undefined in order to ensure that no unnecessary work is being performed. Usually, in 3D math where the 4th lane isn't needed, its undefined value will never be particularly relevant. Only in edge cases where perhaps you need to turn it into a mask of some sort might you need a specific value there. In practice, I have seen the unused 4th lane actually used for various things and it isn't uncommon to re-purpose it for something else. In such scenarios, explicitly setting the last lane to any known value would end up being redundant work.
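To make the 3x4 convention concrete, here is a minimal scalar sketch. It is not RTM code: `Vec4`, `Matrix3x4` and `mul_point3` are hypothetical stand-ins (the row names are illustrative), and the only point is that the W lane of each row is never read because the last column is implicitly [0, 0, 0, 1].

```cpp
#include <cassert>

// Hypothetical stand-in for a 4-lane SIMD register; not an RTM type.
struct Vec4 { float x, y, z, w; };

// Hypothetical 3x4 matrix stored as four Vec4 rows. The W lane of every
// row is treated as undefined: the last column is implicitly [0, 0, 0, 1].
struct Matrix3x4 { Vec4 x_axis, y_axis, z_axis, w_axis; };

// Transforms a 3D point. Only the XYZ lanes of the rows are read.
// The W lane of the result is unspecified; this model passes p.w through
// to stress that callers must not rely on it.
inline Vec4 mul_point3(const Vec4& p, const Matrix3x4& m)
{
    return Vec4{
        p.x * m.x_axis.x + p.y * m.y_axis.x + p.z * m.z_axis.x + m.w_axis.x,
        p.x * m.x_axis.y + p.y * m.y_axis.y + p.z * m.z_axis.y + m.w_axis.y,
        p.x * m.x_axis.z + p.y * m.y_axis.z + p.z * m.z_axis.z + m.w_axis.z,
        p.w
    };
}

// Self-check: two matrices that differ only in their W lanes give the same XYZ.
inline void demo_matrix_w_lanes_ignored()
{
    const Matrix3x4 a{ {1, 0, 0, 123}, {0, 1, 0, -7}, {0, 0, 1, 99}, {10, 20, 30, 5} };
    const Matrix3x4 b{ {1, 0, 0, 0},   {0, 1, 0, 0},  {0, 0, 1, 0},  {10, 20, 30, 1} };
    const Vec4 p{ 1, 2, 3, -5 };
    const Vec4 ra = mul_point3(p, a);
    const Vec4 rb = mul_point3(p, b);
    assert(ra.x == rb.x && ra.y == rb.y && ra.z == rb.z);
}
```

Whatever ends up in the W lanes of the rows has no effect on the transformed XYZ, which is why leaving those lanes undefined costs nothing in this model.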

Note that the SSE2 version of the cross product makes no guarantee about the returned W lane either, and there you would likely need 2 instructions in order to set W to zero in a function that otherwise takes 6 instructions: setting W to zero would thus make up 25% of the function (2 out of 8 instructions, although they would be very cheap ones). Even if these instructions are very cheap (1 cycle or less, each), they tend to add up. This may not have a measurable impact on performance in some cases, but in others it can be quite dramatic as the extra instructions can cause a larger calling function to fail to inline (e.g. whoever calls cross3). Compilers generally use the number of instructions/registers/stack space usage as heuristics to determine when to inline things, not the cycle cost of the instructions. Inlining is perhaps the biggest performance win possible in hot math code as it enables further optimizations. RTM does its best to pass as many things as possible by register, but it isn't always possible as that is dictated by the calling convention.

For those reasons, leftover lanes in outputs are left explicitly undefined in order to have functions that don't need them be as lean as possible. I do try to make an effort to set them to zero when it is free to do so (from an instruction/cycle perspective) but that is not guaranteed by the API and is not generally applicable.

If you have unit tests that rely on the leftover lane, I suggest changing them to use the 3D version of the vector comparison/testing functions, as those will ignore the 4th lane.
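As a rough illustration of the difference between a 3D and a 4D comparison in a unit test, here is a small scalar sketch. The names `Vec4`, `all_near_equal3` and `all_near_equal4` are hypothetical helpers written for this example, not RTM's API; consult the RTM headers for the actual comparison functions and their signatures.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for a 4-lane SIMD register; not an RTM type.
struct Vec4 { float x, y, z, w; };

// 3D comparison: only the XYZ lanes are read, W is deliberately ignored.
inline bool all_near_equal3(const Vec4& a, const Vec4& b, float threshold = 1.0e-5f)
{
    return std::fabs(a.x - b.x) <= threshold
        && std::fabs(a.y - b.y) <= threshold
        && std::fabs(a.z - b.z) <= threshold;
}

// 4D comparison: all four lanes must match, so an undefined W lane
// makes the outcome platform dependent.
inline bool all_near_equal4(const Vec4& a, const Vec4& b, float threshold = 1.0e-5f)
{
    return all_near_equal3(a, b, threshold)
        && std::fabs(a.w - b.w) <= threshold;
}

// Self-check: same XYZ, different leftover W lane.
inline void demo_compare_ignores_w()
{
    const Vec4 expected{ 0.0f, 0.0f, 1.0f, 0.0f };
    const Vec4 actual{ 0.0f, 0.0f, 1.0f, 42.0f };  // W holds leftover data
    assert(all_near_equal3(expected, actual));
    assert(!all_near_equal4(expected, actual));
}
```

A test written against the 3D form passes no matter what the W lane holds, so it behaves the same on SSE2 and NEON builds.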

Note that not a lot of functions support 2D vectors (yet) but the same rules will apply there and the ZW lanes will be undefined/unused.

Cheers,
Nicholas

@nfrechette nfrechette closed this Mar 12, 2024
@yono-main
Author

Hi Nicholas~

I previously used the returned vector to construct a 4x4 matrix. According to your description, I believe I should specify the leftover lanes outside of this cross function.
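A minimal sketch of that caller-side fix, under the assumption that the cross product result becomes a row of a 4x4 matrix whose W lane must be 0. `Vec4`, `cross3` and `with_w` are hypothetical scalar stand-ins written for this example, not RTM functions; `cross3` passes `a.w` through only to model an unspecified W lane.

```cpp
#include <cassert>

// Hypothetical stand-in for a 4-lane SIMD register; not an RTM type.
struct Vec4 { float x, y, z, w; };

// Scalar model of a 3D cross product. The W lanes of the inputs are not
// used in the XYZ math, and the W lane of the result is unspecified by
// contract (a.w is passed through here purely to make that visible).
inline Vec4 cross3(const Vec4& a, const Vec4& b)
{
    return Vec4{
        a.y * b.z - a.z * b.y,
        a.z * b.x - a.x * b.z,
        a.x * b.y - a.y * b.x,
        a.w
    };
}

// Returns a copy of v with the W lane replaced by an explicit value.
inline Vec4 with_w(const Vec4& v, float w)
{
    return Vec4{ v.x, v.y, v.z, w };
}

// Self-check: the caller sets W itself before using the vector as a matrix row.
inline void demo_set_w_outside_cross()
{
    const Vec4 x_axis{ 1.0f, 0.0f, 0.0f, 7.0f };
    const Vec4 y_axis{ 0.0f, 1.0f, 0.0f, 9.0f };
    const Vec4 z_axis = with_w(cross3(x_axis, y_axis), 0.0f);
    assert(z_axis.x == 0.0f && z_axis.y == 0.0f && z_axis.z == 1.0f);
    assert(z_axis.w == 0.0f);
}
```

Only the code that actually needs a known W value pays for setting it, which matches the reasoning given for leaving the lane undefined inside the cross product.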

Thank you for your detailed explanation, and also the excellent RTM library.
