Rearranged struct fields to prevent ldp page crossings #78
base: master
…p instructions from crossing page boundaries
For any issues or further communication related to this repository, please use my open source development email at Qualcomm: quic_dkaradem@quicinc.com.
Thanks for your interest in contributing! Please could you provide some details of measured speedup, with your architecture and compiler? In case you don't know, you can use the mathbench binary to get microbench numbers. Is there some way of achieving what you want without aligning To merge this we need a signed contribution agreement, so that we can update GLIBC under our FSF copyright assignment - when the PR is ready to merge please could you fill out https://github.com/ARM-software/optimized-routines/blob/master/contributor-agreement.pdf and email it to optimized-routines-assignment@arm.com? Printed/scanned is fine.
So I don't think this change makes sense at all - the reordering just makes things worse (including for other functions like exp2f). The underlying issue is that LLVM does not align the structure to 16 bytes like GCC does. If it did, the LDPs would be 16-byte aligned and could never cross a page. So just add aligned(16) and all is well. Can you run your benchmarks again with only that change?
If rearranging the elements negatively impacts performance, adding a 16-byte alignment on invln2_scaled in the original structure would still lead to page crossings, as it doesn't fully cover the elements of poly_scaled. (I am using a custom linker script to position the structures at the end of the page and monitor the results.) However, increasing the alignment of invln2_scaled from 16 bytes to 32 bytes resolves the issue. This ensures that the elements of poly_scaled are properly aligned as well, effectively preventing any page crossings. The benchmarks show no performance loss after this change. I will commit a new version of this structure that only has the extra alignment, without any restructuring.
I mean using the aligned attribute on the whole structure. There is no point in doing it on a single field since the structure has already been optimally laid out for best alignment. This is an LLVM issue; all you need to do is align the structure like GCC does by default.
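To make the suggestion above concrete, here is a minimal sketch of aligning the whole object rather than a single field. The struct `exp2f_data_sketch`, its `head` member, and the object `sketch_data` are hypothetical stand-ins, not the repository's actual definitions; the layout is chosen only so that `invln2_scaled` lands on a multiple of 16 bytes, and the attribute syntax is the GCC/Clang `aligned` attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the data table discussed in this thread.
   The real structure has different leading members; only the fact that
   invln2_scaled sits at a multiple of 16 matters for this sketch.  */
struct exp2f_data_sketch
{
  uint64_t tab[32];      /* 256 bytes */
  double head[4];        /* 32 bytes of leading scalars (assumed) */
  double invln2_scaled;  /* first half of the first 16-byte pair */
  double poly_scaled[3]; /* rest of the data read by the paired loads */
};

/* The attribute is applied to the whole object, not to one field, so
   the start address of the object is a multiple of 16.  */
static const struct exp2f_data_sketch sketch_data
  __attribute__ ((aligned (16))) = { .invln2_scaled = 1.0 };

/* Compile-time check that the field offset is a multiple of 16.  */
static_assert (offsetof (struct exp2f_data_sketch, invln2_scaled) % 16 == 0,
               "invln2_scaled must sit at a multiple of 16 bytes");

/* Returns nonzero when the 16-byte pair starting at invln2_scaled is
   16-byte aligned; the next pair, 16 bytes later, is then aligned too.  */
static int
ldp_pairs_aligned (const struct exp2f_data_sketch *d)
{
  return (uintptr_t) &d->invln2_scaled % 16 == 0;
}
```

With a 16-byte-aligned base and a field offset that is a multiple of 16, every 16-byte pair read from that field onward starts on a 16-byte boundary.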
The fields invln2_scaled and poly_scaled[EXP2F_POLY_ORDER] are crucial for generating ldp instructions in libm.so. Aligning the entire structure won't fix the ldp page crossings, especially considering the sizes of the initial elements. The alignment would need to be higher than 256 bytes. Therefore, adding a small padding to the entire structure won't solve this problem. To ensure that poly_scaled does not cross a page boundary, we need to make sure it either ends before the end of a page or starts at the beginning of the next one. One way to achieve this is by aligning a single field (invln2_scaled) to 32 bytes, which also ensures the alignment of poly_scaled. This allows ldp instructions to be generated without any risk of performance loss.
No, that's not how alignment works at all. The field invln2_scaled is at offset 288 in the structure - a multiple of 16. The first LDP will load invln2_scaled and poly_scaled[0], the 2nd LDP reads the next 2 elements. So both are trivially 16-byte aligned LDPs if the start of the structure is also 16-byte aligned.
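The arithmetic in the comment above can be checked with a small sketch. The helper names `crosses_page` and `count_crossings` are hypothetical, and a 4K page size is assumed (as in the PR description); the code enumerates every possible base address at a given alignment and counts how many of the two 16-byte loads would straddle a page boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BYTES 4096u /* assumed 4K pages */
#define LDP_BYTES 16u    /* one LDP of two doubles reads 16 bytes */

/* The argument only works because the page size is a multiple of 16.  */
static_assert (PAGE_BYTES % LDP_BYTES == 0,
               "page size must be a multiple of the load width");

/* True if the byte range [addr, addr + len) spans two pages.  */
static bool
crosses_page (uintptr_t addr, unsigned len)
{
  return addr / PAGE_BYTES != (addr + len - 1) / PAGE_BYTES;
}

/* Try every base address with the given alignment across two pages and
   count how many of the two consecutive 16-byte loads starting at
   base + field_offset would cross a page boundary.  */
static unsigned
count_crossings (unsigned base_align, unsigned field_offset)
{
  unsigned n = 0;
  for (uintptr_t base = 0; base < 2 * PAGE_BYTES; base += base_align)
    {
      if (crosses_page (base + field_offset, LDP_BYTES))
        n++;
      if (crosses_page (base + field_offset + LDP_BYTES, LDP_BYTES))
        n++;
    }
  return n;
}
```

Under these assumptions, `count_crossings (16, 288)` finds no crossing, while an 8-byte-aligned base (`count_crossings (8, 288)`) does produce crossings, which matches the behaviour reported with LLVM before the structure was aligned.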
I understand your point. Adding 16-byte alignment to the structure should resolve the issue. I'll update the commit accordingly. In the meantime, I will run our benchmarks on the final version.
I just finished running the benchmarks and can report that there are no performance regressions. The ldp instructions are generated without any issues with the current version.
I've committed a generic workaround for LLVM: 9bf0c3d. I've applied the extra alignment to all frequently used math functions, which should avoid page crossings in most cases.
Explanation of Structural Modifications
To enhance performance and reduce data cache pressure, the following structural modifications have been implemented:
Reordering Elements: The structure has been modified to prevent LDP instructions from crossing a 4K page boundary.
Alignment Adjustments:
These changes collectively contribute to improved performance and reduced data cache pressure.