Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rearranged struct fields to prevent ldp page crossings #78

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

dorukkarademirler
Copy link

Explanation of Structural Modifications

To enhance performance and reduce data cache pressure, the following structural modifications have been implemented:

Reordering Elements: The structure has been modified to prevent LDP instructions from crossing a 4K page boundary.

  1. Poly Array: Placed at the beginning of the structure.
  2. Invln2_Scaled: Positioned immediately after the poly array with a 16-byte alignment to ensure proper alignment.

Alignment Adjustments:

  1. The entire structure is now aligned to 64 bytes to further prevent page crossings.
  2. The tab variable has been moved by 256 bytes. This adjustment aligns the entire structure with a relatively small number, effectively fixing the page crossing error while minimizing wasted bytes in the worst-case scenario.

These changes collectively contribute to improved performance and reduced data cache pressure.

…p instructions from crossing page boundaries
@dorukkarademirler
Copy link
Author

For any issues or further communication related to this repository, please use my open source development email at Qualcomm: quic_dkaradem@quicinc.com.

@joeramsay
Copy link
Contributor

Thanks for your interest in contributing! Please could you provide some details of measured speedup, with your architecture and compiler? In case you don't know, you can use the mathbench binary to get microbench numbers.

Is there some way of achieving what you want without aligning invln2_scaled by 16? I see a small (2-3%) performance regression on Neoverse V1 with GCC 14 from this patch, I think because the alignment prevents LDP fusion with the last element of poly.

To merge this we need a signed contribution agreement, so that we can update GLIBC under our FSF copyright assignment - when the PR is ready to merge please could you fill out https://github.com/ARM-software/optimized-routines/blob/master/contributor-agreement.pdf and email it to optimized-routines-assignment@arm.com? Printed/scanned is fine

@dorukkarademirler
Copy link
Author

dorukkarademirler commented Jan 29, 2025

This issue was fixed on Qualcomm's Android build, arm64 architecture. The main issue was that after updating to LLVM 18, the LDP statements were crossing the page boundary with the original structure. These changes help improve performance and reduce data cache pressure. Rather than a speedup, these modifications are aimed at preventing anomalies and significant performance loss. I am attaching an image of the performance loss observed.

Looking at the Geekbench results, Libm.so's CPU usage was approximately 3% without page crossings. However, with page crossings, it increased to around 11%.

simpleperf record -e cpu-cycles results:
LLVM17: no page crossings.

image

LLVM18: second ldp crosses the page.
ADD

Performance Comparison

cycles final

Regarding the small (2-3%) performance regression on Neoverse V1 with GCC 14, I committed a version where there isn't any alignment for invln2_scaled. You can check that version as well.

As for the contribution agreement, Qualcomm might already have an agreement with ARM. If I need to do this individually as well, I will send it.

@Wilco1
Copy link
Contributor

Wilco1 commented Feb 10, 2025

So I don't think this change makes sense at all - the reordering just makes things worse (including other functions like exp2f).

The underlying issue is LLVM not aligning the structure to 16 bytes like GCC. If it did that, the LDPs are 16-byte aligned and cannot ever cross a page. So just add aligned(16) and all is well.

Can you run your benchmarks again with only that change?

@dorukkarademirler
Copy link
Author

If rearranging the elements negatively impacts performance, adding a 16-byte alignment on invln2_scaled in the original structure would still lead to page crossings, as it doesn't fully cover the elements of poly_scaled. (I am using a custom linker script to position the structures at the end of the page and monitor the results).

Page crossing seen here:
image

However, increasing invln2_scaled's alignment from 16 bytes to 32 bytes resolves the issue. This ensures that the elements of poly_scaled are properly aligned as well, effectively preventing any page crossings.

The benchmark results don't result in any performance loss after this change. I will commit the new version of this structure that only has the extra alignment, without any restructuring.

@Wilco1
Copy link
Contributor

Wilco1 commented Feb 11, 2025

I mean using the aligned on the whole structure. There is no point in doing it on a single field since the structure has already been optimally laid out for best alignment. This is an LLVM issue, all you need is to do is align the structure like GCC does by default.

@dorukkarademirler
Copy link
Author

The fields invln2_scaled and poly_scaled[EXP2F_POLY_ORDER] are crucial for generating ldp statements in libm.so. Aligning the entire structure won't fix the ldp page crossings, especially considering the sizes of the initial elements. The alignment would need to be higher than 256 bytes. Therefore, adding a small padding to the entire structure won't solve this problem.

To ensure that poly_scaled does not cross a page boundary, we need to make sure it either ends before the end of a page or starts at the beginning of the next one. One way to achieve this is by aligning a single field (invln2_scaled) by 32 bytes, which also ensures the alignment of poly_scaled. Therefore, this allows ldp statements to be generated without any risk of performance loss.

@Wilco1
Copy link
Contributor

Wilco1 commented Feb 11, 2025

No, that's not how alignment works at all. The field invln2_scaled is at offset 288 in the structure - a multiple of 16. The first LDP will load inln2_scaled and poly_scaled[0], the 2nd LDP reads the next 2 elements. So both are trivially 16-byte aligned LDPs if the start of the structure is also 16-byte aligned.

@dorukkarademirler
Copy link
Author

dorukkarademirler commented Feb 11, 2025

I understand your point. Adding 16 bytes to the structure should resolve the issue. I'll update the commit accordingly.

In the meantime, I will run our benchmarks on the final version.

@dorukkarademirler
Copy link
Author

I just completed running the benchmarks and I can report that there are no performance regressions. The ldp statements seem to be generated without any issues with the current version.

@Wilco1
Copy link
Contributor

Wilco1 commented Feb 18, 2025

I've committed a generic workaround for LLVM: 9bf0c3d

I've applied the extra alignment to all frequently used math functions, which should avoid page crossings in most cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants