Rearranged struct fields to prevent ldp page crossings #78

dorukkarademirler · 2025-01-28T18:51:58Z

Explanation of Structural Modifications

To enhance performance and reduce data cache pressure, the following structural modifications have been implemented:

Reordering Elements: The structure has been modified to prevent LDP instructions from crossing a 4K page boundary.

Poly array: placed at the beginning of the structure.
invln2_scaled: positioned immediately after the poly array, with 16-byte alignment.

Alignment Adjustments:

The entire structure is now aligned to 64 bytes to further prevent page crossings.
The tab array has been moved by 256 bytes. This lets the structure be aligned to a relatively small value, which fixes the page crossing while minimizing the bytes wasted on padding in the worst case.

These changes collectively contribute to improved performance and reduced data cache pressure.


dorukkarademirler · 2025-01-28T19:10:47Z

For any issues or further communication related to this repository, please use my open source development email at Qualcomm: quic_dkaradem@quicinc.com.

joeramsay · 2025-01-29T10:19:47Z

Thanks for your interest in contributing! Please could you provide some details of measured speedup, with your architecture and compiler? In case you don't know, you can use the mathbench binary to get microbench numbers.

Is there some way of achieving what you want without aligning invln2_scaled by 16? I see a small (2-3%) performance regression on Neoverse V1 with GCC 14 from this patch, I think because the alignment prevents LDP fusion with the last element of poly.

To merge this we need a signed contribution agreement, so that we can update GLIBC under our FSF copyright assignment. When the PR is ready to merge, please could you fill out https://github.com/ARM-software/optimized-routines/blob/master/contributor-agreement.pdf and email it to optimized-routines-assignment@arm.com? Printed/scanned is fine.

dorukkarademirler · 2025-01-29T20:26:38Z

This issue was fixed on Qualcomm's Android build, arm64 architecture. The main issue was that after updating to LLVM 18, the LDP instructions were crossing the page boundary with the original structure. Rather than a speedup, these modifications are aimed at preventing anomalies and significant performance loss. I am attaching an image of the performance loss observed.

Looking at the Geekbench results, libm.so's CPU usage was approximately 3% without page crossings. However, with page crossings, it increased to around 11%.

simpleperf record -e cpu-cycles results:
LLVM17: no page crossings.

LLVM18: second ldp crosses the page.

Performance Comparison

Regarding the small (2-3%) performance regression on Neoverse V1 with GCC 14, I committed a version where there isn't any alignment for invln2_scaled. You can check that version as well.

As for the contribution agreement, Qualcomm might already have an agreement with ARM. If I need to do this individually as well, I will send it.

Wilco1 · 2025-02-10T14:19:29Z

So I don't think this change makes sense at all - the reordering just makes things worse (including other functions like exp2f).

The underlying issue is LLVM not aligning the structure to 16 bytes like GCC. If it did that, the LDPs are 16-byte aligned and cannot ever cross a page. So just add aligned(16) and all is well.

Can you run your benchmarks again with only that change?
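The alignment argument above can be checked with a little address arithmetic. This is a minimal sketch, assuming 4K pages as in the PR description; `crosses_page` and `count_crossings` are hypothetical helper names, not part of the library.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u /* assumption: 4K pages, as in the PR description */

/* Returns 1 if an access of `size` bytes starting at `addr` spans two pages. */
static int
crosses_page (uint64_t addr, uint64_t size)
{
  return (addr / PAGE_SIZE) != ((addr + size - 1) / PAGE_SIZE);
}

/* Counts how many 16-byte accesses (one LDP of two doubles) cross a page
   when the start address steps through one page in multiples of `align`. */
static unsigned
count_crossings (uint64_t align)
{
  unsigned n = 0;
  for (uint64_t addr = 0; addr < PAGE_SIZE; addr += align)
    n += crosses_page (addr, 16);
  return n;
}
```

With `align` set to 16 no start address crosses a page, because 4096 is a multiple of 16. With `align` set to 8 there is exactly one bad start address per page (offset 4088), which is the case an 8-byte aligned structure can hit.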

dorukkarademirler · 2025-02-10T20:07:09Z

Even if rearranging the elements negatively impacts performance, adding 16-byte alignment to invln2_scaled alone in the original structure would still lead to page crossings, because it doesn't fully cover the elements of poly_scaled. (I am using a custom linker script to position the structures at the end of a page and monitor the results.)

Page crossing seen here:

However, increasing invln2_scaled's alignment from 16 bytes to 32 bytes resolves the issue. This ensures that the elements of poly_scaled are properly aligned as well, effectively preventing any page crossings.

The benchmarks show no performance loss after this change. I will commit the new version of this structure that only has the extra alignment, without any restructuring.
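For anyone who wants to reproduce the end-of-page placement without a custom linker script, a rough runtime equivalent is sketched below. It is an assumption-laden stand-in for the linker-script setup described above: `pair_crosses_page` and `crossing_at` are hypothetical names, the page size is assumed to be 4K, and C11 `aligned_alloc` is used to get a page-aligned buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096u /* assumption: 4K pages */

/* Returns 1 if the 16 bytes starting at `p` (the two doubles one LDP
   would load) span a page boundary. */
static int
pair_crosses_page (const void *p)
{
  uintptr_t a = (uintptr_t) p;
  return (a / PAGE_SIZE) != ((a + 16 - 1) / PAGE_SIZE);
}

/* Picks an address `back` bytes before the end of the first of two
   page-aligned pages and reports whether a 16-byte load from it would
   run into the second page.  Returns -1 if the allocation fails. */
static int
crossing_at (size_t back)
{
  unsigned char *buf = aligned_alloc (PAGE_SIZE, 2 * PAGE_SIZE);
  if (buf == NULL)
    return -1;
  int r = pair_crosses_page (buf + PAGE_SIZE - back);
  free (buf);
  return r;
}
```

Here `crossing_at (8)` models a pair that starts 8 bytes before the page end, and `crossing_at (16)` models the 16-byte aligned case that ends exactly at the boundary.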

Wilco1 · 2025-02-11T02:37:20Z

I mean using the aligned attribute on the whole structure. There is no point in doing it on a single field since the structure has already been optimally laid out for best alignment. This is an LLVM issue; all you need to do is align the structure like GCC does by default.

dorukkarademirler · 2025-02-11T16:41:37Z

The fields invln2_scaled and poly_scaled[EXP2F_POLY_ORDER] are crucial for generating ldp statements in libm.so. Aligning the entire structure won't fix the ldp page crossings, especially considering the sizes of the initial elements. The alignment would need to be higher than 256 bytes. Therefore, adding a small padding to the entire structure won't solve this problem.

To ensure that poly_scaled does not cross a page boundary, we need to make sure it either ends before the end of a page or starts at the beginning of the next one. One way to achieve this is by aligning a single field (invln2_scaled) by 32 bytes, which also ensures the alignment of poly_scaled. Therefore, this allows ldp statements to be generated without any risk of performance loss.

Wilco1 · 2025-02-11T17:12:58Z

No, that's not how alignment works at all. The field invln2_scaled is at offset 288 in the structure, which is a multiple of 16. The first LDP will load invln2_scaled and poly_scaled[0], and the 2nd LDP reads the next 2 elements. So both are trivially 16-byte aligned LDPs if the start of the structure is also 16-byte aligned.
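The offsets above can be verified at compile time. The layout below is an assumption made for illustration (a 32-entry `tab`, via a hypothetical `EXP2F_TABLE_BITS` of 5, and `EXP2F_POLY_ORDER` of 3), chosen so that invln2_scaled lands at offset 288 as stated; it is not copied from the repository, and `exp2f_data_sketch` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXP2F_TABLE_BITS 5 /* assumption: 32 entries, 256 bytes */
#define EXP2F_POLY_ORDER 3 /* assumption: 3 coefficients */

/* Illustrative layout only; the whole structure is 16-byte aligned. */
struct exp2f_data_sketch
{
  uint64_t tab[1 << EXP2F_TABLE_BITS];  /* offset 0   */
  double shift_scaled;                  /* offset 256 */
  double poly[EXP2F_POLY_ORDER];        /* offset 264 */
  double invln2_scaled;                 /* offset 288 */
  double poly_scaled[EXP2F_POLY_ORDER]; /* offset 296 */
} __attribute__ ((aligned (16)));

/* Start offset of the first LDP: invln2_scaled and poly_scaled[0]. */
static size_t
first_ldp_offset (void)
{
  return offsetof (struct exp2f_data_sketch, invln2_scaled);
}

/* Start offset of the second LDP: poly_scaled[1] and poly_scaled[2]. */
static size_t
second_ldp_offset (void)
{
  return offsetof (struct exp2f_data_sketch, poly_scaled[1]);
}

static_assert (offsetof (struct exp2f_data_sketch, invln2_scaled) % 16 == 0,
               "first LDP starts on a 16-byte boundary");
static_assert (offsetof (struct exp2f_data_sketch, poly_scaled[1]) % 16 == 0,
               "second LDP starts on a 16-byte boundary");
```

Under these assumptions both pairs start at multiples of 16 within the structure, so a 16-byte aligned base address is enough to keep each 16-byte load inside a single page.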

dorukkarademirler · 2025-02-11T22:28:50Z

I understand your point. Adding 16-byte alignment to the structure should resolve the issue. I'll update the commit accordingly.

In the meantime, I will run our benchmarks on the final version.

dorukkarademirler · 2025-02-13T15:44:42Z

I just completed running the benchmarks and I can report that there are no performance regressions. The ldp statements seem to be generated without any issues with the current version.

Wilco1 · 2025-02-18T15:56:49Z

I've committed a generic workaround for LLVM: 9bf0c3d

I've applied the extra alignment to all frequently used math functions, which should avoid page crossings in most cases.

Rearranged struct fields and added alignment attributes to prevent ldp instructions from crossing page boundaries

8fb6cbf

Remove whitespaces

4f857bd

Removed the invln2_scaled alignment and restructured

71daa87

Added alignment, without restructuring

389be86

Aligned exp2f_data to prevent page crossings

ff428af
