[Mono AOT] Improve dn_simdhash lookup on arm64 #113074
Comments
@kg Could we get reasonable performance on arm64 using the existing intrinsics, or do you suggest using a software fallback?
I think so. I can help with testing, ping me directly.
Both: macOS for AOT compilation and Apple mobile for the runtime.
I am not sure how to interpret these numbers. How do they translate into something user observable? I am looking for a statement like: This optimization can improve <some customer-observable metric - is it app startup time?> by <how much?>.
FWIW, an old Blazor startup profile I had lying around showed 18% of total CPU samples inside hashing operations (both simdhash and ghashtable); the Mono runtime's startup path spends a lot of time doing hashtable ops. That's a different workload from kotlar's test, but it might be helpful. When simdhash was introduced it was a significant improvement to browser startup time, and according to measurements someone else took, Android improved too. I'm guessing there's significant variance in how fast the NEON version is depending on which manufacturer's arm64 chip you're using.
Sorry for not stating it explicitly; I've updated the description. This optimization can improve AOT compilation (build time) of the dedup assembly of a MAUI template app in debug config on a macOS-arm64 host by ~80%. It will also likely improve execution performance on arm64.
Aside: there seems to be some overlap between the data structures in the JIT (e.g. hashbv) and those in src/native/containers.
^ @kg |
We would want to measure JIT throughput to validate the improvement. It is not obvious to me that it would be a JIT throughput improvement on average.
I think JIT throughput, measured by |
hashbv looks pretty well-optimized to me already and seems to be a very specialized container, unless I'm missing something. It's certainly possible we could try to vectorize it using SIMD instructions instead of the 32- or 64-bit words it currently uses. Is that what you were thinking?
I was thinking that if there are redundancies, we could link the src/native/containers object into the JIT to share implementations, as we do with src/native/minipal. (containers could be moved under minipal)
Then it would not be possible to use hashbv elsewhere (can't DRY).
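As a rough sketch of the "vectorize it with SIMD instead of 32/64-bit words" idea mentioned above (hypothetical helper, not hashbv's actual API), scanning a bit vector 128 bits at a time on arm64 could look like this:

```c
#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: report whether any bit is set in an array of 64-bit
 * words, consuming 128 bits per iteration instead of one word at a time. */
static int
any_bit_set (const uint64_t *words, size_t count)
{
    size_t i = 0;
    for (; i + 2 <= count; i += 2) {
        uint64x2_t v = vld1q_u64 (words + i);
        /* Horizontal max over the 32-bit lanes; nonzero means some bit was set. */
        if (vmaxvq_u32 (vreinterpretq_u32_u64 (v)) != 0)
            return 1;
    }
    /* Scalar tail for an odd word count. */
    for (; i < count; i++)
        if (words[i] != 0)
            return 1;
    return 0;
}
```

Whether that kind of scan actually helps hashbv would depend on its typical sizes and access patterns, so it would need profiling before committing to it.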
Description
The current implementation for arm64 in the .NET runtime isn't optimized. Since arm64 lacks a direct intrinsic equivalent to _mm_movemask_epi8, an emulation is used, which negatively impacts performance (see runtime/src/native/containers/dn-simdhash-arch.h, lines 93 to 124 in 367cf39).
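For illustration only (this is not the code in dn-simdhash-arch.h or the patch proposed here), a common NEON alternative to a bit-per-lane movemask is to narrow the 16-byte comparison result into a 64-bit "nibble mask" with vshrn_n_u16, which avoids the multi-instruction movemask emulation when all the lookup needs is which lanes matched:

```c
#include <arm_neon.h>
#include <stdint.h>

/* Sketch under the assumption that the lookup only needs the index of the
 * first byte lane equal to `needle`, not a bit-exact _mm_movemask_epi8 value.
 * Each matching byte lane contributes one nibble (4 bits) to the mask. */
static inline unsigned
first_matching_lane (uint8x16_t group, uint8_t needle)
{
    uint8x16_t eq = vceqq_u8 (group, vdupq_n_u8 (needle));
    /* Shift each 16-bit pair of comparison bytes right by 4 and narrow,
     * producing a 64-bit value where bits [4*i, 4*i+3] reflect byte lane i. */
    uint8x8_t  narrowed = vshrn_n_u16 (vreinterpretq_u16_u8 (eq), 4);
    uint64_t   mask = vget_lane_u64 (vreinterpret_u64_u8 (narrowed), 0);
    if (mask == 0)
        return 16; /* no lane matched */
    return (unsigned)(__builtin_ctzll (mask) >> 2);
}
```

As discussed in the comments above, whether this beats the existing emulation in practice still needs to be measured on the target arm64 chips.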
This optimization can improve AOT compilation (build time) of a MAUI template app in debug config on a macOS-arm64 host by ~80%. The same lookup path is also hit at runtime (e.g., g_hash_table_lookup).
Alternative implementations
Tasks