[Mono AOT] Improve dn_simdhash lookup on arm64 #113074

Open
1 of 3 tasks
Tracked by #110833
kotlarmilos opened this issue Mar 3, 2025 · 12 comments
kotlarmilos (Member) commented Mar 3, 2025

Description

The current dn_simdhash lookup implementation for arm64 in the .NET runtime isn't optimized. Since arm64 lacks a direct intrinsic equivalent to _mm_movemask_epi8, an emulation is used, which negatively impacts performance:

// returns an index in range 0-13 on match, 14-32 if no match
static DN_FORCEINLINE(uint32_t)
find_first_matching_suffix_simd (
    dn_simdhash_search_vector needle,
    // Only used by the vectorized implementations; discarded by scalar.
    dn_simdhash_suffixes haystack
) {
#if defined(__wasm_simd128__)
    return ctz(wasm_i8x16_bitmask(wasm_i8x16_eq(needle.vec, haystack.vec)));
#elif defined(_M_AMD64) || defined(_M_X64) || (_M_IX86_FP == 2) || defined(__SSE2__)
    return ctz(_mm_movemask_epi8(_mm_cmpeq_epi8(needle.m128, haystack.m128)));
#elif defined(__ARM_NEON)
    dn_simdhash_suffixes match_vector;
    // Completely untested.
    static const dn_simdhash_suffixes byte_mask = {
        { 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 }
    };
    union {
        uint8_t b[4];
        uint32_t u;
    } msb;
    match_vector.vec = vceqq_u8(needle.vec, haystack.vec);
    dn_simdhash_suffixes masked;
    masked.vec = vandq_u8(match_vector.vec, byte_mask.vec);
    msb.b[0] = vaddv_u8(vget_low_u8(masked.vec));
    msb.b[1] = vaddv_u8(vget_high_u8(masked.vec));
    return ctz(msb.u);
#else
    dn_simdhash_assert(!"Scalar fallback should be in use here");
    return 32;
#endif
}

This optimization can improve AOT compilation time (build time) of the dedup assembly for a MAUI template app in debug config on a macOS-arm64 host by ~80%:

  • SIMD emulation implementation:
    • AOT compilation of the dedup assembly: 247,423 ms
    • Isolated lookup for 1,000,000 iterations: 172 ms
  • Software lookup implementation (g_hash_table_lookup):
    • AOT compilation of the dedup assembly: 47,692 ms
    • Isolated lookup for 1,000,000 iterations: 66 ms
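
For context on how such an isolated-lookup figure could be reproduced, here is a minimal, illustrative microbenchmark of the software baseline (the key set, iteration pattern, and timing code are assumptions, not the harness actually used for the numbers above):

#include <glib.h>
#include <stdio.h>
#include <time.h>

int main (void)
{
    // Populate a small table of string keys, loosely standing in for the
    // method/type names the AOT compiler hashes.
    GHashTable *table = g_hash_table_new (g_str_hash, g_str_equal);
    char *keys[64];
    for (int i = 0; i < 64; i++) {
        keys[i] = g_strdup_printf ("method_%d", i);
        g_hash_table_insert (table, keys[i], GINT_TO_POINTER (i));
    }

    struct timespec start, end;
    volatile gpointer sink = NULL;
    clock_gettime (CLOCK_MONOTONIC, &start);
    for (int i = 0; i < 1000000; i++)
        sink = g_hash_table_lookup (table, keys[i % 64]);
    clock_gettime (CLOCK_MONOTONIC, &end);
    (void)sink;

    double ms = (end.tv_sec - start.tv_sec) * 1000.0 +
                (end.tv_nsec - start.tv_nsec) / 1e6;
    printf ("1,000,000 lookups: %.1f ms\n", ms);
    return 0;
}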

Alternative implementations
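
One commonly used alternative for movemask emulation on arm64 is the vshrn_n_u16 "narrowing shift" trick, which produces a 64-bit nibble mask in two instructions instead of the byte-mask/horizontal-add sequence above. The sketch below is illustrative only (it takes raw uint8x16_t parameters instead of the dn_simdhash vector types, and the function name and no-match sentinel are made up); it is not necessarily the variant that would land in the runtime:

#include <arm_neon.h>
#include <stdint.h>

static inline uint32_t
find_first_matching_suffix_neon_shrn (uint8x16_t needle, uint8x16_t haystack)
{
    uint8x16_t match = vceqq_u8 (needle, haystack);
    // Treat the 16 match bytes as 8 16-bit lanes, shift right by 4, and narrow
    // back to bytes: every original byte lane ends up as one 4-bit nibble.
    uint8x8_t nibbles = vshrn_n_u16 (vreinterpretq_u16_u8 (match), 4);
    uint64_t mask = vget_lane_u64 (vreinterpret_u64_u8 (nibbles), 0);
    if (mask == 0)
        return 32; // no match; mirrors the existing "no match" sentinel
    // The index of the lowest set bit divided by 4 is the matching lane index.
    return (uint32_t)(__builtin_ctzll (mask) / 4);
}

Because every byte lane maps to exactly one nibble of the 64-bit mask, recovering the lane index only needs a count-trailing-zeros and a divide (or shift) by 4.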

Tasks

dotnet-policy-service bot added the untriaged label Mar 3, 2025
kotlarmilos removed the untriaged label Mar 3, 2025
kotlarmilos self-assigned this Mar 3, 2025
kotlarmilos (Member, Author)

@kg Could we get reasonable performance on arm64 using the existing intrinsics or do you suggest using a software fallback?

kotlarmilos added this to the 10.0.0 milestone Mar 3, 2025
kg (Member) commented Mar 3, 2025 via email

kotlarmilos (Member, Author)

> I don't have access to "real" arm64 hw for testing this though, is a VM in the cloud accurate enough?

I think so. I can help with testing; ping me directly.

> I'm guessing we care about Mac or iOS here more?

Both: macOS for AOT compilation and Apple mobile for the runtime.

jkotas (Member) commented Mar 3, 2025

> Tests on a macOS-arm64 host of a MAUI template app using debug config with perf checkpoints

I am not sure how to interpret these numbers. How do they translate into something user observable? I am looking for a statement like: This optimization can improve <some customer observable metric - is it app startup time?> by <how much?>.

kg (Member) commented Mar 3, 2025

> > Tests on a macOS-arm64 host of a MAUI template app using debug config with perf checkpoints
>
> I am not sure how to interpret these numbers. How do they translate into something user observable? I am looking for a statement like: This optimization can improve <some customer observable metric - is it app startup time?> by <how much?>.

FWIW, an old Blazor startup profile I had lying around showed 18% of total CPU samples inside hashing operations (both simdhash and ghashtable); the Mono runtime's startup path spends a lot of time doing hashtable ops. That's different from kotlar's test workload, but it might be helpful.

As a result, when simdhash was introduced it was a significant improvement to browser startup time, and according to measurements someone else took, Android had improved too. I'm guessing there's significant variance in how fast the NEON version is depending on which manufacturer's arm64 chip you're using.

kotlarmilos (Member, Author) commented Mar 4, 2025

> I am not sure how to interpret these numbers. How do they translate into something user observable? I am looking for a statement like: This optimization can improve <some customer observable metric - is it app startup time?> by <how much?>.

Sorry for not stating it explicitly; I've updated the description. This optimization can improve AOT compilation time (build time) of the dedup assembly for a MAUI template app in debug config on a macOS-arm64 host by ~80%. It will also likely improve execution performance on arm64.

am11 (Member) commented Mar 16, 2025

An aside: there seems to be some overlap between the data structures in src/coreclr/jit{/jitstd,../hashbv.h} and src/native/containers. Perhaps we could consolidate by using src/native/containers in jitstd, which might serendipitously improve the implementation by leveraging the best of both worlds.

kotlarmilos (Member, Author)

^ @kg

jkotas (Member) commented Mar 19, 2025

> Perhaps we could consolidate by using src/native/containers in jitstd, which might serendipitously improve

We would want to measure JIT throughput to validate the improvement. It is not obvious to me that it would be a JIT throughput improvement on average.

am11 (Member) commented Mar 19, 2025

> We would want to measure JIT throughput to validate the improvement. It is not obvious to me that it would be a JIT throughput improvement on average.

I think JIT throughput, as measured by the runtime-coreclr superpmi-diffs pipeline (run as part of JIT PRs), would clarify it.

kg (Member) commented Mar 19, 2025

hashbv looks pretty well-optimized to me already and seems to be a very specialized container, unless I'm missing something. It's certainly possible we could try to vectorize it using SIMD instructions instead of the 32- or 64-bit words it currently uses; is that what you were thinking?
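
To make that concrete, here is a rough sketch of what widening a word-at-a-time bit-vector operation to NEON could look like (the function names and the flat uint64_t layout are illustrative; hashbv's actual node layout is not shown, and a real implementation would handle odd-length tails):

#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

// Scalar version: one 64-bit word per iteration.
static void bv_or_scalar (uint64_t *dst, const uint64_t *src, size_t words)
{
    for (size_t i = 0; i < words; i++)
        dst[i] |= src[i];
}

// NEON version: two 64-bit words (128 bits) per iteration.
// Assumes 'words' is even for simplicity.
static void bv_or_neon (uint64_t *dst, const uint64_t *src, size_t words)
{
    for (size_t i = 0; i < words; i += 2) {
        uint64x2_t a = vld1q_u64 (dst + i);
        uint64x2_t b = vld1q_u64 (src + i);
        vst1q_u64 (dst + i, vorrq_u64 (a, b));
    }
}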

am11 (Member) commented Mar 19, 2025

I was thinking that, if there are redundancies, we could link the src/native/containers object into the JIT to share implementations, as we do with src/native/minipal. (containers could be moved under minipal.)

> very specialized container

Then it would not be possible to use hashbv elsewhere (can't DRY).
