[Mono AOT] Improve dn_simdhash lookup on arm64 #113074

Open
1 of 3 tasks
Tracked by #110833
kotlarmilos opened this issue Mar 3, 2025 · 12 comments
kotlarmilos (Member) commented Mar 3, 2025

Description

The current dn_simdhash lookup implementation for arm64 in the .NET runtime isn't optimized. Since arm64 lacks a direct intrinsic equivalent to _mm_movemask_epi8, an emulation is used, which negatively impacts performance:

// returns an index in range 0-13 on match, 14-32 if no match
static DN_FORCEINLINE(uint32_t)
find_first_matching_suffix_simd (
    dn_simdhash_search_vector needle,
    // Only used by the vectorized implementations; discarded by scalar.
    dn_simdhash_suffixes haystack
) {
#if defined(__wasm_simd128__)
    return ctz(wasm_i8x16_bitmask(wasm_i8x16_eq(needle.vec, haystack.vec)));
#elif defined(_M_AMD64) || defined(_M_X64) || (_M_IX86_FP == 2) || defined(__SSE2__)
    return ctz(_mm_movemask_epi8(_mm_cmpeq_epi8(needle.m128, haystack.m128)));
#elif defined(__ARM_NEON)
    dn_simdhash_suffixes match_vector;
    // Completely untested.
    static const dn_simdhash_suffixes byte_mask = {
        { 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 }
    };
    union {
        uint8_t b[4];
        uint32_t u;
    } msb;
    match_vector.vec = vceqq_u8(needle.vec, haystack.vec);
    dn_simdhash_suffixes masked;
    masked.vec = vandq_u8(match_vector.vec, byte_mask.vec);
    msb.b[0] = vaddv_u8(vget_low_u8(masked.vec));
    msb.b[1] = vaddv_u8(vget_high_u8(masked.vec));
    return ctz(msb.u);
#else
    dn_simdhash_assert(!"Scalar fallback should be in use here");
    return 32;
#endif
}

This optimization can improve AOT compilation time (build time) of the dedup assembly for a MAUI template app in debug config on a macOS-arm64 host by ~80%:

  • SIMD emulation implementation:
    • AOT compilation of the dedup assembly: 247,423 ms
    • Isolated lookup for 1,000,000 iterations: 172 ms
  • Software lookup implementation (g_hash_table_lookup):
    • AOT compilation of the dedup assembly: 47,692 ms
    • Isolated lookup for 1,000,000 iterations: 66 ms
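
For context on how such an isolated-lookup figure could be reproduced, here is a minimal, illustrative microbenchmark of the software baseline (the key set, iteration pattern, and timing code are assumptions, not the harness actually used for the numbers above):

#include <glib.h>
#include <stdio.h>
#include <time.h>

int main (void)
{
    // Populate a small table of string keys, loosely standing in for the
    // method/type names the AOT compiler hashes.
    GHashTable *table = g_hash_table_new (g_str_hash, g_str_equal);
    char *keys[64];
    for (int i = 0; i < 64; i++) {
        keys[i] = g_strdup_printf ("method_%d", i);
        g_hash_table_insert (table, keys[i], GINT_TO_POINTER (i));
    }

    struct timespec start, end;
    volatile gpointer sink = NULL;
    clock_gettime (CLOCK_MONOTONIC, &start);
    for (int i = 0; i < 1000000; i++)
        sink = g_hash_table_lookup (table, keys[i % 64]);
    clock_gettime (CLOCK_MONOTONIC, &end);
    (void)sink;

    double ms = (end.tv_sec - start.tv_sec) * 1000.0 +
                (end.tv_nsec - start.tv_nsec) / 1e6;
    printf ("1,000,000 lookups: %.1f ms\n", ms);
    return 0;
}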

Alternative implementations
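
One commonly used alternative for movemask emulation on arm64 is the vshrn_n_u16 "narrowing shift" trick, which produces a 64-bit nibble mask in two instructions instead of the byte-mask/horizontal-add sequence above. The sketch below is illustrative only (it takes raw uint8x16_t parameters instead of the dn_simdhash vector types, and the function name and no-match sentinel are made up); it is not necessarily the variant that would land in the runtime:

#include <arm_neon.h>
#include <stdint.h>

static inline uint32_t
find_first_matching_suffix_neon_shrn (uint8x16_t needle, uint8x16_t haystack)
{
    uint8x16_t match = vceqq_u8 (needle, haystack);
    // Treat the 16 match bytes as 8 16-bit lanes, shift right by 4, and narrow
    // back to bytes: every original byte lane ends up as one 4-bit nibble.
    uint8x8_t nibbles = vshrn_n_u16 (vreinterpretq_u16_u8 (match), 4);
    uint64_t mask = vget_lane_u64 (vreinterpret_u64_u8 (nibbles), 0);
    if (mask == 0)
        return 32; // no match; mirrors the existing "no match" sentinel
    // The index of the lowest set bit divided by 4 is the matching lane index.
    return (uint32_t)(__builtin_ctzll (mask) / 4);
}

Because every byte lane maps to exactly one nibble of the 64-bit mask, recovering the lane index only needs a count-trailing-zeros and a divide (or shift) by 4.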

Tasks

dotnet-policy-service bot added the untriaged label Mar 3, 2025
kotlarmilos removed the untriaged label Mar 3, 2025
kotlarmilos self-assigned this Mar 3, 2025
kotlarmilos (Member, Author)

@kg Could we get reasonable performance on arm64 using the existing intrinsics or do you suggest using a software fallback?

kotlarmilos added this to the 10.0.0 milestone Mar 3, 2025
kg (Member) commented Mar 3, 2025 via email

kotlarmilos (Member, Author)

> I don't have access to "real" arm64 hw for testing this though, is a VM in the cloud accurate enough?

I think so. I can help with testing; ping me directly.

> I'm guessing we care about Mac or iOS here more?

Both: macOS for AOT compilation and Apple mobile for the runtime.

jkotas (Member) commented Mar 3, 2025

> Tests on a macOS-arm64 host of a MAUI template app using debug config with perf checkpoints

I am not sure how to interpret these numbers. How do they translate into something user observable? I am looking for a statement like: This optimization can improve <some customer observable metric - is it app startup time?> by <how much?>.

kg (Member) commented Mar 3, 2025

> > Tests on a macOS-arm64 host of a MAUI template app using debug config with perf checkpoints
>
> I am not sure how to interpret these numbers. How do they translate into something user observable? I am looking for a statement like: This optimization can improve <some customer observable metric - is it app startup time?> by <how much?>.

FWIW, an old Blazor startup profile I had lying around showed 18% of total CPU samples inside hashing operations (both simdhash and ghashtable); the Mono runtime's startup path spends a lot of time doing hashtable ops. That's different from kotlar's test workload, but it might be helpful.

As a result, when simdhash was introduced it was a significant improvement to browser startup time, and according to measurements someone else took, Android had improved too. I'm guessing there's significant variance in how fast the NEON version is depending on which manufacturer's arm64 chip you're using.

kotlarmilos (Member, Author) commented Mar 4, 2025

> I am not sure how to interpret these numbers. How do they translate into something user observable? I am looking for a statement like: This optimization can improve <some customer observable metric - is it app startup time?> by <how much?>.

Sorry for not stating it explicitly; I've updated the description. This optimization can improve AOT compilation time (build time) of the dedup assembly for a MAUI template app in debug config on a macOS-arm64 host by ~80%. It will also likely improve execution performance on arm64.

am11 (Member) commented Mar 16, 2025

An aside: there seems to be some overlap between the data structures in src/coreclr/jit{/jitstd,../hashbv.h} and src/native/containers. Perhaps we could consolidate by using src/native/containers in jitstd, which might serendipitously improve the implementation by leveraging the best of both worlds.

kotlarmilos (Member, Author)

^ @kg

jkotas (Member) commented Mar 19, 2025

> Perhaps we could consolidate by using src/native/containers in jitstd, which might serendipitously improve

We would want to measure JIT throughput to validate the improvement. It is not obvious to me that it would be a JIT throughput improvement on average.

am11 (Member) commented Mar 19, 2025

> We would want to measure JIT throughput to validate the improvement. It is not obvious to me that it would be a JIT throughput improvement on average.

I think JIT throughput, as measured by the runtime-coreclr superpmi-diffs pipeline (run as part of JIT PRs), would clarify it.

kg (Member) commented Mar 19, 2025

hashbv looks pretty well-optimized to me already and seems to be a very specialized container, unless I'm missing something. It's certainly possible we could try to vectorize it using SIMD instructions instead of the 32- or 64-bit words it currently uses; is that what you were thinking?
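
To make that concrete, here is a rough sketch of what widening a word-at-a-time bit-vector operation to NEON could look like (the function names and the flat uint64_t layout are illustrative; hashbv's actual node layout is not shown, and a real implementation would handle odd-length tails):

#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

// Scalar version: one 64-bit word per iteration.
static void bv_or_scalar (uint64_t *dst, const uint64_t *src, size_t words)
{
    for (size_t i = 0; i < words; i++)
        dst[i] |= src[i];
}

// NEON version: two 64-bit words (128 bits) per iteration.
// Assumes 'words' is even for simplicity.
static void bv_or_neon (uint64_t *dst, const uint64_t *src, size_t words)
{
    for (size_t i = 0; i < words; i += 2) {
        uint64x2_t a = vld1q_u64 (dst + i);
        uint64x2_t b = vld1q_u64 (src + i);
        vst1q_u64 (dst + i, vorrq_u64 (a, b));
    }
}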

am11 (Member) commented Mar 19, 2025

I was thinking that, if there are redundancies, we could link the src/native/containers object into the JIT to share implementations, as we do with src/native/minipal. (containers could be moved under minipal.)

> very specialized container

Then it would not be possible to use hashbv elsewhere (can't DRY).
