[Arm64/Windows] Simplify JIT_MemSet & JIT_MemCpy #17537
Conversation
JIT_MemSet alignment code was definitely broken for some unaligned cases. JIT_MemCpy likely had the same issue. Simplify the implementation to reduce the maintenance burden.
I do not have a Windows Arm64 platform to test this. Relying on CI.
; We need to provide our own implementation of memset instead of using the ones in crt because crt implementation does not guarantee
; that aligned 8/4/2 - byte memory will be written atomically. This is required because members in a struct can be read atomically
; and their values should be written atomically.
;
Remove misleading bogus comment. There is no reason to believe these accesses need to be aligned on Arm64 Windows when they are not aligned on other platforms.
; // Use math to avoid introducing more unpredictable branches
; // Due to inherent mod in lsr, ~7 is used instead of ~0 to handle count == 0
; // Note logic will fail if count >= (1 << 61). But this exceeds max physical memory for arm64
; uint8_t align = (dst & 0x7) & (~uint64_t(7) >> (countLeadingZeros(count) mod 64))
There were two bugs in the pseudo code for calculating align. The correct calculation should have been:

    uint8_t align = (8 - (dst & 0x7)) & (~uint64_t(7) >> (countLeadingZeros(count/2) % 64))

This escaped due to incomplete unit testing. Test coverage also failed to hit the problematic corners.
b JIT_MemSet_0xa8
JIT_MemSet_0xa0
stp x8, x8, [x0], #16
JIT_MemSet_0xa8
Delete code to match pseudo code
;
; count += dc_zva_size;
; }
;
Remove attempts to align dst and attempts to use dc zva.
Based on comments elsewhere, the distribution of use cases tends to be biased toward small objects; 2013 experiments showed 95% smaller than 64 bytes. Use of dc zva adds complexity and a series of branches which may not be predictable, and it could not be used on A53, A57, or Centriq until the length reached 128 bytes in this design.
Removing dc zva is probably a win. Adding it was probably a mistake on my part.
ands w1, w1, #0xff
orr w1, w1, w1, lsl #8
orr w1, w1, w1, lsl #0x10
orr x1, x1, x1, lsl #0x20
Stopped using w8. Use w1 instead.
subs x2, x2, #16
bge JIT_MemSet_0xa0
JIT_MemSet_0xb0
Used readable labels
; count-=4;
; }
; }
;
Remove attempts to align
str w8, [x0], #4
sub x2, x2, #4
b JIT_MemCpy_0xa8
JIT_MemCpy_0xa0
Deleting to match pseudo code.
LGTM, thanks for simplifying the code and fixing the bugs around alignment.
@janvorli @jkotas PTAL
@dotnet/arm64-contrib FYI
This is the Windows equivalent of #17536
Should be required to fix probable issues on Windows like #17167, #17168, and #17169.