Fix crashes caused by unaligned memory stores in stream_mem kernel #650

Merged: 2 commits, Nov 27, 2024

fairydreaming

This PR replaces the movntdq instructions in the stream_mem kernel with a combination of:

  • unpcklpd instructions that pack the scalar double-precision floating-point values from FPR1 and FPR2 into the FPR1 register, and the values from FPR3 and FPR4 into the FPR3 register
  • movntpd instructions that store the values from the FPR1 and FPR3 registers to memory
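The pack-then-store pattern described in the list above can be sketched in C with SSE2 intrinsics. This is only an illustration, not the kernel code from the PR: the helper name `store4_nt` and its signature are hypothetical, `_mm_unpacklo_pd` and `_mm_stream_pd` are the intrinsics that normally compile to unpcklpd and movntpd, and the plain-store fallback exists only so the sketch builds on non-x86 targets.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2: _mm_set_sd, _mm_unpacklo_pd, _mm_stream_pd */
#endif

/* Hypothetical helper: write four scalar results to dst using two
 * 16-byte non-temporal stores. dst must be 16-byte aligned. */
static void store4_nt(double *dst, double v0, double v1, double v2, double v3)
{
#if defined(__SSE2__)
    /* Each scalar sits in the low half of its own xmm register,
     * like FPR1..FPR4 in the kernel description. */
    __m128d r1 = _mm_set_sd(v0);
    __m128d r2 = _mm_set_sd(v1);
    __m128d r3 = _mm_set_sd(v2);
    __m128d r4 = _mm_set_sd(v3);

    /* unpcklpd: combine the low halves, r1 = {v0, v1}, r3 = {v2, v3}. */
    r1 = _mm_unpacklo_pd(r1, r2);
    r3 = _mm_unpacklo_pd(r3, r4);

    /* movntpd: one aligned 16-byte non-temporal store per register pair. */
    _mm_stream_pd(dst, r1);
    _mm_stream_pd(dst + 2, r3);
#else
    /* Assumption: portable fallback for targets without SSE2. */
    dst[0] = v0; dst[1] = v1; dst[2] = v2; dst[3] = v3;
#endif
}
```

With this layout, four scalar results need two 16-byte stores at byte offsets 0 and 16, so every store address stays 16-byte aligned as long as the destination pointer is.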

The original kernel caused segmentation faults because the 16-byte memory alignment requirement of movntdq was not met: the memory offsets in the kernel were increased in steps of 8 bytes.
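The alignment problem can be checked with a little address arithmetic. The sketch below is an assumption-laden illustration: the function names are hypothetical, the base address is made up, and the offsets 0, 8, 16, 24 are inferred from the statement that offsets grew by 8 bytes, not copied from the kernel source.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if a 16-byte store at base + offset_bytes meets the 16-byte
 * alignment that movntdq and movntpd require, 0 otherwise. */
static int is_store_aligned(uintptr_t base, size_t offset_bytes)
{
    return ((base + offset_bytes) % 16u) == 0;
}

/* Counts how many of n stores, placed step_bytes apart starting at base,
 * would violate the 16-byte alignment requirement. */
static int count_misaligned(uintptr_t base, size_t step_bytes, int n)
{
    int bad = 0;
    for (int i = 0; i < n; i++) {
        if (!is_store_aligned(base, (size_t)i * step_bytes)) {
            bad++;
        }
    }
    return bad;
}
```

For a 16-byte aligned base, four stores spaced 8 bytes apart leave two of them misaligned (offsets 8 and 24), while two stores spaced 16 bytes apart are all aligned.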

I also corrected the kernel description and the INSTR_LOOP and UOPS values. Please check whether the numbers are correct.

Fixes #649.

@TomTheBear
Member

Thx for the PR.

The INSTR_LOOP is correct. You have 16 instructions in the loop plus the loop increment, the compare and the jump instruction. For the UOPS, I'm counting only 22. Each of the instructions is at least one uop. The addsd instructions load from a memory location which adds a uop each. The loop increment is one uop but the compare&jump instructions are merged into one uop. The movnt instructions are one uop on all architectures I checked because we store from an xmm register (AMD Zen partly uses 2 uops if ymm or zmm register).
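The counting rule in this comment can be written out as a small calculation. The function names are hypothetical, and the number of addsd instructions with a memory operand (four) is an assumption: the comment does not state it, but it is the value that makes the total come out to 22.

```c
/* INSTR_LOOP: instructions in the loop body plus the loop increment,
 * the compare and the jump. */
static int instr_loop(int body_instructions)
{
    return body_instructions + 3;
}

/* UOPS: one uop per body instruction, one extra uop for each instruction
 * that also loads from memory (the addsd instructions here), one uop for
 * the loop increment, and one uop for the fused compare and jump. */
static int uops_loop(int body_instructions, int memory_operand_instructions)
{
    const int increment_uops = 1;
    const int fused_compare_jump_uops = 1;
    return body_instructions + memory_operand_instructions
         + increment_uops + fused_compare_jump_uops;
}
```

Under these assumptions, `instr_loop(16)` gives 19 and `uops_loop(16, 4)` gives 22, matching the values discussed in the review.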

@fairydreaming force-pushed the stream-mem-alignment-fix branch from 3805403 to 5ead1c7 on November 26, 2024 14:54
@fairydreaming
Author

> The INSTR_LOOP is correct. You have 16 instructions in the loop plus the loop increment, the compare and the jump instruction. For the UOPS, I'm counting only 22. Each of the instructions is at least one uop. The addsd instructions load from a memory location which adds a uop each. The loop increment is one uop but the compare&jump instructions are merged into one uop. The movnt instructions are one uop on all architectures I checked because we store from an xmm register (AMD Zen partly uses 2 uops if ymm or zmm register).

OK, I corrected it to 22.

@TomTheBear merged commit 653455d into RRZE-HPC:master on Nov 27, 2024
2 checks passed
Successfully merging this pull request may close these issues.

[BUG] Segmentation Fault in likwid-bench when executing stream_mem benchmark on Epyc 9374F