Crash (FPE) in check_hip.exe on LUMI #1003

Closed
valassi opened this issue Sep 17, 2024 · 3 comments · Fixed by #1006

valassi commented Sep 17, 2024

I am running tests on AMD GPUs on LUMI (#998).

I am getting a new, very bizarre crash:

[valassia@nid007963 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp > gdb --args /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.hip_d_inl0_hrd0/check_hip.exe --common -p 2 64 2
GNU gdb (GDB; SUSE Linux Enterprise 15) 12.1
...
Reading symbols from /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.hip_d_inl0_hrd0/check_hip.exe...
(gdb) set style enabled off
(gdb) run
...
Thread 1 "check_hip.exe" received signal SIGFPE, Arithmetic exception.
0x0000155555508473 in mg5amcGpu::EventStatistics::operator+=(mg5amcGpu::EventStatistics const&) () from /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.hip_d_inl0_hrd0/../../../lib/build.hip_d_inl0_hrd0/libmg5amc_gg_ttx_hip.so
Missing separate debuginfos, use: zypper install comgr-debuginfo-2.6.0.60003-sles154.131.x86_64 hip-runtime-amd-debuginfo-6.0.32831.60003-sles154.131.x86_64 hsa-rocr-debuginfo-1.12.0.60003-sles154.131.x86_64 libdrm2-debuginfo-2.4.114-150500.3.2.x86_64 libdrm_amdgpu1-debuginfo-2.4.114-150500.3.2.x86_64 libelf1-debuginfo-0.185-150400.5.3.1.x86_64 libgcc_s1-debuginfo-13.2.1+git7813-150000.1.6.1.x86_64 libncurses6-debuginfo-6.1-150000.5.20.1.x86_64 libnuma1-debuginfo-2.0.14.20.g4ee5e0c-150400.1.24.x86_64 libstdc++6-debuginfo-13.2.1+git7813-150000.1.6.1.x86_64 libz1-debuginfo-1.2.13-150500.4.3.1.x86_64 libzstd1-debuginfo-1.5.0-150400.3.3.1.x86_64
(gdb) where
#0  0x0000155555508473 in mg5amcGpu::EventStatistics::operator+=(mg5amcGpu::EventStatistics const&) ()
   from /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.hip_d_inl0_hrd0/../../../lib/build.hip_d_inl0_hrd0/libmg5amc_gg_ttx_hip.so
#1  0x000015555550810d in mg5amcGpu::CrossSectionKernelHost::updateEventStatistics(bool) ()
   from /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.hip_d_inl0_hrd0/../../../lib/build.hip_d_inl0_hrd0/libmg5amc_gg_ttx_hip.so
#2  0x000000000021517c in main ()

valassi commented Sep 17, 2024

And... this is yet another FPE crash that disappears in debug mode. I will most likely add another volatile, but this is becoming too much.
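
For context, one plausible mechanism behind an FPE that shows up only in optimized builds is speculation: by default clang assumes floating-point exceptions are not trapped, so the optimizer may evaluate a guarded division before the guard is checked. The sketch below shows the kind of volatile workaround alluded to above. The function name and its arguments are hypothetical and not taken from madgraph4gpu, and nothing in this thread establishes that this is the actual cause here (a later commit referenced below reports that adding volatile did not work and was reverted).

```cpp
#include <cassert>

// Hypothetical helper (an assumption for illustration, not madgraph4gpu code):
// mean weight over the accepted events, returning 0 when nothing was accepted.
inline double meanWeight( double sumWeights, int nAccepted )
{
  assert( nAccepted >= 0 ); // precondition of this sketch
  // A volatile read is an observable side effect: the compiler may not perform
  // it speculatively, so the division that depends on the second read of n
  // cannot be hoisted above the n == 0 guard.
  volatile int n = nAccepted;
  if( n == 0 ) return 0.;
  return sumWeights / n;
}
```

The cost is that the volatile accesses also block legitimate optimizations of this function, which is one reason such workarounds do not scale well.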

@valassi valassi self-assigned this Sep 17, 2024

valassi commented Sep 17, 2024

OK, this magically fixes it:

    // Combine two EventStatistics                                                                                                          
#if __HIP_CLANG_ONLY__
    // Disable optimizations for this function in HIPCC (work around FPE crash #1003)                                                       
    // See https://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-selectively-disabling-optimization                            
    __attribute__((optnone))
#endif
    EventStatistics& operator+=( const EventStatistics& stats )
    {

See https://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-selectively-disabling-optimization

See issue 3653 in https://github.com/microsoft/DeepSpeed/issues for __HIP_CLANG_ONLY__
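
To make the excerpt above easier to read in isolation, here is a minimal self-contained sketch of the same pattern. The struct EventStatisticsSketch and all of its members are hypothetical stand-ins, not the real EventStatistics class. Only the guard and the attribute placement follow the snippet quoted above.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>

// Hypothetical stand-in for the real class (member names are assumptions).
struct EventStatisticsSketch
{
  long nevt = 0;                                        // number of events
  double sumME = 0.;                                    // sum of matrix elements
  double minME = std::numeric_limits<double>::max();    // smallest value seen
  double maxME = std::numeric_limits<double>::lowest(); // largest value seen

  // Mean value; the guard avoids a division by zero for an empty sample
  double meanME() const { return nevt > 0 ? sumME / nevt : 0.; }

  // Combine two EventStatisticsSketch objects
#if __HIP_CLANG_ONLY__
  // Disable optimizations for this one function in HIP clang only,
  // as in the workaround quoted above
  __attribute__( ( optnone ) )
#endif
  EventStatisticsSketch& operator+=( const EventStatisticsSketch& stats )
  {
    assert( stats.nevt >= 0 ); // precondition of this sketch
    nevt += stats.nevt;
    sumME += stats.sumME;
    minME = std::min( minME, stats.minME );
    maxME = std::max( maxME, stats.maxME );
    return *this;
  }
};
```

On compilers that do not define __HIP_CLANG_ONLY__ the guard evaluates to 0 and the function is compiled normally, so the attribute only affects HIP clang builds. Because optnone applies to a single function, the rest of the translation unit keeps its optimization level.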

valassi added a commit to valassi/madgraph4gpu that referenced this issue Sep 18, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Sep 18, 2024
…graph5#1003 by disabling SIMD in C++ objects for HIP builds - it does not help, will revert
valassi added a commit to valassi/madgraph4gpu that referenced this issue Sep 18, 2024
Revert "[amd] in gg_tt.mad cudacpp.mk, try to work around the HIP crashes madgraph5#1003 by disabling SIMD in C++ objects for HIP builds - it does not help, will revert"
This reverts commit 2fc102767ecc6ae2e95770f4cff18e5c08d31fc1.
valassi added a commit to valassi/madgraph4gpu that referenced this issue Sep 18, 2024
…h5#1003 by disabling SIMD in C++ objects built with hipcc - it also does not help, will revert
valassi added a commit to valassi/madgraph4gpu that referenced this issue Sep 18, 2024
Revert "[amd] in gg_tt.mad cudacpp.mk, try to work around HIP crashes madgraph5#1003 by disabling SIMD in C++ objects built with hipcc - it also does not help, will revert"
This reverts commit 1e225fd7068eb0c67377f55c7e910af945a4d963.
valassi added a commit to valassi/madgraph4gpu that referenced this issue Sep 18, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Sep 18, 2024
Revert "[amd] in gg_tt.mad EventStatistics.h, try to work around HIP crashes madgraph5#1003 by adding volatile - it does not work, will revert"
This reverts commit e2591da7b159b6d133a7cff7a4b583a8ad34d563.
valassi added a commit to valassi/madgraph4gpu that referenced this issue Sep 18, 2024
…h5#1003 by printing out sum.nevtOK() - this avoids the crash but is not practical, will revert
valassi added a commit to valassi/madgraph4gpu that referenced this issue Sep 18, 2024
Revert "[amd] in gg_tt.mad EventStatistics.h, work around HIP crashes madgraph5#1003 by printing out sum.nevtOK() - this avoids the crash but is not practical, will revert"
This reverts commit 725dae88d89a61d005a0031c9462fe95f4ec6728.
valassi added a commit to valassi/madgraph4gpu that referenced this issue Sep 18, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Sep 18, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Sep 18, 2024
…rash madgraph5#1005 on clang16 by disabling optimizations for operator+=

This extends to any clang the previous workaround for madgraph5#1003 which had been defined only for HIP clang
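
A guard that covers any clang rather than only HIP clang could be sketched as follows. The macro name SKETCH_NOOPT and the Counter struct are hypothetical, and the actual change in the referenced commit may be written differently.

```cpp
#include <cassert>

// Hypothetical macro (an assumption): expands to optnone on any clang-based
// compiler and to nothing elsewhere.
#if defined( __clang__ )
#define SKETCH_NOOPT __attribute__( ( optnone ) )
#else
#define SKETCH_NOOPT
#endif

// Hypothetical accumulator used only to show where the macro goes
struct Counter
{
  int nOK = 0;       // number of accepted entries
  double sumOK = 0.; // sum over accepted entries

  SKETCH_NOOPT Counter& operator+=( const Counter& other )
  {
    assert( other.nOK >= 0 ); // precondition of this sketch
    nOK += other.nOK;
    sumOK += other.sumOK;
    return *this;
  }
};
```
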
valassi added a commit to valassi/madgraph4gpu that referenced this issue Sep 18, 2024
@valassi valassi linked a pull request Sep 18, 2024 that will close this issue

valassi commented Sep 19, 2024

This is fixed in PR #1006. Closing.

@valassi valassi closed this as completed Sep 19, 2024