Crashes with the latest COSMA release #115
Comments
Hi Frederick! Unfortunately, it seems I can't access the CSCS infrastructure anymore. Since this is not using NCCL or GPU-aware MPI, this part should not have changed since the last working version, so I am really puzzled by this. Maybe @teonnik or @simonpintarelli could have a look?
As @simonpintarelli also suggested, let's make sure it doesn't run out of GPU memory by setting:
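A minimal sketch of what that could look like, assuming the limits in question are the GPU tile-size environment variables from the COSMA README (the variable names here are quoted from memory, so please double-check them):

```bash
# Assumed variable names for the GPU tile-size limits; the defaults are reportedly 5000.
# Halving them roughly quarters the size of each device-side tile buffer.
export COSMA_GPU_MAX_TILE_M=2500
export COSMA_GPU_MAX_TILE_N=2500
export COSMA_GPU_MAX_TILE_K=2500
```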
By default these values are 5k, so you can try reducing them. However, the GPU memory footprint has not changed since the last version, so this should not be a problem.
Well, it also fails the regtests, for which the matrix dimensions should be much smaller than 2000. For a few tests, k=0, or a process might not have any local data, depending on the distribution. Can that cause such issues on the GPU only?
I can't reproduce the bug using the miniapps.
I am not familiar with all of them. I can provide more details in the following cases:
In general, only the GPU versions are affected, not the CPU version. The failing tests are mostly the same, but not all of them fail everywhere; for instance, QS/regtest-ri-rpa/RI_RPA_CH3.inp fails on Daint but not on CUDA Pascal. I hope that already provides a few hints.
Meanwhile, there are some more results for larger benchmarks on Daint on GPU (see here). The RPA benchmark is a larger version of the QS/regtest-ri-rpa test set with n=m=4352 and k=196,608. Similar matrix-matrix multiplies occur within the MP2 code where the respective regtests run smoothly (without lr).
Thank you Frederick for more details and thanks Simon for chiming in! @fstein93 regarding your questions above:
Simon has just tried the test cases you mentioned on Piz Daint P100 and couldn't reproduce the error. To make sure that we have the same arguments, it would be really helpful if you could:
Then, Simon could rerun it using the miniapp on Daint. Would that be possible?
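For context, such a reproduction run with the COSMA miniapp on Daint might look roughly like the sketch below; the executable path, rank count, and option names are assumptions (check the miniapp's --help in your build), and the dimensions are simply the RPA benchmark sizes quoted above.

```bash
# Hypothetical invocation; adjust the path, ranks, and options to your build.
srun -n 16 ./build/miniapp/cosma_miniapp -m 4352 -n 4352 -k 196608
```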
It seems I found why the crashes happen: I added a print statement at tiled_mm.cpp:96 and then reran one of the failing tests.
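A rerun of a single failing input looks roughly like the sketch below; the binary path, rank count, and CP2K_ROOT are placeholders rather than the exact command used here.

```bash
# Hypothetical: rerun one failing regtest input with a locally built CUDA binary,
# from inside the regtest directory that contains it.
mpirun -np 2 "${CP2K_ROOT}/exe/local_cuda/cp2k.psmp" H2O-sos-mp2-lr.inp
```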
Looking at the docs, it seems there exist multiple ways to upset that call.
Thanks @oschuett for debugging it! Would it be possible to uncomment those 4 lines from this comment and rerun it? Then we would have all the pdgemm parameters and could run this in isolation.
Voilà: H2O-sos-mp2-lr.txt |
@oschuett Thanks Ole for the output! In the latest commit I have now added the test cases from your output with exactly the same parameters, so that Simon can run them in isolation. However, a few things from your file caught my attention:
The difference is that … Since cp2k anyway calls the …
All errors occur outside of Cholesky decompositions. In some cases (like lr), a Cholesky decomposition was carried out in advance, whereas in other cases (like RPA), the Cholesky decomposition only comes afterwards. The library test does not perform any kind of Cholesky decomposition. Interestingly, the other library tests for PDGEMM do not fail (see here).
Thanks @fstein93 for the clarifications! It seems I misunderstood the output then. Hopefully Simon will be able to reproduce it by running the newly added tests. Btw, do we know if …
You can get the linker line from the regtest report:
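For example, something along these lines should pull it out of a saved report (the file name and the grep pattern are assumptions; adapt them to the tester in question):

```bash
# Hypothetical: search a downloaded regtest/build report for the link line mentioning COSMA.
grep -n -- '-lcosma' regtest_report.txt | head -n 1
```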
Simon managed to reproduce this error within COSMA; we are working on it!
@oschuett just a quick question: after you added those print statements, what is in your line …? I want to see whether the error occurred within round_robin or within round_robin_without_copy_c.
@kabicm the tests passed with the previous version. There is only one which I added recently. |
In general, we use only official releases of all libraries to ensure that the users get properly working libraries. That is also how we proceed with DBCSR. Anyway, the fix is probably also relevant for your user base.
You can open a draft pull request in which you have install_cosma.sh use your master branch. Then we can trigger the CI tests.
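A rough sketch of what such a change inside install_cosma.sh could look like; the variable name and the surrounding download logic are assumptions about the toolchain script, not a verbatim diff:

```bash
# Hypothetical draft-PR change: build COSMA from current master instead of the
# released tarball, so that the CI regtests pick up the unreleased fix.
cosma_ver="master"   # assumed version variable used by the toolchain
rm -rf "COSMA-${cosma_ver}"
git clone --recursive --depth 1 https://github.com/eth-cscs/COSMA.git "COSMA-${cosma_ver}"
# ...then continue with the script's existing cmake configure and build steps.
```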
We would surely make a new release once we are sure this fixes the failing tests. |
It seems the tests are now passing, at least on Pascal. So, I guess we can make a new release now. I will just make a few smaller CMake modifications and then release.
The new version COSMA-v2.6.1 is now released. Let us know if there are any issues! |
I will close this issue now. Feel free to reopen it if there are any problems with the new version COSMA-v2.6.1. |
Dear COSMA developers,
I am one of the CP2K developers and have recently upgraded our scripts to use COSMA 2.6.0 (see discussion cp2k/cp2k#2198). After the upgrade, all of our GPU regtests fail (see https://dashboard.cp2k.org/, testers CRAY-XC50-gnu, Performance CUDA Volta, CUDA Pascal). Our HIP tester does not make use of COSMA's GPU backend yet.
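For reference, the binaries on those testers come out of the CP2K toolchain; a build that enables the COSMA GPU backend looks roughly like this (the exact flags differ per tester, and the GPU architecture below is just an example):

```bash
# Hypothetical toolchain invocation enabling CUDA and building COSMA from source.
./tools/toolchain/install_cp2k_toolchain.sh --enable-cuda --gpu-ver=P100 --with-cosma=install
```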
The typical backtrace looks as follows:
error: GPU API call : invalid argument
terminate called after throwing an instance of 'std::runtime_error'
what(): GPU ERROR
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 0x7f5d6f019d21 in ???
#1 0x7f5d6f018ef5 in ???
#2 0x7f5d6ec7208f in ???
#3 0x7f5d6ec7200b in ???
#4 0x7f5d6ec51858 in ???
#5 0x7f5d8688b910 in ???
#6 0x7f5d8689738b in ???
#7 0x7f5d868973f6 in ???
#8 0x7f5d868976a8 in ???
#9 0x55652e0befd9 in check_runtime_status
#10 0x5565317398d1 in _ZN3gpu25copy_tile_to_device_asyncIdEEvRNS_12tiled_matrixIT_EEPS2_NS_10tile_coordERNS_13device_streamE
at /opt/cp2k-toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:46
#11 0x5565317398d1 in _ZN3gpu25copy_tile_to_device_asyncIdEEvRNS_12tiled_matrixIT_EERNS_13device_bufferIS2_EENS_10tile_coordERNS_11gpu_contextEi
at /opt/cp2k-toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:52
#12 0x556531739d92 in _ZN3gpu11round_robinIdEEvRNS_12tiled_matrixIT_EES4_S4_RNS_13device_bufferIS2_EES7_S7_iiiS2_S2_RNS_9mm_handleIS2_EE
at /opt/cp2k-toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:248
#13 0x55653173ac52 in _ZN3gpu4gemmIdEEvRNS_9mm_handleIT_EEPS2_S5_S5_iiiS2_S2_bb
at /opt/cp2k-toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:468
#14 0x556531702744 in _ZN5cosma14local_multiplyIdEEvPNS_13cosma_contextIT_EEPS2_S5_S5_iiiS2_S2_b
at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/local_multiply.cpp:168
#15 0x5565316e8612 in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RNS_8IntervalES9_S9_S9_mRKNS_8StrategyEPNS_12communicatorES2_S2_
at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/multiply.cpp:381
#16 0x5565316e801c in _ZN5cosma8parallelIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RNS_8IntervalES9_S9_S9_mRKNS_8StrategyEPNS_12communicatorES2_S2_
at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/multiply.cpp:867
#17 0x5565316e87e0 in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RNS_8IntervalES9_S9_S9_mRKNS_8StrategyEPNS_12communicatorES2_S2_
at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/multiply.cpp:408
#18 0x5565316e8a7a in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RKNS_8StrategyEiS2_S2_
at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/multiply.cpp:283
#19 0x5565316c48a3 in _ZN5cosma6pxgemmIdEEvcciiiT_PKS1_iiPKiS3_iiS5_S1_PS1_iiS5_
at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/cosma_pxgemm.cpp:350
Do you have an idea what causes this error? I am happy to share further information if required.