
Crashes with the latest COSMA release #115

Closed
fstein93 opened this issue Jul 17, 2022 · 25 comments

Comments

@fstein93

fstein93 commented Jul 17, 2022

Dear COSMA developers,

I am one of the CP2K developers and have recently upgraded our scripts to use COSMA 2.6.0 (see the discussion in cp2k/cp2k#2198). After the upgrade, all of our GPU regtests fail (see https://dashboard.cp2k.org/, testers CRAY-XC50-gnu, Performance CUDA Volta, CUDA Pascal). Our HIP tester does not make use of COSMA's GPU backend yet.

A typical backtrace looks as follows:

error: GPU API call : invalid argument
terminate called after throwing an instance of 'std::runtime_error'
what(): GPU ERROR

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0 0x7f5d6f019d21 in ???
#1 0x7f5d6f018ef5 in ???
#2 0x7f5d6ec7208f in ???
#3 0x7f5d6ec7200b in ???
#4 0x7f5d6ec51858 in ???
#5 0x7f5d8688b910 in ???
#6 0x7f5d8689738b in ???
#7 0x7f5d868973f6 in ???
#8 0x7f5d868976a8 in ???

#9  0x55652e0befd9 in check_runtime_status
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/util.hpp:17
#10 0x5565317398d1 in _ZN3gpu25copy_tile_to_device_asyncIdEEvRNS_12tiled_matrixIT_EEPS2_NS_10tile_coordERNS_13device_streamE
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:46
#11 0x5565317398d1 in _ZN3gpu25copy_tile_to_device_asyncIdEEvRNS_12tiled_matrixIT_EERNS_13device_bufferIS2_EENS_10tile_coordERNS_11gpu_contextEi
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:52
#12 0x556531739d92 in _ZN3gpu11round_robinIdEEvRNS_12tiled_matrixIT_EES4_S4_RNS_13device_bufferIS2_EES7_S7_iiiS2_S2_RNS_9mm_handleIS2_EE
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:248
#13 0x55653173ac52 in _ZN3gpu4gemmIdEEvRNS_9mm_handleIT_EEPS2_S5_S5_iiiS2_S2_bb
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:468
#14 0x556531702744 in _ZN5cosma14local_multiplyIdEEvPNS_13cosma_contextIT_EEPS2_S5_S5_iiiS2_S2_b
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/local_multiply.cpp:168
#15 0x5565316e8612 in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RNS_8IntervalES9_S9_S9_mRKNS_8StrategyEPNS_12communicatorES2_S2_
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/multiply.cpp:381
#16 0x5565316e801c in _ZN5cosma8parallelIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RNS_8IntervalES9_S9_S9_mRKNS_8StrategyEPNS_12communicatorES2_S2_
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/multiply.cpp:867
#17 0x5565316e87e0 in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RNS_8IntervalES9_S9_S9_mRKNS_8StrategyEPNS_12communicatorES2_S2_
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/multiply.cpp:408
#18 0x5565316e8a7a in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RKNS_8StrategyEiS2_S2_
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/multiply.cpp:283
#19 0x5565316c48a3 in _ZN5cosma6pxgemmIdEEvcciiiT_PKS1_iiPKiS3_iiS5_S1_PS1_iiS5_
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/cosma_pxgemm.cpp:350

Do you have an idea what causes this error? I am happy to share further information if required.

@kabicm
Collaborator

kabicm commented Jul 17, 2022

Hi Frederick,

Unfortunately, it seems I can't access the CSCS infrastructure anymore.

Since this is not using NCCL or GPU-aware MPI, this part should not have changed since the last working version, so I am really puzzled by this.

Maybe @teonnik or @simonpintarelli could have a look?

@kabicm
Collaborator

kabicm commented Jul 17, 2022

As @simonpintarelli also suggested, let's make sure it doesn't run out of GPU memory by setting:

export COSMA_GPU_MAX_TILE_M=2000
export COSMA_GPU_MAX_TILE_N=2000
export COSMA_GPU_MAX_TILE_K=2000

By default these values are 5k, so you can try reducing them.

However, the GPU memory footprint has not changed since the last version, so this should not be a problem.
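
For orientation, here is a rough, hypothetical sketch (not COSMA code; names and the accounting are illustrative only) of how one could estimate the device memory taken by a single set of A/B/C tiles in double precision for the tile sizes set above:

    #include <cstdio>
    #include <cstdlib>

    // Hypothetical helper (not COSMA code): read one of the COSMA_GPU_MAX_TILE_*
    // variables, falling back to the default of 5000 mentioned above.
    static long tile_size(const char* name) {
        const char* value = std::getenv(name);
        return value ? std::atol(value) : 5000;
    }

    int main() {
        long m = tile_size("COSMA_GPU_MAX_TILE_M");
        long n = tile_size("COSMA_GPU_MAX_TILE_N");
        long k = tile_size("COSMA_GPU_MAX_TILE_K");
        // One A (m x k), one B (k x n) and one C (m x n) tile in double precision.
        double bytes = 8.0 * (double(m) * k + double(k) * n + double(m) * n);
        std::printf("one tile set: ~%.0f MiB\n", bytes / (1024.0 * 1024.0));
        return 0;
    }

With the default 5000 tiles this gives roughly 570 MiB per tile set, while 2000 brings it down to about 92 MiB, so shrinking the tiles is a quick way to rule out an out-of-memory condition (the real footprint is presumably larger, since the library overlaps copies and compute with several buffers).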

@fstein93
Author

fstein93 commented Jul 17, 2022

Well, it also fails regtests for which the matrix dimensions should be much smaller than 2000. In a few tests, k=0, or a process might not have any local data, depending on the distribution. Can that cause such issues on the GPU only?

@simonpintarelli
Member

I can't reproduce the bug using the miniapps (test.pdgemm, test.multiply).
@fstein93 Do you know what the matrix sizes in the CP2K regtests are?

@fstein93
Author

fstein93 commented Jul 17, 2022

I am not familiar with all of them. I can provide more details for the following cases:

  1. QS/regtest-ri-rpa: n=m=83, k=76 (H2O); n=m=14, k=0 (!) or k=22 (H); and n=m=97, k=78 or k=104 (CH3).
  2. I will do some checks tomorrow with the lr tests, because there the sizes n=m depend on the numerics.
  3. QS/regtest-gw/G0W0_H2O_PBE_periodic.inp: probably n=m=83, k=148.
  4. LIBTEST/test_cp_fm_gemm_01.inp: check the input file and the source code.

In general, only the GPU versions are affected, not the CPU version. The failing tests are mostly the same, but not all of them fail everywhere; for instance, QS/regtest-ri-rpa/RI_RPA_CH3.inp fails on Daint but not on CUDA Pascal.

I hope that already provides a few hints.

@fstein93
Author

fstein93 commented Jul 17, 2022

Meanwhile, there are some more results for larger benchmarks on Daint on GPU (see here). The RPA benchmark is a larger version of the QS/regtest-ri-rpa test set with n=m=4352 and k=196,608. Similar matrix-matrix multiplies occur within the MP2 code, whose respective regtests (without lr) run smoothly.

@kabicm
Collaborator

kabicm commented Jul 17, 2022

Thank you Frederick for more details and thanks Simon for chiming in!

@fstein93 regarding your questions above:

  • Having k=0 in test cases is not a problem! cosma_pxgemm is well tested for those cases.
  • It is also not a problem if not all ranks own data! COSMA will in fact reduce the number of ranks further if the problem size is too small. Also, in most of the RPA cases, the matrix C is only distributed to a few ranks.

Simon has just tried the test cases you mentioned on Piz Daint (P100) and couldn't reproduce the error. To make sure that we use exactly the same arguments, it would be really helpful if you could uncomment the lines that print the full pdgemm parameters (see the comments below) and rerun.

Then Simon could rerun it using the miniapp on Daint. Would that be possible?

@oschuett

It seems the crashes happen because cudaMemcpy2DAsync is called with invalid arguments.

I added a print statement at tiled_mm.cpp:96 and then ran QS/regtest-sos-mp2-lr/H2O-sos-mp2-lr.inp:

dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 664 spitch: 664 width: 664 height: 83
dpitch: 664 spitch: 664 width: 664 height: 83
dpitch: 664 spitch: 664 width: 664 height: 77  <-- each line appears twice because the test ran with two MPI ranks
dpitch: 664 spitch: 664 width: 664 height: 77

Looking at the docs, it seems there are multiple ways to upset cudaMemcpy2DAsync.
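
For completeness, a minimal sketch (assumed names, not the actual Tiled-MM code) of a wrapper that prints the copy geometry and checks the constraints from the CUDA documentation (both pitches must be at least width, and the pointers must match the copy direction) before calling cudaMemcpy2DAsync:

    #include <cstddef>
    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical debugging wrapper (not the Tiled-MM code): print the copy
    // geometry, check the pitch constraints from the CUDA documentation, and
    // report the error string if cudaMemcpy2DAsync still fails.
    static void checked_copy_2d_async(void* dst, size_t dpitch,
                                      const void* src, size_t spitch,
                                      size_t width, size_t height,
                                      cudaStream_t stream) {
        std::printf("dpitch: %zu spitch: %zu width: %zu height: %zu\n",
                    dpitch, spitch, width, height);
        if (dst == nullptr || src == nullptr || dpitch < width || spitch < width) {
            std::printf("  -> these arguments already violate the documented constraints\n");
        }
        cudaError_t err = cudaMemcpy2DAsync(dst, dpitch, src, spitch, width, height,
                                            cudaMemcpyHostToDevice, stream);
        if (err != cudaSuccess) {
            std::printf("  -> cudaMemcpy2DAsync: %s\n", cudaGetErrorString(err));
        }
    }

Since the pitches printed above look consistent (dpitch = spitch = width, reasonable heights), the invalid argument may come from one of the other parameters (the pointers, the copy direction, or the stream), which the printed geometry alone cannot show.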

@kabicm
Collaborator

kabicm commented Jul 17, 2022

Thanks @oschuett for debugging it!

Would it be possible to uncomment those 4 lines from this comment and rerun it? Then we would have all the pdgemm parameters and could run this in isolation.

@oschuett

Voilà: H2O-sos-mp2-lr.txt

@kabicm
Collaborator

kabicm commented Jul 18, 2022

@oschuett Thanks Ole for the output! In the latest commit I have now added the test cases from your output, with exactly the same parameters, so that Simon can run them in isolation.

However, a few things from your file caught my attention:

  1. It seems the error happens within a Cholesky decomposition?
  2. Did you link CP2K to the cosma_prefixed_pxgemm library:
    add_library(cosma_prefixed_pxgemm scalapack.cpp
    or to the cosma_pxgemm library:
    add_library(cosma_pxgemm scalapack.cpp

The difference is that cosma_prefixed_pxgemm only implements the ScaLAPACK routines with the "cosma_" prefix, i.e. cosma_pdgemm, cosma_psgemm and the complex versions. cosma_pxgemm, on the other hand, implements the prefixed versions and additionally overrides the default ScaLAPACK routines.

Since CP2K calls the cosma_pdgemm and cosma_psgemm routines anyway, I think you should link to cosma_prefixed_pxgemm instead of cosma_pxgemm. This way, COSMA will not be used in the Cholesky decomposition.
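
As an illustration only (not part of CP2K or COSMA, glibc-specific, and assuming the usual trailing-underscore Fortran name mangling), one way to check which library actually provides the plain ScaLAPACK symbol in a given binary is to resolve it at runtime:

    #include <dlfcn.h>
    #include <cstdio>

    // Hypothetical check: ask the dynamic linker which shared object provides
    // the ScaLAPACK symbol pdgemm_ in the running binary. If it resolves to the
    // COSMA pxgemm library, plain ScaLAPACK calls (including those made on other
    // code paths, e.g. Cholesky) are intercepted by COSMA; with
    // cosma_prefixed_pxgemm only the cosma_-prefixed entry points exist and
    // pdgemm_ stays with the ScaLAPACK library.
    int main() {
        void* sym = dlsym(RTLD_DEFAULT, "pdgemm_");
        Dl_info info;
        if (sym != nullptr && dladdr(sym, &info) != 0 && info.dli_fname != nullptr) {
            std::printf("pdgemm_ is provided by %s\n", info.dli_fname);
        } else {
            std::printf("pdgemm_ was not resolved in this binary\n");
        }
        return 0;
    }

(Built with something like g++ check_pdgemm.cpp -ldl on a glibc system, where RTLD_DEFAULT and dladdr are available.)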

@fstein93
Author

All errors occur outside of Cholesky decompositions. In some cases (like lr), a Cholesky decomposition was carried out beforehand, whereas in other cases (like RPA), a Cholesky decomposition follows it. The library test does not perform any kind of Cholesky decomposition. Interestingly, the other library tests for PDGEMM do not fail (see here).

@kabicm
Collaborator

kabicm commented Jul 18, 2022

Thanks @fstein93 for the clarifications! It seems I misunderstood the output then.

I hope Simon will be able to reproduce it by running the newly added tests.

Btw, do we know if export CUDA_LAUNCH_BLOCKING=1 resolves the issue?

@oschuett

Regarding "Did you link cp2k to cosma_prefixed_pxgemm library": you can get the linker line from the regtest report:

LIBS        = -lsirius -lcusolver -lspla -lspfft -lsymspg -lhdf5 -lhdf5_hl -lz -lgsl -lelpa_openmp -lcosma_prefixed_pxgemm -lcosma -lcosta -lTiled-MM -lscalapack -lxsmmf -lxsmm -ldl -lpthread -lxcf03 -lxc -lint2 -lfftw3_mpi -lfftw3 -lfftw3_omp  -lmpifort -lmpicxx -lmpi  -lopenblas -lvori -lstdc++ -lstdc++ -lcudart -lnvrtc -lcuda -lcufft -lcublas -lrt 

@kabicm
Collaborator

kabicm commented Jul 18, 2022

Simon managed to reproduce this error within COSMA; we are working on it!

@kabicm
Collaborator

kabicm commented Jul 20, 2022

@oschuett just a quick question: after you added those print statements, what is on your line
at /home/ole/git/cp2k/tools/toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:475?

I want to see whether the error occurred within round_robin or within round_robin_without_copy_c.

@kabicm
Collaborator

kabicm commented Jul 20, 2022

@oschuett @fstein93 are we sure the same tests were passing with the previous COSMA version, or are these tests new?

@fstein93
Author

@kabicm the tests passed with the previous version. There is only one which I added recently.

@kabicm
Collaborator

kabicm commented Jul 20, 2022

@oschuett @fstein93 the latest master now passes the failing tests from cp2k. Can you try the latest master, or do I have to make a new release so that you can test it?

@fstein93
Author

In general, we only use official releases of all libraries, to ensure that users get properly working builds. That is also how we proceed with DBCSR. Anyway, the fix is probably also relevant for your user base.

@oschuett

You can open a draft pull request in which you have install_cosma.sh use your master branch. Then we can trigger the CI tests.

@kabicm
Collaborator

kabicm commented Jul 21, 2022

We would surely make a new release once we are sure this fixes the failing tests.

@kabicm
Collaborator

kabicm commented Jul 21, 2022

It seems the tests are now passing, at least on Pascal. So I guess we can make a new release now. I will just make a few smaller CMake modifications and then release.

@kabicm
Collaborator

kabicm commented Jul 21, 2022

The new version COSMA-v2.6.1 is now released. Let us know if there are any issues!

@kabicm
Collaborator

kabicm commented Jul 22, 2022

I will close this issue now. Feel free to reopen it if there are any problems with the new version COSMA-v2.6.1.

@kabicm kabicm closed this as completed Jul 22, 2022