Incorrect results with asynchronous partitioning on CUDA devices and STARPU 1.4. #37

Open
grisuthedragon opened this issue Mar 11, 2024 · 15 comments
Labels
question Further information is requested

Comments

@grisuthedragon

It seems that the changes introduced between 1.3 and 1.4 broke asynchronous partitioning. In essence, our code does the following:

starpu_data_partition_plan(...);

execute tasks on the partitioned dataset

starpu_data_partition_clean(...); 

We leave the submit / unsubmit steps to the StarPU runtime. The kernels required by the tasks are available as CPU and CUDA implementations. We observed the following cases:

StarPU 1.3.11 / CUDA 11.8 / GCC 12

  • CPU Only. Everything Correct
  • 64 CPU Cores + 1 CUDA Device: Everything Correct
  • 64 CPU Cores + 4 CUDA Devices: Everything Correct

StarPU 1.3.11 / CUDA 12.2/ GCC 12

  • CPU Only. Everything Correct
  • 64 CPU Cores + 1 CUDA Device: Everything Correct
  • 64 CPU Cores + 4 CUDA Devices: Everything Correct

StarPU 1.4.4 / CUDA 12.2/ GCC 12

  • CPU Only. Everything Correct
  • 64 CPU Cores + 1 CUDA Device: Wrong results.
  • 64 CPU Cores + 4 CUDA Devices: Wrong results.

The tasks are only GEMM operations from cuBLAS or MKL.
Because this is ongoing research, I cannot share the code, and I have not yet had time to build an MWE. In general, though, it seems to be related to https://gitlab.inria.fr/starpu/starpu/-/issues/43.

@grisuthedragon
Author

grisuthedragon commented Mar 21, 2024

I have now written an MWE that performs a GEMM on CPU or GPU.

https://gist.github.com/grisuthedragon/0fa99935086a5945171ef63f185bbcee

With

STARPU_NCUDA=0 STARPU_SCHED=dmdas ./gemm_gpu

it works.

With

STARPU_NCUDA=1 STARPU_SCHED=dmdas ./gemm_gpu

it sometimes gives random errors.

And with

STARPU_NCUDA=4 STARPU_SCHED=dmdas ./gemm_gpu

it fails every time.

Turning off the CPUs also makes the code fail:

STARPU_NCUDA=4 STARPU_SCHED=dmdas STARPU_NCPU=0 ./gemm_gpu

When it fails, the execution usually seems to have slowed down beforehand.

The same holds true for dmda, lws, ... schedulers

GCC: 12
STARPU: 1.4.4
CUDA: 12.2.128 / 4x A100
CPU: AMD Epyc 2x 32 Cores

@sthibaul
Collaborator

Hello,

I tried your MWE, but I'm getting

The codelet <gemm_kernel> defines the access mode 3 for the buffer 2 which is different from the mode 2 given to starpu_task_insert

and indeed the codelet says STARPU_RW for the last argument, while the insert call uses STARPU_W. It is thus no wonder that the computation goes wrong, since the task doesn't declare to StarPU that it wants to read the previous value.
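The two numbers in that message can be read as bit masks. Going by the message and the explanation above, 3 corresponds to STARPU_RW and 2 to STARPU_W, which fits an encoding where read access is bit 0 and write access is bit 1. The sketch below assumes exactly that; the `MODE_*` constants, `mode_name` and `modes_agree` are local, hypothetical stand-ins and are not taken from starpu.h.

```c
#include <assert.h>
#include <stdio.h>

/* Local stand-ins for the assumed access-mode bits (not from starpu.h). */
enum { MODE_R = 1 << 0, MODE_W = 1 << 1, MODE_RW = MODE_R | MODE_W };

/* Translate the numeric mode printed in the error message into a name. */
const char *mode_name(unsigned mode)
{
    /* This sketch only understands the plain read/write bits. */
    assert((mode & ~(unsigned)MODE_RW) == 0);
    switch (mode) {
    case MODE_R:  return "STARPU_R";
    case MODE_W:  return "STARPU_W";
    case MODE_RW: return "STARPU_RW";
    default:      return "no access";
    }
}

/* Return 1 if the codelet and the insert call agree, otherwise print the
 * mismatch and return 0. */
int modes_agree(unsigned codelet_mode, unsigned insert_mode)
{
    if (codelet_mode == insert_mode)
        return 1;
    printf("codelet declares %s (%u), task insert passes %s (%u)\n",
           mode_name(codelet_mode), codelet_mode,
           mode_name(insert_mode), insert_mode);
    return 0;
}
```

Calling `modes_agree(3, 2)` reproduces the situation from the message: the codelet side decodes to STARPU_RW and the insert side to STARPU_W.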

@sthibaul sthibaul added the question Further information is requested label Mar 27, 2024
@grisuthedragon
Author

Sorry, that was a copy-and-paste error. But changing the call in the beta == 1 case to STARPU_RW still gives the same incorrect results. I have updated the gist as well.
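Why the beta == 1 case needs read access to the output can be seen from a plain reference GEMM. The following is a minimal sketch in column-major layout, not the kernel from the gist; the name `gemm_ref` and its argument order are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major reference GEMM: C = alpha * A * B + beta * C,
 * with A of size m x k, B of size k x n and C of size m x n. */
void gemm_ref(size_t m, size_t n, size_t k, double alpha,
              const double *A, size_t lda,
              const double *B, size_t ldb,
              double beta, double *C, size_t ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < m; i++) {
            double acc = 0.0;
            for (size_t l = 0; l < k; l++)
                acc += A[i + l * lda] * B[l + j * ldb];
            /* With beta == 0 the old C is never read, so write-only access
             * would be enough. Any other beta reads the previous C. */
            if (beta == 0.0)
                C[i + j * ldc] = alpha * acc;
            else
                C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```

With beta == 1 the result depends on whatever C held before the call, so a task that only declares write access may be handed a buffer whose previous contents were never transferred.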

Btw. I did not get your error message.

@sthibaul
Collaborator

> I did not get your error message.

If you configure with --enable-fast, you're jumping without a parachute

> changing the call in the beta == 1 case to STARPU_RW it results in the same incorrect results

I do get correct results on a 3-gpu machine, with various schedulers.

@grisuthedragon
Author

I installed StarPU via Spack and have now disabled enable-fast, but that does not change the behavior when using CUDA.

Here is my Spack environment file, spack.yaml:

spack:
  # add package specs to the `specs` list
  specs:
  - gcc@12.3.0+binutils+graphite
  - starpu@1.4.4~mpi+cuda~fast
  - hdf5+hl~mpi
  - cmake
  - intel-oneapi-mkl threads=openmp
  - cuda@12.2
  - gdb
  - hwloc
  view: true
  concretizer:
    unify: true
  packages:
    all:
      compiler:
        - gcc@12.3.0

The code runs on

  • CentOS 7.9
  • 2x AMD EPYC 7452
  • 4x Nvidia A100
  • CUDA 12.2, CUDA driver 535.86.10

@grisuthedragon
Author

I did some additional tests with different BLAS implementations and got the following results:

Intel OneMKL + CUDA 12.2 + GCC12 + StarPU 1.4.4:

  • oneMKL 2021.4.0 - failed
  • oneMKL 2022.2.1 - failed
  • oneMKL 2023.0.0 - failed
  • oneMKL 2024.0.0 - failed
  • oneMKL 2024.1.0 - failed

OpenBLAS 0.3.26 + CUDA 12.2 + GCC12 + StarPU 1.4.4:

  • Very slow, failed

@sthibaul Can you give some more details about your environment?

@sthibaul
Collaborator

I used this source: gemm_gpu.c.txt

with this spec:

spack:
  # add package specs to the `specs` list
  specs:
  - gcc@12.3.0+binutils+graphite
  - starpu@1.4.4~mpi+cuda~fast
  - hdf5+hl~mpi
  - cmake
  - intel-oneapi-mkl threads=openmp
  - cuda@12.3
  - gdb
  - hwloc
  view: true
  concretizer:
    unify: true
  packages:
    all:
      compiler:
      - gcc@12.2.0
    cuda:
      buildable: false
      externals:
      - spec: cuda@12.3
        prefix: /usr/local/cuda-12.3/
  compilers:
  - compiler:
      spec: gcc@=12.2.0
      paths:
        cc: /cm/shared/modules/intel/skylake/compiler/gcc/12.2.0/bin/gcc
        cxx: /cm/shared/modules/intel/skylake/compiler/gcc/12.2.0/bin/g++
        f77: /cm/shared/modules/intel/skylake/compiler/gcc/12.2.0/bin/gfortran
        fc: /cm/shared/modules/intel/skylake/compiler/gcc/12.2.0/bin/gfortran
      flags: {}
      operating_system: centos7
      target: x86_64
      modules: []
      environment: {}
      extra_rpaths: []

compiled with

gcc --std=gnu99 gemm_gpu.c -o gemm_gpu $( pkg-config --cflags starpu-1.4) $(pkg-config --libs starpu-1.4)  -lcublas -lmkl_intel_lp64 -lmkl_sequential -lmkl_core

ran with

STARPU_WORKER_STATS=1 STARPU_SCHED=dmdas ./gemm_gpu

On CentOS 7.6.1810, with two GPUs, without any error.

I tried adding A[0]++ in the CPU codelet to make sure that errors get caught, and they do.
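A result check of the kind that would catch such a perturbation can be sketched as a comparison against a reference result. This is an illustration only: the helper name `max_abs_diff` and any tolerance applied to its return value are assumptions, not taken from the MWE.

```c
#include <assert.h>
#include <stddef.h>

/* Largest absolute element-wise difference between two arrays of length n.
 * A NaN difference is returned immediately, so that a caller's
 * "result <= tolerance" test fails instead of silently passing. */
double max_abs_diff(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - y[i];
        if (d < 0.0)
            d = -d;
        if (d != d)      /* true only for NaN */
            return d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

A caller would compute the product once on a trusted path, once through the runtime, and treat a value above a small tolerance as a failure.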

Note: the MKL/OpenBLAS library probably doesn't matter, since you said the issues appear when adding GPUs. You can even try with STARPU_NCPU=0 to rule out the CPU side.

@grisuthedragon
Author

I looked further into what happens and upgraded to CUDA 12.4 to match your environment. I also got hold of an older system with two P100 cards instead of two to four A100 cards, and no error appears there.

Back on the A100 system, the following run reports no errors:

STARPU_NCUDA=1 STARPU_SCHED=dmdas compute-sanitizer --tool initcheck ./gemm_gpu
========= COMPUTE-SANITIZER
Start... 
[starpu][starpu_interface_end_driver_copy_async] Warning: the submission of asynchronous transfer from NUMA 0 to CUDA 0 took a very long time (2.470755 ms)
For proper asynchronous transfer overlapping, data registered to StarPU must be allocated with starpu_malloc() or pinned with starpu_memory_pin()
Time: 4.59802
GFlops: 54.3712
========= ERROR SUMMARY: 0 errors

but running with

STARPU_NCUDA=2 STARPU_SCHED=dmdas compute-sanitizer --tool initcheck ./gemm_gpu

I get dozens of errors like

========= Uninitialized __global__ memory read of size 8 bytes
=========     at void cutlass::Kernel2<cutlass_80_tensorop_d884gemm_32x32_16x5_nn_align1>(T1::Params)+0xe00
=========     by thread (67,0,0) in block (0,0,0)
=========     Address 0x2ad0ba000fb8
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x32e94f]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x19758dc]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
=========     Host Frame: [0x13c9924]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
=========     Host Frame: [0x8d5e70]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
=========     Host Frame: [0xa10a0e]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
=========     Host Frame:cublasLtDDDMatmul [0xa2c440]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
=========     Host Frame: [0x86acf4]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublas.so.12
=========     Host Frame: [0x86d1b5]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublas.so.12
=========     Host Frame: [0xb322c4]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublas.so.12
=========     Host Frame:cublasDgemm_v2 [0x2cf279]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublas.so.12
=========     Host Frame:cublas_mult in /mechthild/home/koehlerm/work/software/starputests/gemm_gpu.c:63 [0x15a5]
=========                in /mechthild/home/koehlerm/work/software/starputests/./gemm_gpu
=========     Host Frame:execute_job_on_cuda in drivers/cuda/driver_cuda.c:2009 [0x10e602]
=========                in /mechthild/home/koehlerm/spack/var/spack/environments/gcc-cuda-mkl2024-1-0/.spack-env/view/lib/libstarpu-1.4.so.4
=========     Host Frame:_starpu_cuda_driver_run_once in drivers/cuda/driver_cuda.c:2160 [0x10ed53]
=========                in /mechthild/home/koehlerm/spack/var/spack/environments/gcc-cuda-mkl2024-1-0/.spack-env/view/lib/libstarpu-1.4.so.4
=========     Host Frame:_starpu_cuda_worker in drivers/cuda/driver_cuda.c:2325 [0x10f710]
=========                in /mechthild/home/koehlerm/spack/var/spack/environments/gcc-cuda-mkl2024-1-0/.spack-env/view/lib/libstarpu-1.4.so.4
=========     Host Frame:start_thread [0x7ea4]
=========                in /lib64/libpthread.so.0
=========     Host Frame:clone [0xfe9fc]
=========                in /lib64/libc.so.6

The reason seems to be that on the A100 cards CUTLASS kernels are used for the GEMM operations, while on the P100 they are not.

@grisuthedragon
Author

grisuthedragon commented May 29, 2024

I updated my installation to StarPU 1.4.6 and CUDA 12.5 and ran the "dgemm" example from examples/mult on my 4x A100 machine. This way, the error is independent of my code.

Now the following errors appear:

  • 1 GPU (correct in 10 of 10 runs)
$ STARPU_SCHED=dmdas STARPU_NCUDA=1 compute-sanitizer --tool initcheck ./dgemm 
========= COMPUTE-SANITIZER
# x	y	z	ms	GFlop/s
3840	3840	3840	1327	85.3
========= ERROR SUMMARY: 0 errors
  • 2 GPUs (correct in 10 of 10 runs)
$ STARPU_SCHED=dmdas STARPU_NCUDA=2 compute-sanitizer --tool initcheck ./dgemm 
========= COMPUTE-SANITIZER
# x	y	z	ms	GFlop/s
3840	3840	3840	864	131.1
========= ERROR SUMMARY: 0 errors
  • 3 GPUs (random crashes, in 5 of 10 runs)
$ STARPU_SCHED=dmdas STARPU_NCUDA=3 compute-sanitizer --tool initcheck ./dgemm 
========= COMPUTE-SANITIZER
# x	y	z	ms	GFlop/s
[starpu][starpu_cuda_report_error] Error: oops in _starpu_cuda_copy_interface_from_cuda_to_cpu (drivers/cuda/driver_cuda.c:1680)... 719: unspecified launch failure 

[starpu][starpu_cublas_report_error] oops in cublas_mult (mult/xgemm.c:147)... 13: execution failed 
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(starpu_cublas_report_error+0x79)[0x2abcf1959b79]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x10f173)[0x2abcf195d173]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(_starpu_cuda_driver_run_once+0x304)[0x2abcf195d7e4]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x1101a1)[0x2abcf195e1a1]
/lib64/libpthread.so.0(+0x7ea5)[0x2abcf1a3fea5]
/lib64/libc.so.6(clone+0x6d)[0x2abd10ee09fd]
[starpu][abort][starpu_cublas_report_error()@drivers/cuda/driver_cuda.c:2488]
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors

  • 4 GPUs (always crashes)
$ STARPU_SCHED=dmdas STARPU_NCUDA=4 compute-sanitizer --tool initcheck ./dgemm
========= COMPUTE-SANITIZER
# x	y	z	ms	GFlop/s
[starpu][starpu_cuda_report_error] Error: oops in _starpu_cuda_copy_interface_from_cuda_to_cpu (drivers/cuda/driver_cuda.c:1680)... 719: unspecified launch failure 

/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(starpu_cuda_report_error+0x7b)[0x2b1ce4d41c9b]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x10e8a3)[0x2b1ce4d448a3]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xadae5)[0x2b1ce4ce3ae5]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae624)[0x2b1ce4ce4624]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae836)[0x2b1ce4ce4836]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae9b9)[0x2b1ce4ce49b9]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xaeaa5)[0x2b1ce4ce4aa5]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(_starpu_cuda_driver_run_once+0x8ca)[0x2b1ce4d45daa]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x1101a1)[0x2b1ce4d461a1]
/lib64/libpthread.so.0(+0x7ea5)[0x2b1ce4e27ea5]
/lib64/libc.so.6(clone+0x6d)[0x2b1d042c89fd]
[starpu][abort][starpu_cuda_report_error()@drivers/cuda/driver_cuda.c:2494]


[starpu][starpu_cuda_report_error] Error: oops in _starpu_cuda_test_request_completion (drivers/cuda/driver_cuda.c:1547)... 719: unspecified launch failure 


/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(starpu_cuda_report_error+0x7b)[0x2b1ce4d41c9b]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x10e40a)[0x2b1ce4d4440a]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xac4f2)[0x2b1ce4ce24f2]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae7dd)[0x2b1ce4ce47dd]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae98d)[0x2b1ce4ce498d]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xaeaa5)[0x2b1ce4ce4aa5]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(_starpu_cuda_driver_run_once+0x8ca)[0x2b1ce4d45daa]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x1101a1)[0x2b1ce4d461a1]
/lib64/libpthread.so.0(+0x7ea5)[0x2b1ce4e27ea5]
/lib64/libc.so.6(clone+0x6d)[0x2b1d042c89fd]
[starpu][abort][starpu_cuda_report_error()@drivers/cuda/driver_cuda.c:2494]
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors

or

========= COMPUTE-SANITIZER
# x	y	z	ms	GFlop/s
[starpu][starpu_cublas_report_error] oops in cublas_mult (mult/xgemm.c:147)... 13: execution failed 
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(starpu_cublas_report_error+0x79)[0x2ab39dcdcb79]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x10f173)[0x2ab39dce0173]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(_starpu_cuda_driver_run_once+0x304)[0x2ab39dce07e4]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x1101a1)[0x2ab39dce11a1]
/lib64/libpthread.so.0(+0x7ea5)[0x2ab39ddc2ea5]
/lib64/libc.so.6(clone+0x6d)[0x2ab3bd2639fd]
[starpu][abort][starpu_cublas_report_error()@drivers/cuda/driver_cuda.c:2488]
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors
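The GFlop/s column in the successful 1-GPU and 2-GPU runs above is consistent with the usual 2·x·y·z flop count for a GEMM divided by the elapsed time. The sketch below recomputes it; the function name `gemm_gflops` is hypothetical, and the formula is an assumption inferred from the printed numbers rather than copied from the example's source.

```c
#include <assert.h>

/* Rate in GFlop/s for an x * y * z GEMM that took elapsed_ms milliseconds,
 * counting one multiply and one add per inner-product term. */
double gemm_gflops(double x, double y, double z, double elapsed_ms)
{
    assert(elapsed_ms > 0.0);
    double flop = 2.0 * x * y * z;
    return flop / (elapsed_ms * 1e-3) / 1e9;
}
```

For the 3840 x 3840 x 3840 case, 1327 ms gives roughly 85.3 GFlop/s and 864 ms roughly 131.1 GFlop/s, matching the two output lines.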

@sthibaul
Collaborator

I fixed closely related cases yesterday in 3b258cb620de7610f0b6fadaae959f1e173f0e34 ("Fix asynchronous partitioning with data without home node"). Could you check against that version?

@grisuthedragon
Author

I tried the dgemm example again on my hardware and still get errors with more than 2 A100 cards, though seemingly with a lower probability.

When an error occurs, it looks like:

$ STARPU_SCHED=dmdas STARPU_NCUDA=3 compute-sanitizer --tool initcheck ./dgemm
========= COMPUTE-SANITIZER
# x	y	z	ms	GFlop/s
[starpu][starpu_interface_end_driver_copy_async_devid] Warning: the submission of asynchronous transfer from NUMA 0 to CUDA 1 took a very long time (40.732033 ms)
For proper asynchronous transfer overlapping, data registered to StarPU must be allocated with starpu_malloc() or pinned with starpu_memory_pin()


[starpu][starpu_cuda_report_error] Error: oops in _starpu_cuda_malloc_on_device (drivers/cuda/driver_cuda.c:1227)... 719: unspecified launch failure 


/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(starpu_cuda_report_error+0x7b)[0x2b222804892b]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(+0x12348a)[0x2b222804a48a]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(+0xb9081)[0x2b2227fe0081]


[starpu][starpu_cuda_report_error] Error: oops in _starpu_cuda_test_request_completion (drivers/cuda/driver_cuda.c:1594)... 719: unspecified launch failure 


/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(+0x/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(starpu_cuda_report_error+0x7b)[0x2b222804892b]
cea31/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(+0x123eba)[0x2b222804aeba]
)/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1[0x2b2227ff5a31]
(+0xae952)[0x2b2227fd5952]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(+0xc028d)[0x2b2227fe728d/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1]
(+0xb0c3d)[0x2b2227fd7c3d]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1+0xb1b1f)(+0xb0e19)[0x2b2227fd7e19]
[0x/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.12b2227fd8b1f]
(+0xb0f05)[0x2b2227fd7f05/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1]
(/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1+0xaff45(_starpu_cuda_driver_run_once+0x92a)[0x2b222804ca3a]
)/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1[0x2b2227fd6f45]
(+0x125e51)/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1[0x2b222804ce51]
(/lib64/libpthread.so.0(+0x7ea5)[0x2b22285a4ea5]
+0xb0a84)/lib64/libc.so.6[0x2b2227fd7a84]
(clone+0x6d)[0x2b2247a459fd]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1[starpu][abort][starpu_cuda_report_error()@drivers/cuda/driver_cuda.c:2590]
(+0xb0c96)[0x2b2227fd7c96]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(+0xb0ded)[0x========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors

or

========= COMPUTE-SANITIZER
# x	y	z	ms	GFlop/s
[starpu][starpu_interface_end_driver_copy_async_devid] Warning: the submission of asynchronous transfer from NUMA 0 to CUDA 1 took a very long time (42.814927 ms)
For proper asynchronous transfer overlapping, data registered to StarPU must be allocated with starpu_malloc() or pinned with starpu_memory_pin()
[starpu][starpu_cublas_report_error] oops in cublas_mult (mult/xgemm.c:147)... 13: execution failed 
[starpu][starpu_cublas_report_error] oops in cublas_mult (mult/xgemm.c:147)... 13: execution failed 
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(starpu_cublas_report_error+0x79)[0x/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(starpu_cublas_report_error+0x792b79b0ae9809]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1)[0x(+0x124d71)[0x2b79b0aecd71]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.12b79b0ae9809]
(_starpu_cuda_driver_run_once+0x314)/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1[0x(+0x2b79b0aed424]
124d71/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(+0x125e51))[0x[0x2b79b0aede51]
2b79b0aecd71/lib64/libpthread.so.0]
(+0x7ea5)[0x/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.12b79b1045ea5]
(_starpu_cuda_driver_run_once+0x314/lib64/libc.so.6(clone+0x6d)[0x2b79d04e69fd]
[starpu][abort][starpu_cublas_report_error()@drivers/cuda/driver_cuda.c:2584]
)[0x2b79b0aed424]
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors

The 3-GPU case in particular still seems to be affected by this problem.

As for my own GEMM code posted above, it still fails, and in addition to the errors above the following appears as well:

========= Host API memory access error at host access to 0x2abccfb68700 of size 256 bytes
=========     Uninitialized access at 0x2abccfb687c0 on access by cudaMemcpy source
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x327a82]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x48128]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.5.0-zutngkrw2qbccmlshetbbbcbvbykqvgc/lib64/libcudart.so.12
=========     Host Frame: [0x19ae1]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.5.0-zutngkrw2qbccmlshetbbbcbvbykqvgc/lib64/libcudart.so.12
=========     Host Frame:cudaMemcpy2DAsync [0x72d56]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.5.0-zutngkrw2qbccmlshetbbbcbvbykqvgc/lib64/libcudart.so.12
=========     Host Frame:starpu_cuda_copy2d_async_sync_devid in drivers/cuda/driver_cuda.c:1450 [0x123ad9]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:_starpu_cuda_copy2d_data_from_cuda_to_cpu in drivers/cuda/driver_cuda.c:1825 [0x123c97]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:copy_any_to_any in datawizard/interfaces/matrix_interface.c:586 [0xce89a]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:_starpu_cuda_copy_interface_from_cuda_to_cpu in drivers/cuda/driver_cuda.c:1726 [0x124494]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:__starpu_handle_node_data_requests in datawizard/data_request.c:744 [0xaff44]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:_starpu_handle_node_data_requests in datawizard/data_request.c:806 [0xb0a83]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:____starpu_datawizard_progress in datawizard/datawizard.c:52 [0xb0c95]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:___starpu_datawizard_progress in datawizard/datawizard.c:105 [0xb0e18]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:__starpu_datawizard_progress in datawizard/datawizard.c:149 [0xb0f04]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:_starpu_cuda_driver_run_once in drivers/cuda/driver_cuda.c:2449 [0x125558]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:_starpu_cuda_worker in drivers/cuda/driver_cuda.c:2524 [0x125e50]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:start_thread [0x7ea4]
=========                in /lib64/libpthread.so.0
=========     Host Frame:clone [0xfe9fc]
=========                in /lib64/libc.so.6

@sthibaul
Collaborator

> 719: unspecified launch failure

This is a very questionable error to be reported... Could you try with compute-sanitizer --tool memcheck, to make sure that we are not disturbing the allocator? If we are not, the failure reported here doesn't seem to be StarPU's fault, but somehow a CUDA stack issue...

@sthibaul
Collaborator

I've been running dgemm in a loop with 4 A100 cards on CUDA 12.0 here, with no error for half an hour.

@sthibaul
Collaborator

Did you try without compute-sanitizer --tool initcheck? With it I'm seeing weird reports.

Even on tests as simple as tests/datawizard/lazy_allocation.c, it reports an initialization error when transferring from one GPU to another, while tracing shows that the initialization kernel had really completed before the transfer was queued.

@grisuthedragon
Author

With the current master, the dgemm example works, but my asynchronously partitioned one ends with

$ STARPU_SCHED=dmdas STARPU_NCUDA=1 ./gemm_gpu
Start... 
[starpu][starpu_interface_end_driver_copy_async_devid] Warning: the submission of asynchronous transfer from NUMA 0 to CUDA 0 took a very long time (0.398636 ms)
For proper asynchronous transfer overlapping, data registered to StarPU must be allocated with starpu_malloc() or pinned with starpu_memory_pin()
Time: 1.19607
GFlops: 1.67215
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-arhfplcvco4r33a2cx7igizzxmdmhh7a/lib/libstarpu-1.4.so.1(+0xa9430)[0x2acdfba4e430]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-arhfplcvco4r33a2cx7igizzxmdmhh7a/lib/libstarpu-1.4.so.1(+0xaa674)[0x2acdfba4f674]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-arhfplcvco4r33a2cx7igizzxmdmhh7a/lib/libstarpu-1.4.so.1(starpu_memchunk_tidy+0x67e)[0x2acdfba6408e]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-arhfplcvco4r33a2cx7igizzxmdmhh7a/lib/libstarpu-1.4.so.1(+0xb0db2)[0x2acdfba55db2]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-arhfplcvco4r33a2cx7igizzxmdmhh7a/lib/libstarpu-1.4.so.1(+0xb0f79)[0x2acdfba55f79]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-arhfplcvco4r33a2cx7igizzxmdmhh7a/lib/libstarpu-1.4.so.1(+0xb1065)[0x2acdfba56065]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-arhfplcvco4r33a2cx7igizzxmdmhh7a/lib/libstarpu-1.4.so.1(_starpu_cuda_driver_run_once+0x449)[0x2acdfbacaa19]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-arhfplcvco4r33a2cx7igizzxmdmhh7a/lib/libstarpu-1.4.so.1(+0x126311)[0x2acdfbacb311]
/lib64/libpthread.so.0(+0x7ea5)[0x2acdfbbb6ea5]
/lib64/libc.so.6(clone+0x6d)[0x2ace22c1b9fd]

[starpu][_starpu_select_src_node][assert failure] The data for the handle 0x385d040 is requested, but the handle does not have a valid value. Perhaps some initialization task is missing?

gemm_gpu: datawizard/coherency.c:69: _starpu_select_src_node: Assertion `0 && "src_node_mask != 0"' failed.
Aborted (core dumped)

using only one GPU. With 2, 3, or 4 GPUs the error stays the same.
