Incorrect results with asynchronous partitioning on CUDA devices and STARPU 1.4. #37

grisuthedragon · 2024-03-11T09:34:38Z

It seems that during the updates introduces between 1.3 and 1.4, the asynchronous partitioning is broken. In basic, we have a code

starpu_data_partion_plan(....) ; 

execute tasks on the partioned dataset

starpu_data_partition_clean(...);

The submit / unsubmit we leave to the STARPU runtime. The kernels required for the computing the task are available as CPU and CUDA implementation. Now we observed the following cases.

StarPU 1.3.11 / CUDA 11.8 / GCC 12

CPU Only. Everything Correct
64 CPU Cores + 1 CUDA Device: Everything Correct
64 CPU Cores + 4 CUDA Devices: Everything Correct

StarPU 1.3.11 / CUDA 12.2/ GCC 12

CPU Only. Everything Correct
64 CPU Cores + 1 CUDA Device: Everything Correct
64 CPU Cores + 4 CUDA Devices: Everything Correct

StarPU 1.4.4 / CUDA 12.2/ GCC 12

CPU Only. Everything Correct
64 CPU Cores + 1 CUDA Device: Wrong results.
64 CPU Cores + 4 CUDA Devices: Wrong results.

The tasks are only gemm operations from CUBLAS or MKL.
Due to ongoing research, I could not share the code and does not have time to build an MWE til now. But in general it seems to have something in common with https://gitlab.inria.fr/starpu/starpu/-/issues/43.

The text was updated successfully, but these errors were encountered:

grisuthedragon · 2024-03-21T14:18:49Z

So I could write a MWE example performing a GEMM on CPU oder GPU.

https://gist.github.com/grisuthedragon/0fa99935086a5945171ef63f185bbcee

With

STAPU_NCUDA=0 STARPU_SCHED=dmdas ./gemm_gpu

it works.

With

STAPU_NCUDA=1 STARPU_SCHED=dmdas ./gemm_gpu

it gives sometimes random errors.

And with

STAPU_NCUDA=4 STARPU_SCHED=dmdas ./gemm_gpu

permanently fails.

Also turning off the CPUs let the code fail:

STAPU_NCUDA=4 STARPU_SCHED=dmdas STARPU_NCPU=0 ./gemm_gpu

If it fails, it seems that the execution mostly got slow before.

The same holds true for dmda, lws, ... schedulers

GCC: 12
STARPU: 1.4.4
CUDA: 12.2.128 / 4x A100
CPU: AMD Epyc 2x 32 Cores

sthibaul · 2024-03-27T17:03:44Z

Hello,

I tried your MWE, but I'm getting

The codelet <gemm_kernel> defines the access mode 3 for the buffer 2 which is different from the mode 2 given to starpu_task_insert

and indeed the codelet says STARPU_RW for the last argument, while the insert call is STARPU_W (and thus not a wonder that computation is getting wrong since it doesn't declare to starpu that it wants to read the previous value)

grisuthedragon · 2024-03-27T18:43:49Z

Sorry, that was a copy and paste error. But changing the call in the beta == 1 case to STARPU_RW it results in the same incorrect results. I updated the gist as well.

Btw. I did not get your error message.

sthibaul · 2024-03-27T19:08:26Z

I did not get your error message.

If you configure with --enable-fast, you're jumping without a parachute

changing the call in the beta == 1 case to STARPU_RW it results in the same incorrect results

I do get correct results on a 3-gpu machine, with various schedulers.

grisuthedragon · 2024-03-28T13:35:55Z

I installed StarPU via Spack and disabled enable-fast by now, but it does not change the behavior when using cuda.

Here is my environment file for spack: spack.yaml:

spack:
  # add package specs to the `specs` list
  specs:
  - gcc@12.3.0+binutils+graphite
  - starpu@1.4.4~mpi+cuda~fast
  - hdf5+hl~mpi
  - cmake
  - intel-oneapi-mkl threads=openmp
  - cuda@12.2
  - gdb
  - hwloc
  view: true
  concretizer:
    unify: true
  packages:
    all:
      compiler:
        - gcc@12.3.0

The code runs on

CentOS 7.9
2x AMD Epyc AMD EPYC 7452
4x Nvidia A100
CUDA 12.2 , CUDA Driver: 535.86.10

grisuthedragon · 2024-04-22T12:29:57Z

I did some additional test with varying BLAS implementations and got the following results

Intel OneMKL + CUDA 12.2 + GCC12 + StarPU 1.4.4:

oneMKL 2021.4.0 - failed
oneMKL 2022.2.1 - failed
oneMKL 2023.0.0 - failed
oneMKL 2024.0.0 - failed
oneMKL 2024.1.0 - failed

OpenBLAS 0.3.26 + CUDA 12.2 + GCC12 + StarPU 1.4.4:

Very slow, failed

@sthibaul Can you give some more details about your environment?

sthibaul · 2024-04-23T17:09:33Z

I used this source: @
gemm_gpu.c.txt

with this spec:

spack:
  # add package specs to the `specs` list
  specs:
  - gcc@12.3.0+binutils+graphite
  - starpu@1.4.4~mpi+cuda~fast
  - hdf5+hl~mpi
  - cmake
  - intel-oneapi-mkl threads=openmp
  - cuda@12.3
  - gdb
  - hwloc
  view: true
  concretizer:
    unify: true
  packages:
    all:
      compiler:
      - gcc@12.2.0
    cuda:
      buildable: false
      externals:
      - spec: cuda@12.3
        prefix: /usr/local/cuda-12.3/
  compilers:
  - compiler:
      spec: gcc@=12.2.0
      paths:
        cc: /cm/shared/modules/intel/skylake/compiler/gcc/12.2.0/bin/gcc
        cxx: /cm/shared/modules/intel/skylake/compiler/gcc/12.2.0/bin/g++
        f77: /cm/shared/modules/intel/skylake/compiler/gcc/12.2.0/bin/gfortran
        fc: /cm/shared/modules/intel/skylake/compiler/gcc/12.2.0/bin/gfortran
      flags: {}
      operating_system: centos7
      target: x86_64
      modules: []
      environment: {}
      extra_rpaths: []

compiled with

gcc --std=gnu99 gemm_gpu.c -o gemm_gpu $( pkg-config --cflags starpu-1.4) $(pkg-config --libs starpu-1.4)  -lcublas -lmkl_intel_lp64 -lmkl_sequential -lmkl_core

ran with

STARPU_WORKER_STATS=1 STARPU_SCHED=dmdas ./gemm_gpu

On Centos 7.6.1810, with two gpus, without any error.

I tried to add A[0]++ in the cpu codelet to make sure that errors get catched, and they do.

Note: the mkl/openblas library probably doesn't matter since you said it was when adding gpus that you had issues. You can even try with STARPU_NCPU=0 to rule out the cpu question.

grisuthedragon · 2024-04-29T10:52:28Z

I further look what happens and I upgraded to CUDA 12.4 to match your environment. I further organized an older system with two P100 instead of two to four A100 cards and there no error appears.

Back to the A100 system I get...
Running

STARPU_NCUDA=1 STARPU_SCHED=dmdas compute-sanitizer --tool initcheck ./gemm_gpu
========= COMPUTE-SANITIZER
Start... 
[starpu][starpu_interface_end_driver_copy_async] Warning: the submission of asynchronous transfer from NUMA 0 to CUDA 0 took a very long time (2.470755 ms)
For proper asynchronous transfer overlapping, data registered to StarPU must be allocated with starpu_malloc() or pinned with starpu_memory_pin()
Time: 4.59802
GFlops: 54.3712
========= ERROR SUMMARY: 0 errors

but running with

STARPU_NCUDA=2 STARPU_SCHED=dmdas compute-sanitizer --tool initcheck ./gemm_gpu

I get dozens of errors like

========= Uninitialized __global__ memory read of size 8 bytes
=========     at void cutlass::Kernel2<cutlass_80_tensorop_d884gemm_32x32_16x5_nn_align1>(T1::Params)+0xe00
=========     by thread (67,0,0) in block (0,0,0)
=========     Address 0x2ad0ba000fb8
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x32e94f]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x19758dc]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
=========     Host Frame: [0x13c9924]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
=========     Host Frame: [0x8d5e70]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
=========     Host Frame: [0xa10a0e]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
=========     Host Frame:cublasLtDDDMatmul [0xa2c440]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
=========     Host Frame: [0x86acf4]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublas.so.12
=========     Host Frame: [0x86d1b5]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublas.so.12
=========     Host Frame: [0xb322c4]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublas.so.12
=========     Host Frame:cublasDgemm_v2 [0x2cf279]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublas.so.12
=========     Host Frame:cublas_mult in /mechthild/home/koehlerm/work/software/starputests/gemm_gpu.c:63 [0x15a5]
=========                in /mechthild/home/koehlerm/work/software/starputests/./gemm_gpu
=========     Host Frame:execute_job_on_cuda in drivers/cuda/driver_cuda.c:2009 [0x10e602]
=========                in /mechthild/home/koehlerm/spack/var/spack/environments/gcc-cuda-mkl2024-1-0/.spack-env/view/lib/libstarpu-1.4.so.4
=========     Host Frame:_starpu_cuda_driver_run_once in drivers/cuda/driver_cuda.c:2160 [0x10ed53]
=========                in /mechthild/home/koehlerm/spack/var/spack/environments/gcc-cuda-mkl2024-1-0/.spack-env/view/lib/libstarpu-1.4.so.4
=========     Host Frame:_starpu_cuda_worker in drivers/cuda/driver_cuda.c:2325 [0x10f710]
=========                in /mechthild/home/koehlerm/spack/var/spack/environments/gcc-cuda-mkl2024-1-0/.spack-env/view/lib/libstarpu-1.4.so.4
=========     Host Frame:start_thread [0x7ea4]
=========                in /lib64/libpthread.so.0
=========     Host Frame:clone [0xfe9fc]
=========                in /lib64/libc.so.6

The reason seems that on the A100 cards, cutlass is used in GEMM operations, on P100 not.

grisuthedragon · 2024-05-29T09:39:04Z

I updated my installation to StarPU 1.4.6 and CUDA 12.5. and run the "dgemm" example from examples/mult on my 4x A100 machine. In this way, the error gets independent of my code.

Now the following errors appear

1 GPU (correct in 10 of 10 runs)

$ STARPU_SCHED=dmdas STARPU_NCUDA=1 compute-sanitizer --tool initcheck ./dgemm 
========= COMPUTE-SANITIZER
# x	y	z	ms	GFlop/s
3840	3840	3840	1327	85.3
========= ERROR SUMMARY: 0 errors

2 GPUs (correct in 10 of 10 runs)

$ STARPU_SCHED=dmdas STARPU_NCUDA=2 compute-sanitizer --tool initcheck ./dgemm 
========= COMPUTE-SANITIZER
# x	y	z	ms	GFlop/s
3840	3840	3840	864	131.1
========= ERROR SUMMARY: 0 errors

3 GPUs (random crashes, in 5 of 10 runs)

$ STARPU_SCHED=dmdas STARPU_NCUDA=3 compute-sanitizer --tool initcheck ./dgemm 
========= COMPUTE-SANITIZER
# x	y	z	ms	GFlop/s
[starpu][starpu_cuda_report_error] Error: oops in _starpu_cuda_copy_interface_from_cuda_to_cpu (drivers/cuda/driver_cuda.c:1680)... 719: unspecified launch failure 

[starpu][starpu_cublas_report_error] oops in cublas_mult (mult/xgemm.c:147)... 13: execution failed 
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(starpu_cublas_report_error+0x79)[0x2abcf1959b79]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x10f173)[0x2abcf195d173]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(_starpu_cuda_driver_run_once+0x304)[0x2abcf195d7e4]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x1101a1)[0x2abcf195e1a1]
/lib64/libpthread.so.0(+0x7ea5)[0x2abcf1a3fea5]
/lib64/libc.so.6(clone+0x6d)[0x2abd10ee09fd]
[starpu][abort][starpu_cublas_report_error()@drivers/cuda/driver_cuda.c:2488]
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors

4 GPUs (always crashes)

$ STARPU_SCHED=dmdas STARPU_NCUDA=4 compute-sanitizer --tool initcheck ./dgemm
========= COMPUTE-SANITIZER
# x	y	z	ms	GFlop/s
[starpu][starpu_cuda_report_error] Error: oops in _starpu_cuda_copy_interface_from_cuda_to_cpu (drivers/cuda/driver_cuda.c:1680)... 719: unspecified launch failure 

/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(starpu_cuda_report_error+0x7b)[0x2b1ce4d41c9b]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x10e8a3)[0x2b1ce4d448a3]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xadae5)[0x2b1ce4ce3ae5]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae624)[0x2b1ce4ce4624]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae836)[0x2b1ce4ce4836]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae9b9)[0x2b1ce4ce49b9]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xaeaa5)[0x2b1ce4ce4aa5]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(_starpu_cuda_driver_run_once+0x8ca)[0x2b1ce4d45daa]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x1101a1)[0x2b1ce4d461a1]
/lib64/libpthread.so.0(+0x7ea5)[0x2b1ce4e27ea5]
/lib64/libc.so.6(clone+0x6d)[0x2b1d042c89fd]
[starpu][abort][starpu_cuda_report_error()@drivers/cuda/driver_cuda.c:2494]


[starpu][starpu_cuda_report_error] Error: oops in _starpu_cuda_test_request_completion (drivers/cuda/driver_cuda.c:1547)... 719: unspecified launch failure 


/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(starpu_cuda_report_error+0x7b)[0x2b1ce4d41c9b]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x10e40a)[0x2b1ce4d4440a]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xac4f2)[0x2b1ce4ce24f2]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae7dd)[0x2b1ce4ce47dd]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae98d)[0x2b1ce4ce498d]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xaeaa5)[0x2b1ce4ce4aa5]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(_starpu_cuda_driver_run_once+0x8ca)[0x2b1ce4d45daa]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x1101a1)[0x2b1ce4d461a1]
/lib64/libpthread.so.0(+0x7ea5)[0x2b1ce4e27ea5]
/lib64/libc.so.6(clone+0x6d)[0x2b1d042c89fd]
[starpu][abort][starpu_cuda_report_error()@drivers/cuda/driver_cuda.c:2494]
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors

or

========= COMPUTE-SANITIZER
# x	y	z	ms	GFlop/s
[starpu][starpu_cublas_report_error] oops in cublas_mult (mult/xgemm.c:147)... 13: execution failed 
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(starpu_cublas_report_error+0x79)[0x2ab39dcdcb79]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x10f173)[0x2ab39dce0173]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(_starpu_cuda_driver_run_once+0x304)[0x2ab39dce07e4]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x1101a1)[0x2ab39dce11a1]
/lib64/libpthread.so.0(+0x7ea5)[0x2ab39ddc2ea5]
/lib64/libc.so.6(clone+0x6d)[0x2ab3bd2639fd]
[starpu][abort][starpu_cublas_report_error()@drivers/cuda/driver_cuda.c:2488]
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors

sthibaul · 2024-08-29T11:55:19Z

I have fixed very related cases yesterday with 3b258cb620de7610f0b6fadaae959f1e173f0e34 ("Fix asynchronous partitioning with data without home node"), could you check against that version?

grisuthedragon · 2024-08-29T13:15:31Z

I tried the dgemm example again on my hardware and still get errors with more than 2 A100 cards, but with a lower probability as it seems.

If an error occur it looks like:

$ STARPU_SCHED=dmdas STARPU_NCUDA=3 compute-sanitizer --tool initcheck ./dgemm
========= COMPUTE-SANITIZER
# x	y	z	ms	GFlop/s
[starpu][starpu_interface_end_driver_copy_async_devid] Warning: the submission of asynchronous transfer from NUMA 0 to CUDA 1 took a very long time (40.732033 ms)
For proper asynchronous transfer overlapping, data registered to StarPU must be allocated with starpu_malloc() or pinned with starpu_memory_pin()


[starpu][starpu_cuda_report_error] Error: oops in _starpu_cuda_malloc_on_device (drivers/cuda/driver_cuda.c:1227)... 719: unspecified launch failure 


/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(starpu_cuda_report_error+0x7b)[0x2b222804892b]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(+0x12348a)[0x2b222804a48a]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(+0xb9081)[0x2b2227fe0081]


[starpu][starpu_cuda_report_error] Error: oops in _starpu_cuda_test_request_completion (drivers/cuda/driver_cuda.c:1594)... 719: unspecified launch failure 


/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(+0x/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(starpu_cuda_report_error+0x7b)[0x2b222804892b]
cea31/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(+0x123eba)[0x2b222804aeba]
)/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1[0x2b2227ff5a31]
(+0xae952)[0x2b2227fd5952]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(+0xc028d)[0x2b2227fe728d/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1]
(+0xb0c3d)[0x2b2227fd7c3d]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1+0xb1b1f)(+0xb0e19)[0x2b2227fd7e19]
[0x/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.12b2227fd8b1f]
(+0xb0f05)[0x2b2227fd7f05/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1]
(/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1+0xaff45(_starpu_cuda_driver_run_once+0x92a)[0x2b222804ca3a]
)/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1[0x2b2227fd6f45]
(+0x125e51)/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1[0x2b222804ce51]
(/lib64/libpthread.so.0(+0x7ea5)[0x2b22285a4ea5]
+0xb0a84)/lib64/libc.so.6[0x2b2227fd7a84]
(clone+0x6d)[0x2b2247a459fd]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1[starpu][abort][starpu_cuda_report_error()@drivers/cuda/driver_cuda.c:2590]
(+0xb0c96)[0x2b2227fd7c96]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(+0xb0ded)[0x========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors

or

========= COMPUTE-SANITIZER
# x	y	z	ms	GFlop/s
[starpu][starpu_interface_end_driver_copy_async_devid] Warning: the submission of asynchronous transfer from NUMA 0 to CUDA 1 took a very long time (42.814927 ms)
For proper asynchronous transfer overlapping, data registered to StarPU must be allocated with starpu_malloc() or pinned with starpu_memory_pin()
[starpu][starpu_cublas_report_error] oops in cublas_mult (mult/xgemm.c:147)... 13: execution failed 
[starpu][starpu_cublas_report_error] oops in cublas_mult (mult/xgemm.c:147)... 13: execution failed 
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(starpu_cublas_report_error+0x79)[0x/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(starpu_cublas_report_error+0x792b79b0ae9809]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1)[0x(+0x124d71)[0x2b79b0aecd71]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.12b79b0ae9809]
(_starpu_cuda_driver_run_once+0x314)/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1[0x(+0x2b79b0aed424]
124d71/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1(+0x125e51))[0x[0x2b79b0aede51]
2b79b0aecd71/lib64/libpthread.so.0]
(+0x7ea5)[0x/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.12b79b1045ea5]
(_starpu_cuda_driver_run_once+0x314/lib64/libc.so.6(clone+0x6d)[0x2b79d04e69fd]
[starpu][abort][starpu_cublas_report_error()@drivers/cuda/driver_cuda.c:2584]
)[0x2b79b0aed424]
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors

Especially the case of 3 GPUs seems to be still affected by this problem.

Regarding my own GEMM Code, posted above, it still fails but in addition to the above error, the following appear as well:

========= Host API memory access error at host access to 0x2abccfb68700 of size 256 bytes
=========     Uninitialized access at 0x2abccfb687c0 on access by cudaMemcpy source
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x327a82]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x48128]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.5.0-zutngkrw2qbccmlshetbbbcbvbykqvgc/lib64/libcudart.so.12
=========     Host Frame: [0x19ae1]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.5.0-zutngkrw2qbccmlshetbbbcbvbykqvgc/lib64/libcudart.so.12
=========     Host Frame:cudaMemcpy2DAsync [0x72d56]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.5.0-zutngkrw2qbccmlshetbbbcbvbykqvgc/lib64/libcudart.so.12
=========     Host Frame:starpu_cuda_copy2d_async_sync_devid in drivers/cuda/driver_cuda.c:1450 [0x123ad9]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:_starpu_cuda_copy2d_data_from_cuda_to_cpu in drivers/cuda/driver_cuda.c:1825 [0x123c97]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:copy_any_to_any in datawizard/interfaces/matrix_interface.c:586 [0xce89a]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:_starpu_cuda_copy_interface_from_cuda_to_cpu in drivers/cuda/driver_cuda.c:1726 [0x124494]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:__starpu_handle_node_data_requests in datawizard/data_request.c:744 [0xaff44]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:_starpu_handle_node_data_requests in datawizard/data_request.c:806 [0xb0a83]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:____starpu_datawizard_progress in datawizard/datawizard.c:52 [0xb0c95]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:___starpu_datawizard_progress in datawizard/datawizard.c:105 [0xb0e18]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:__starpu_datawizard_progress in datawizard/datawizard.c:149 [0xb0f04]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:_starpu_cuda_driver_run_once in drivers/cuda/driver_cuda.c:2449 [0x125558]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:_starpu_cuda_worker in drivers/cuda/driver_cuda.c:2524 [0x125e50]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-jklt5w5z6vjs6cbf7upsdrhslp3qzkbt/lib/libstarpu-1.4.so.1
=========     Host Frame:start_thread [0x7ea4]
=========                in /lib64/libpthread.so.0
=========     Host Frame:clone [0xfe9fc]
=========                in /lib64/libc.so.6

sthibaul · 2024-08-30T13:04:41Z

719: unspecified launch failure

This is very questioning as reported error... Could you try through compute-sanitizer --tool memcheck , to make sure that we are not disturbing the allocator, in which case the failure reported here don't seem to be starpu's fault, but somehow a cuda stack issue...

sthibaul · 2024-08-30T13:33:59Z

I've been running dgemm in a loop with 4 A100 on cuda 12.0 here, no error for half an hour

sthibaul · 2024-08-30T22:28:10Z

Did you try without compute-sanitizer --tool initcheck ? With it I'm seeing weird reports.

On tests as simple as tests/datawizard/lazy_allocation.c it reports an initialization error when transferring from a gpu to another, while tracing shows that the initialization kernel was really completed before the transfer was queued.

grisuthedragon · 2024-09-03T06:13:13Z

With the current master, the example dgemm works, but my async partitioned one ends with

$ STARPU_SCHED=dmdas STARPU_NCUDA=1 ./gemm_gpu
Start... 
[starpu][starpu_interface_end_driver_copy_async_devid] Warning: the submission of asynchronous transfer from NUMA 0 to CUDA 0 took a very long time (0.398636 ms)
For proper asynchronous transfer overlapping, data registered to StarPU must be allocated with starpu_malloc() or pinned with starpu_memory_pin()
Time: 1.19607
GFlops: 1.67215
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-arhfplcvco4r33a2cx7igizzxmdmhh7a/lib/libstarpu-1.4.so.1(+0xa9430)[0x2acdfba4e430]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-arhfplcvco4r33a2cx7igizzxmdmhh7a/lib/libstarpu-1.4.so.1(+0xaa674)[0x2acdfba4f674]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-arhfplcvco4r33a2cx7igizzxmdmhh7a/lib/libstarpu-1.4.so.1(starpu_memchunk_tidy+0x67e)[0x2acdfba6408e]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-arhfplcvco4r33a2cx7igizzxmdmhh7a/lib/libstarpu-1.4.so.1(+0xb0db2)[0x2acdfba55db2]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-arhfplcvco4r33a2cx7igizzxmdmhh7a/lib/libstarpu-1.4.so.1(+0xb0f79)[0x2acdfba55f79]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-arhfplcvco4r33a2cx7igizzxmdmhh7a/lib/libstarpu-1.4.so.1(+0xb1065)[0x2acdfba56065]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-arhfplcvco4r33a2cx7igizzxmdmhh7a/lib/libstarpu-1.4.so.1(_starpu_cuda_driver_run_once+0x449)[0x2acdfbacaa19]
/mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/starpu-master-arhfplcvco4r33a2cx7igizzxmdmhh7a/lib/libstarpu-1.4.so.1(+0x126311)[0x2acdfbacb311]
/lib64/libpthread.so.0(+0x7ea5)[0x2acdfbbb6ea5]
/lib64/libc.so.6(clone+0x6d)[0x2ace22c1b9fd]

[starpu][_starpu_select_src_node][assert failure] The data for the handle 0x385d040 is requested, but the handle does not have a valid value. Perhaps some initialization task is missing?

gemm_gpu: datawizard/coherency.c:69: _starpu_select_src_node: Assertion `0 && "src_node_mask != 0"' failed.
Aborted (core dumped)

using only one GPU. Using 2, 3, or 4 GPUs the error stays the same.

sthibaul added the question Further information is requested label Mar 27, 2024

grisuthedragon closed this as completed May 29, 2024

grisuthedragon reopened this May 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect results with asynchronous partitioning on CUDA devices and STARPU 1.4. #37

Incorrect results with asynchronous partitioning on CUDA devices and STARPU 1.4. #37

grisuthedragon commented Mar 11, 2024

grisuthedragon commented Mar 21, 2024 •

edited

Loading

sthibaul commented Mar 27, 2024

grisuthedragon commented Mar 27, 2024

sthibaul commented Mar 27, 2024

grisuthedragon commented Mar 28, 2024

grisuthedragon commented Apr 22, 2024

sthibaul commented Apr 23, 2024

grisuthedragon commented Apr 29, 2024

grisuthedragon commented May 29, 2024 •

edited

Loading

sthibaul commented Aug 29, 2024

grisuthedragon commented Aug 29, 2024

sthibaul commented Aug 30, 2024

sthibaul commented Aug 30, 2024

sthibaul commented Aug 30, 2024

grisuthedragon commented Sep 3, 2024

Incorrect results with asynchronous partitioning on CUDA devices and STARPU 1.4. #37

Incorrect results with asynchronous partitioning on CUDA devices and STARPU 1.4. #37

Comments

grisuthedragon commented Mar 11, 2024

StarPU 1.3.11 / CUDA 11.8 / GCC 12

StarPU 1.3.11 / CUDA 12.2/ GCC 12

StarPU 1.4.4 / CUDA 12.2/ GCC 12

grisuthedragon commented Mar 21, 2024 • edited Loading

sthibaul commented Mar 27, 2024

grisuthedragon commented Mar 27, 2024

sthibaul commented Mar 27, 2024

grisuthedragon commented Mar 28, 2024

grisuthedragon commented Apr 22, 2024

sthibaul commented Apr 23, 2024

grisuthedragon commented Apr 29, 2024

grisuthedragon commented May 29, 2024 • edited Loading

sthibaul commented Aug 29, 2024

grisuthedragon commented Aug 29, 2024

sthibaul commented Aug 30, 2024

sthibaul commented Aug 30, 2024

sthibaul commented Aug 30, 2024

grisuthedragon commented Sep 3, 2024

grisuthedragon commented Mar 21, 2024 •

edited

Loading

grisuthedragon commented May 29, 2024 •

edited

Loading