Nightly test failures in Cuda with rdc+uvm builds: cuda.sparse_block_spgemm tests #1413

Closed
ndellingwood opened this issue May 18, 2022 · 3 comments

Comments

@ndellingwood
Contributor

@lucbv I'm also seeing runtime test failures after the merge of PR #1099 (no changes were merged to kokkos the day this test began failing), for example in the cuda/10.0 build with rdc and uvm enabled:

08:38:11 4: [ RUN      ] cuda.sparse_block_spgemm_kokkos_complex_double_int_int_TestExecSpace
08:38:11 4: /home/jenkins/weaver-new/workspace/KokkosKernels_Weaver_Cuda_cuda_100_gcc_740_rdc-uvm/kokkos-kernels/unit_test/sparse/Test_Sparse_bspgemm.hpp:274: Failure
08:38:11 4: Value of: is_expected_to_fail
08:38:11 4:   Actual: false
08:38:11 4: Expected: true
08:38:11 4: SPGEMM_KK: Kokkos::Impl::ParallelFor< Cuda > requested too large team size.
08:38:11 4: /home/jenkins/weaver-new/workspace/KokkosKernels_Weaver_Cuda_cuda_100_gcc_740_rdc-uvm/kokkos-kernels/unit_test/sparse/Test_Sparse_bspgemm.hpp:277: Failure
08:38:11 4: Value of: failed
08:38:11 4:   Actual: true
08:38:11 4: Expected: is_expected_to_fail
08:38:11 4: Which is: false
08:38:11 4: entries are different.
08:38:11 4: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... ... ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
08:38:11 4: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... ... ... 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 
08:38:11 4: /home/jenkins/weaver-new/workspace/KokkosKernels_Weaver_Cuda_cuda_100_gcc_740_rdc-uvm/kokkos-kernels/unit_test/sparse/Test_Sparse_bspgemm.hpp:285: Failure
08:38:11 4: Value of: is_identical
08:38:11 4:   Actual: false
08:38:11 4: Expected: true
08:38:11 4: SPGEMM_KK
08:38:11 4: [  FAILED  ] cuda.sparse_block_spgemm_kokkos_complex_double_int_int_TestExecSpace (8487 ms)
08:38:11 4: [ RUN      ] cuda.sparse_block_spgemm_kokkos_complex_double_int_size_t_TestExecSpace
08:38:11 4: /home/jenkins/weaver-new/workspace/KokkosKernels_Weaver_Cuda_cuda_100_gcc_740_rdc-uvm/kokkos-kernels/unit_test/sparse/Test_Sparse_bspgemm.hpp:274: Failure
08:38:11 4: Value of: is_expected_to_fail
08:38:11 4:   Actual: false
08:38:11 4: Expected: true
08:38:11 4: SPGEMM_KK: Kokkos::Impl::ParallelFor< Cuda > requested too large team size.
08:38:11 4: /home/jenkins/weaver-new/workspace/KokkosKernels_Weaver_Cuda_cuda_100_gcc_740_rdc-uvm/kokkos-kernels/unit_test/sparse/Test_Sparse_bspgemm.hpp:277: Failure
08:38:11 4: Value of: failed
08:38:11 4:   Actual: true
08:38:11 4: Expected: is_expected_to_fail
08:38:11 4: Which is: false
08:38:11 4: entries are different.
08:38:11 4: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... ... ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
08:38:11 4: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... ... ... 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 
08:38:11 4: /home/jenkins/weaver-new/workspace/KokkosKernels_Weaver_Cuda_cuda_100_gcc_740_rdc-uvm/kokkos-kernels/unit_test/sparse/Test_Sparse_bspgemm.hpp:285: Failure
08:38:11 4: Value of: is_identical
08:38:11 4:   Actual: false
08:38:11 4: Expected: true
08:38:11 4: SPGEMM_KK
08:38:11 4: [  FAILED  ] cuda.sparse_block_spgemm_kokkos_complex_double_int_size_t_TestExecSpace (8494 ms)
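
For readers parsing the log: the assertions at lines 274 and 277 of Test_Sparse_bspgemm.hpp suggest the test catches an exception from the kernel call, records it in `failed`, and then compares that against `is_expected_to_fail`. Here the "requested too large team size" exception was thrown for a case that was expected to pass. Below is a minimal C++ sketch of that pattern, not the actual test code; `Outcome`, `run_and_capture`, `outcome_matches`, and `demo` are hypothetical names, and a plain lambda stands in for the SPGEMM call.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical record of what happened when the kernel was invoked.
struct Outcome {
  bool failed;          // true if the call threw
  std::string message;  // exception text, empty on success
};

// Run a callable and capture whether it threw (assumed to mirror the
// try/catch around the spgemm call in the unit test).
template <class F>
Outcome run_and_capture(F&& kernel) {
  try {
    kernel();
    return {false, ""};
  } catch (const std::exception& e) {
    return {true, e.what()};
  }
}

// A case passes this check only when the observed outcome matches the
// expectation, in the spirit of comparing `failed` to `is_expected_to_fail`.
bool outcome_matches(const Outcome& o, bool is_expected_to_fail) {
  return o.failed == is_expected_to_fail;
}

// Reproduces the situation in the log: the call throws although the
// case was expected to succeed, so the check does not hold.
void demo() {
  Outcome o = run_and_capture(
      [] { throw std::runtime_error("requested too large team size."); });
  assert(o.failed);
  assert(!outcome_matches(o, /*is_expected_to_fail=*/false));
}
```

Read this way, the third failure (`is_identical` at line 285) would follow from the first: if the kernel never ran, the output entries stay at zero, which matches the all-zero row printed in the log.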

Reproducer (weaver):

module load cmake/3.19.3 cuda/10.0.130 ibm/xl/16.1.1 gcc/7.4.0

$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-devices=Cuda,Serial --arch=Power9,Volta70 --compiler=$KOKKOS_PATH/bin/nvcc_wrapper --cxxflags="-O3 -Wall -Wunused-parameter -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized " --cxxstandard="14" --kokkos-path=$KOKKOS_PATH --kokkoskernels-path=$KOKKOSKERNELS_PATH --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --with-cuda-options=enable_lambda,uvm,rdc   --no-examples

Edit: Also occurs with cuda/9.2.88 on the same system

Originally posted by @ndellingwood in #1395 (comment)

@ndellingwood ndellingwood changed the title Nightly test failures in Cuda with rdc+uvm: cuda.sparse_block_spgemm tests Nightly test failures in Cuda with rdc+uvm builds: cuda.sparse_block_spgemm tests May 18, 2022
@ndellingwood
Contributor Author

I split this issue out from #1395 (filed as build errors), which had collected various nightly failures following #1099.

@lucbv
Contributor

lucbv commented Jul 19, 2022

This should be resolved by @brian-kelley's PR #1470.
@ndellingwood the PR was merged this morning, so let's keep an eye on tomorrow's nightlies; hopefully the uvm+rdc build will pass.

@ndellingwood
Contributor Author

The rdc+uvm nightlies that included #1470 have resumed passing, thanks for the fix!
