Nightly test failure with cuda/11.1 in cuda.batched_scalar_team_CG_double #1212

Closed
ndellingwood opened this issue Dec 3, 2021 · 6 comments

@ndellingwood
Contributor

Following the merge of PR #1155, the cuda.batched_scalar_team_CG_double test began failing in nightly builds with cuda/11.1 + gcc/8.3.0 (host compiler) on the Volta70 architecture:

Failure output:

21:34:38       Start  3: batched_sla_cuda
21:34:38 
21:34:38 3: Test command: /data/jenkins-new/workspace/KokkosKernels_semsrhel7gpu01_cuda111_gcc830/TestAll_2021-12-02_20.33.06/cuda/11.1/Cuda_Pthread-release/unit_test/KokkosKernels_batched_sla_cuda
21:34:38 3: Test timeout computed to be: 2500
21:34:38 3: [==========] Running 7 tests from 1 test case.
21:34:38 3: [----------] Global test environment set-up.
21:34:38 3: [----------] 7 tests from cuda
21:34:38 3: [ RUN      ] cuda.batched_scalar_serial_spmv_nt_double_double
21:34:38 3: [       OK ] cuda.batched_scalar_serial_spmv_nt_double_double (73 ms)
21:34:38 3: [ RUN      ] cuda.batched_scalar_team_CG_double
21:34:38 3: terminate called after throwing an instance of 'std::runtime_error'
21:34:38 3:   what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /data/jenkins-new/workspace/KokkosKernels_semsrhel7gpu01_cuda111_gcc830/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:140
21:34:38 3: Traceback functionality not available
21:34:38 3: 
21:34:38  3/23 Test  #3: batched_sla_cuda .................Child aborted***Exception:   0.82 sec

The Kokkos changes since the previous night's passing test should not affect kokkos-kernels testing: they consist of trailing whitespace removal, removal of a comment, and CI-script changes, with no source code changes.

Reproducer:

SHAs:
kokkos: kokkos/kokkos@c840dad
kokkos-kernels: 133b7fc

module load sems-env sems-cmake/3.17.1 sems-cuda/11.1 sems-gcc/8.3.0

$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-devices=Cuda,Serial --arch=Volta70 --compiler=$KOKKOS_PATH/bin/nvcc_wrapper --cxxflags="-O3 -Wall -Wunused-parameter -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized " --cxxstandard="14" --kokkos-path=$KOKKOS_PATH --kokkoskernels-path=$KOKKOSKERNELS_PATH --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --with-cuda-options=enable_lambda   --no-examples

@kliegeois may I assign this to you to investigate? I think it should be reproducible on the kokkos-dev-2 test machine.

@lucbv
Contributor

lucbv commented Dec 3, 2021

Yes, I am adding @kliegeois as assignee. It's part of the learning pain to see how the code builds and runs on all the nightly configurations :)

@kliegeois
Contributor

I am on it!

@kliegeois
Contributor

I confirm that it is reproducible on the kokkos-dev-2 test machine; I am starting to investigate it.

@ndellingwood
Contributor Author

@kliegeois another nightly build on Weaver had some uninitialized variables that triggered -Werror. Hopefully you can catch these while investigating the CUDA issue as well:

03:11:45 /home/jenkins/weaver-new/workspace/KokkosKernels_Weaver_SerialOpenMP_gcc_720_cpp17/kokkos-kernels/src/batched/sparse/impl/KokkosBatched_GMRES_Team_Impl.hpp:211:19: error: uninitialized variable ‘G_new’ in ‘constexpr’ function
03:11:45                    G_new;
...
03:11:45 /home/jenkins/weaver-new/workspace/KokkosKernels_Weaver_SerialOpenMP_gcc_720_cpp17/kokkos-kernels/src/batched/sparse/impl/KokkosBatched_GMRES_Team_Impl.hpp:212:61: error: uninitialized variable ‘alpha’ in ‘constexpr’ function
03:11:45                typename VectorViewType::non_const_value_type alpha;

Reproducer (Weaver testbed):

module load cmake/3.19.3 gcc/7.2.0

$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-devices=Serial --arch=Power9,Volta70 --compiler=g++ --cxxflags="-O3 -Wall -Wunused-parameter -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wignored-qualifiers -Wempty-body -Wclobbered -Wuninitialized " --cxxstandard="17" --kokkos-path=$KOKKOS_PATH --kokkoskernels-path=$KOKKOSKERNELS_PATH --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --no-examples
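
For context on the two diagnostics above: before C++20, a constexpr function may not define a local variable without an initializer, so declarations like `G_new;` and `alpha;` are rejected when the compiler treats the enclosing function as constexpr. The following is a minimal sketch of that pattern and one possible way around it, not the actual change made to KokkosBatched_GMRES_Team_Impl.hpp. The names `value_type` and `scale_sum` are hypothetical and exist only for illustration.

```cpp
// Hypothetical stand-in for VectorViewType::non_const_value_type.
using value_type = double;

// Rejected before C++20: a constexpr function may not define a
// variable without an initializer.
//
//   constexpr value_type scale_sum_bad(value_type g, value_type h) {
//     value_type G_new;   // error: uninitialized variable in constexpr function
//     value_type alpha;   // error: uninitialized variable in constexpr function
//     alpha = g + h;
//     G_new = alpha * 2;
//     return G_new;
//   }

// Accepted (C++14 and later): each local is initialized at its declaration,
// then assigned as before.
constexpr value_type scale_sum(value_type g, value_type h) {
  value_type G_new = value_type(0);
  value_type alpha = value_type(0);
  alpha = g + h;
  G_new = alpha * 2;
  return G_new;
}

// Compile-time check that the function is usable in a constant expression.
static_assert(scale_sum(1.0, 2.0) == 6.0, "evaluates at compile time");
```

Initializing at the declaration also keeps `-Wuninitialized` quiet on the other nightly configurations, assuming a zero initial value is acceptable for the scalar type in use.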

kliegeois added a commit to kliegeois/kokkos-kernels that referenced this issue Dec 7, 2021
lucbv added a commit that referenced this issue Dec 8, 2021
@kliegeois
Contributor

@ndellingwood did it pass correctly?

@ndellingwood
Contributor Author

@kliegeois your PR resolved the failing nightlies, thanks!
