Nightly test failures, Cuda.svd_* and MKL DGEMM #2105

Closed
ndellingwood opened this issue Feb 13, 2024 · 5 comments

@ndellingwood
Contributor

ndellingwood commented Feb 13, 2024

Nightly test failures, follow-up to #2096

Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
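The assertion repeated in the log above compares the absolute difference of two values against the absolute value of a tolerance. A minimal sketch of that comparison follows, using `std::abs` in place of the Kokkos arithmetic traits (`AT1`, `AT3`) that the real helper uses. The names `within_tolerance` and `relative_tolerance` are hypothetical. The logged float tolerance, 0.00119209, is about 1e4 times the single-precision machine epsilon, and the double tolerances in the oneMKL log are 100 times the double epsilon multiplied by the reported difference (1, 3, 4, 5). That pattern is consistent with a tolerance that scales with the reference value and with results that are off by the full magnitude of that value, but this is an inference from the numbers and not something the thread states.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical stand-in for the check reported at KokkosKernels_TestUtils.hpp:159.
// The real helper uses Kokkos arithmetic traits instead of std::abs.
template <class Scalar>
bool within_tolerance(Scalar val1, Scalar val2, Scalar tol) {
  return static_cast<double>(std::abs(val1 - val2)) <=
         static_cast<double>(std::abs(tol));
}

// Assumed tolerance model: a multiple of machine epsilon scaled by the
// magnitude of the reference value. The factor is a guess read off the log.
template <class Scalar>
Scalar relative_tolerance(Scalar reference, Scalar factor) {
  return factor * std::numeric_limits<Scalar>::epsilon() * std::abs(reference);
}
```

Under this model, a computed value of 0 against a reference of 3 with a factor of 100 fails the check, which matches the shape of the "3 vs 6.66134e-14" style lines in the oneMKL log.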

and also an issue with oneMKL that looks similar?

Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14

Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14

Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14

Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 3 vs 6.66134e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 4 vs 8.88178e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 5 vs 1.11022e-13

Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
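In the Fortran-style DGEMM argument list, parameters 8 and 10 are the leading dimensions LDA and LDB, so the MKL messages above indicate that the leading dimension passed for A or B was rejected. The sketch below reproduces the leading-dimension check from the reference BLAS DGEMM, assuming MKL applies an equivalent rule. `check_gemm_leading_dims` is a hypothetical helper for illustration, not an MKL or Kokkos Kernels function, and the log alone does not establish why the values were invalid.

```cpp
#include <algorithm>
#include <cassert>

// Fortran DGEMM argument positions:
//  1 TRANSA, 2 TRANSB, 3 M, 4 N, 5 K, 6 ALPHA, 7 A, 8 LDA,
//  9 B, 10 LDB, 11 BETA, 12 C, 13 LDC.
// Returns 0 if the leading dimensions are acceptable, otherwise the 1-based
// position of the first offending parameter (8, 10 or 13).
// Hypothetical helper, modeled on the reference BLAS argument check.
int check_gemm_leading_dims(char transa, char transb, int m, int n, int k,
                            int lda, int ldb, int ldc) {
  const bool nota = (transa == 'N' || transa == 'n');
  const bool notb = (transb == 'N' || transb == 'n');
  const int nrowa = nota ? m : k;  // rows of A as stored (column-major)
  const int nrowb = notb ? k : n;  // rows of B as stored (column-major)
  if (lda < std::max(1, nrowa)) return 8;
  if (ldb < std::max(1, nrowb)) return 10;
  if (ldc < std::max(1, m)) return 13;
  return 0;
}
```

For example, a non-transposed 4x2 matrix A needs `lda >= 4`. Passing a leading dimension of 0, as could happen if an extent is read with the wrong integer width or from an empty view, would be reported as parameter 8.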

Originally posted by @lucbv in #2096 (comment)

Reproducer (weaver rhel8 queue):

source /projects/ppc64le-pwr9-rhel8/legacy-env.sh
module load cuda/11.2.2/gcc/8.3.1 cmake/3.23.1

${KOKKOSKERNELS_PATH}/cm_generate_makefile.bash --with-cuda --with-serial --compiler=${KOKKOS_PATH}/bin/nvcc_wrapper --arch=Volta70,Power9 --with-cuda-options=enable_lambda --kokkos-path=${KOKKOS_PATH} --kokkoskernels-path=${KOKKOSKERNELS_PATH}  --cxxflags=${CXXFLAGS} --with-scalars='float,complex_float' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --with-tpls=cusparse,cublas,cusolver --cxxstandard=17
@lucbv
Contributor

lucbv commented Feb 14, 2024

@ndellingwood the PR above should fix the CUDA side of the problem. I will test it on the Blake oneAPI build and see if that cleans things up there as well.
Interestingly, I did not do anything for PVC, so I would expect nothing to run there, but if we have MKL enabled on the host side there could be something going on... I may do a second PR so that we can merge the first one quickly and clean up some of our nightly builds!

@lucbv
Contributor

lucbv commented Feb 15, 2024

Okay, so far I am not seeing the CUDA error this morning. Let us wait until the afternoon in case some tests finish late, but this looks like a promising start. I'll have a look at the Intel/MKL issue; hopefully I can sort it out and close this issue soon! : )

@lucbv
Contributor

lucbv commented Feb 15, 2024

Okay PR #2110 just merged so let's keep an eye on this.
I plan on making a bigger follow-up PR that will address all of the BLAS kernels so that they run properly whatever MKL's choice of integer width... This should significantly clean up some of the segfaults we see in the nightly oneAPI builds!
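MKL's integer width presumably refers to its LP64 and ILP64 interfaces, where `MKL_INT` is a 32-bit or a 64-bit integer respectively. A minimal sketch of one way to guard a BLAS call against a width mismatch follows: convert each extent explicitly to the library's integer type and fail loudly if it does not fit. `to_blas_int` and the `BlasInt` template parameter are hypothetical names, and this is not the change made in the PRs mentioned in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Converts a 64-bit extent to the integer type the BLAS library was built
// with. BlasInt is a placeholder: with MKL it would correspond to MKL_INT
// (assumed 32-bit under LP64 and 64-bit under ILP64).
template <class BlasInt>
BlasInt to_blas_int(std::int64_t value) {
  if (value < static_cast<std::int64_t>(std::numeric_limits<BlasInt>::min()) ||
      value > static_cast<std::int64_t>(std::numeric_limits<BlasInt>::max())) {
    throw std::overflow_error("extent does not fit in the BLAS integer type");
  }
  return static_cast<BlasInt>(value);
}
```

Routing every dimension and leading dimension through a single conversion point like this keeps the call sites independent of which interface the library was linked against.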

@lucbv
Contributor

lucbv commented Feb 16, 2024

@ndellingwood this should be resolved now; I did not see the error come up in last night's build.
One more thing though: I am adding PR #2112 to generalize the fix and hopefully clean up some of our oneMKL issues.

@lucbv
Contributor

lucbv commented Feb 19, 2024

Okay, PR #2112 has merged now; let us see if we get improvements in our nightly build on Blake.
I think some builds should have quite a few more unit tests passing now!

lucbv self-assigned this Feb 20, 2024
lucbv added the bug, InDevelop, and Cleanup (code maintenance that isn't a bugfix or new feature) labels Feb 20, 2024