Nightly test failures with cusolver tpl enabled, Cuda.svd_* unit tests #2096

Closed
ndellingwood opened this issue Feb 6, 2024 · 6 comments

@ndellingwood
Contributor

Nightly test failures are occurring in the SVD unit tests when the cuSOLVER TPL is enabled, with messages of the form "CUSOLVER does not support SVD for matrices with more columns than rows...":

...
[ RUN      ] Cuda.svd_float
Running impl_test_svd with sizes: 0x0
Running impl_test_svd with sizes: 1x1
Running impl_test_svd with sizes: 15x15
Running impl_test_svd with sizes: 100x100
Running impl_test_svd with sizes: 100x70
Running impl_test_svd with sizes: 70x100
unknown file: Failure
C++ exception with description "CUSOLVER does not support SVD for matrices with more columns than rows, you can transpose you matrix first then compute SVD of that transpose: At=VSUt, and swap the output U and Vt and transpose them to recover the desired SVD." thrown in the test body.
[  FAILED  ] Cuda.svd_float (215 ms)
[ RUN      ] Cuda.svd_double
Running impl_test_svd with sizes: 0x0
Running impl_test_svd with sizes: 1x1
Running impl_test_svd with sizes: 15x15
Running impl_test_svd with sizes: 100x100
Running impl_test_svd with sizes: 100x70
Running impl_test_svd with sizes: 70x100
unknown file: Failure
C++ exception with description "CUSOLVER does not support SVD for matrices with more columns than rows, you can transpose you matrix first then compute SVD of that transpose: At=VSUt, and swap the output U and Vt and transpose them to recover the desired SVD." thrown in the test body.
[  FAILED  ] Cuda.svd_double (169 ms)
[ RUN      ] Cuda.svd_complex_float
Running impl_test_svd with sizes: 0x0
Running impl_test_svd with sizes: 1x1
Running impl_test_svd with sizes: 15x15
Running impl_test_svd with sizes: 100x100
Running impl_test_svd with sizes: 100x70
Running impl_test_svd with sizes: 70x100
unknown file: Failure
C++ exception with description "CUSOLVER does not support SVD for matrices with more columns than rows, you can transpose you matrix first then compute SVD of that transpose: At=VSUt, and swap the output U and Vt and transpose them to recover the desired SVD." thrown in the test body.
[  FAILED  ] Cuda.svd_complex_float (181 ms)
[----------] 12 tests from Cuda (8252 ms total)

[----------] Global test environment tear-down
[==========] 12 tests from 1 test case ran. (8252 ms total)
[  PASSED  ] 9 tests.
[  FAILED  ] 3 tests, listed below:
[  FAILED  ] Cuda.svd_float
[  FAILED  ] Cuda.svd_double
[  FAILED  ] Cuda.svd_complex_float
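
For reference, the workaround named in the exception message can be written out as a short derivation. This is a sketch of the algebra only, not the fix that was applied in the library; the symbols B, W and Z are introduced here for illustration. The conjugate transpose is used so that the same identity covers the complex_float case, and for real scalars it reduces to the plain transpose in the message (At = V S Ut).

```latex
% A is m x n with m < n (e.g. 70 x 100), so B = A^H is n x m (100 x 70)
% and satisfies the "rows >= columns" restriction.
\begin{aligned}
B &= A^{H} = W \,\Sigma\, Z^{H}
  && \text{(SVD of the tall matrix } B\text{)} \\
A &= B^{H} = \left(W \,\Sigma\, Z^{H}\right)^{H} = Z \,\Sigma^{T}\, W^{H}
  && (\Sigma \text{ is real, so } \Sigma^{H} = \Sigma^{T}) \\
U &= Z, \qquad V^{H} = W^{H}
  && \text{(swap the two factors, then conjugate-transpose the one used as } V^{H}\text{)}
\end{aligned}
```

The singular values are unchanged; only the roles of the left and right singular vectors are exchanged.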

Adding @lucbv; cross-reference #2092

Reproducer (weaver rhel8 queue):

source /projects/ppc64le-pwr9-rhel8/legacy-env.sh
module load cuda/11.2.2/gcc/8.3.1 cmake/3.23.1

${KOKKOSKERNELS_PATH}/cm_generate_makefile.bash --with-cuda --with-serial --compiler=${KOKKOS_PATH}/bin/nvcc_wrapper --arch=Volta70,Power9 --with-cuda-options=enable_lambda --kokkos-path=${KOKKOS_PATH} --kokkoskernels-path=${KOKKOSKERNELS_PATH}  --cxxflags=${CXXFLAGS} --with-scalars='float,complex_float' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --with-tpls=cusparse,cublas,cusolver --cxxstandard=17
@lucbv
Contributor

lucbv commented Feb 6, 2024

Hmm, this was tested with the auto-tester, but I guess there is at least one corner case in which we are calling the test when we really should not.
I'll have a fix for that this week, thanks for pinging me @ndellingwood

@lucbv lucbv self-assigned this Feb 6, 2024
@ndellingwood
Contributor Author

@lucbv thanks for #2103, that resolved the "CUSOLVER does not support SVD for matrices with more columns than rows..." type messages, but I am still seeing tolerance-related failures in the cuda/11.2.2 build on Weaver:

07:04:28 [ RUN      ] Cuda.svd_float
07:04:28 Running impl_test_svd with sizes: 0x0
07:04:28 Running impl_test_svd with sizes: 1x1
07:04:28 Running impl_test_svd with sizes: 15x15
07:04:28 Running impl_test_svd with sizes: 100x100
07:04:28 Running impl_test_svd with sizes: 100x70
07:04:28 Running impl_test_svd with sizes: 70x100
07:04:28 /home/jenkins/weaver/workspace/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp:159: Failure
07:04:28 Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
07:04:28 /home/jenkins/weaver/workspace/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp:159: Failure
07:04:28 Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
...
07:04:31 [  FAILED  ] Cuda.svd_float
07:04:31 [  FAILED  ] Cuda.svd_double
07:04:31 [  FAILED  ] Cuda.svd_complex_float
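
A note on the numbers in that log: the printed tolerance 0.00119209 equals 10^4 times the single-precision machine epsilon (about 1.19209e-07). The sketch below reproduces that arithmetic and the shape of the failing comparison. The helper names `within_abs_tol` and `eps_multiple` are hypothetical and are not the macros in KokkosKernels_TestUtils.hpp, and how the test actually chooses its tolerance is not shown in the log, so the "multiple of epsilon" reading is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: same comparison shape as the failing assertion,
// |val1 - val2| <= |tol|, evaluated in double precision.
inline bool within_abs_tol(double val1, double val2, double tol) {
  return std::abs(val1 - val2) <= std::abs(tol);
}

// Tolerance expressed as a multiple of machine epsilon for scalar type T.
// Assumption: the test tolerance is built this way; the log only shows the value.
template <class T>
double eps_multiple(double factor) {
  assert(factor >= 0.0);  // a negative factor would not be a meaningful tolerance
  return factor * static_cast<double>(std::numeric_limits<T>::epsilon());
}
```

With `eps_multiple<float>(1.0e4)` as the tolerance, a difference of 1 exceeds the bound by almost three orders of magnitude, which points to a wrong or missing entry rather than accumulated rounding error.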

@lucbv
Contributor

lucbv commented Feb 13, 2024

@ndellingwood this particular one seems to be gone, although there is a new Cuda issue in the nightly that looks like this:

Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209
<https://jenkins-son.sandia.gov/job/KokkosKernels_Weaver_Cuda_Serial_cuda_1122_gcc_831_cusparse_cublas_cusolver_float_cplxfloat/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 0.00119209

and also an issue with oneMKL that looks similar?

Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14

Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14

Intel MKL ERROR: Parameter 8 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14

Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 3 vs 6.66134e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 4 vs 8.88178e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 5 vs 1.11022e-13

Intel MKL ERROR: Parameter 10 was incorrect on entry to DGEMM .
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
<https://jenkins-son.sandia.gov/job/KokkosKernels_Nightly_Blake_OneAPI_2023_1_0_OpenMP_Serial_SPR-oneMKL/ws/kokkos-kernels/test_common/KokkosKernels_TestUtils.hpp>:159: Failure
Expected: ((double)AT1::abs(val1 - val2)) <= ((double)AT3::abs(tol)), actual: 1 vs 2.22045e-14
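
For reading the MKL messages: in the standard DGEMM argument list (TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC), parameter 8 is LDA and parameter 10 is LDB. The sketch below mirrors the argument checks of the reference (column-major) BLAS DGEMM and returns the index of the first argument it would reject. The function names are hypothetical, and whether MKL rejected the calls in this log for exactly these reasons is an assumption, not something established in this thread.

```cpp
#include <algorithm>
#include <cassert>

// Returns 0 if the arguments are consistent, otherwise the 1-based index of
// the first bad parameter, following the checks in reference BLAS DGEMM.
inline int dgemm_first_bad_param(char transa, char transb, int m, int n, int k,
                                 int lda, int ldb, int ldc) {
  auto valid = [](char t) {
    return t == 'N' || t == 'n' || t == 'T' || t == 't' || t == 'C' || t == 'c';
  };
  const bool nota = (transa == 'N' || transa == 'n');
  const bool notb = (transb == 'N' || transb == 'n');
  // Number of rows of A and B as stored, which bounds the leading dimensions.
  const int nrowa = nota ? m : k;
  const int nrowb = notb ? k : n;
  if (!valid(transa)) return 1;
  if (!valid(transb)) return 2;
  if (m < 0) return 3;
  if (n < 0) return 4;
  if (k < 0) return 5;
  if (lda < std::max(1, nrowa)) return 8;   // "Parameter 8 was incorrect"
  if (ldb < std::max(1, nrowb)) return 10;  // "Parameter 10 was incorrect"
  if (ldc < std::max(1, m)) return 13;
  return 0;
}

// Debug-build guard: fail early instead of letting the BLAS print an error.
inline void require_valid_dgemm_args(char transa, char transb, int m, int n,
                                     int k, int lda, int ldb, int ldc) {
  assert(dgemm_first_bad_param(transa, transb, m, n, k, lda, ldb, ldc) == 0);
}
```

Two situations trip these checks: a leading dimension of 0, which can come from an empty matrix because the rule requires at least 1, and a leading dimension taken from the wrong extent of a non-square matrix when one operand is transposed.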

@ndellingwood
Contributor Author

@lucbv I posted that as a follow-on to this issue. I'll open a separate issue for tracking.

@lucbv
Contributor

lucbv commented Feb 13, 2024

Okay, let me know if you want to close this one then. I think the issue with non-square matrices on CUDA should be resolved, but the problem above is new, so I will need to investigate.
The DGEMM complaint from MKL makes me think there is a problem in how I check the unitary matrices, or even in the triple product USVt = A, so hopefully it should be an easy fix. It is a bit interesting that it only appears now and not in previous builds.
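
The two checks mentioned here can be sketched in a few lines for the real-valued case. This is a minimal illustration, not the code in the Kokkos Kernels test: the `Mat` type and the function names are hypothetical, matrices are stored row-major, and a full SVD is assumed (U is m x m, Vt is n x n, and the singular values sit on the diagonal of an m x n matrix).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal row-major dense matrix, for illustration only.
struct Mat {
  std::size_t rows, cols;
  std::vector<double> a;
  Mat(std::size_t r, std::size_t c) : rows(r), cols(c), a(r * c, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return a[i * cols + j]; }
  double operator()(std::size_t i, std::size_t j) const { return a[i * cols + j]; }
};

inline Mat transpose(const Mat& X) {
  Mat T(X.cols, X.rows);
  for (std::size_t i = 0; i < X.rows; ++i)
    for (std::size_t j = 0; j < X.cols; ++j) T(j, i) = X(i, j);
  return T;
}

inline Mat matmul(const Mat& X, const Mat& Y) {
  assert(X.cols == Y.rows);  // inner extents must agree
  Mat Z(X.rows, Y.cols);
  for (std::size_t i = 0; i < X.rows; ++i)
    for (std::size_t p = 0; p < X.cols; ++p)
      for (std::size_t j = 0; j < Y.cols; ++j) Z(i, j) += X(i, p) * Y(p, j);
  return Z;
}

// Largest entry of |Q^T Q - I|; zero for an exactly orthogonal Q.
inline double orthogonality_error(const Mat& Q) {
  const Mat G = matmul(transpose(Q), Q);
  double err = 0.0;
  for (std::size_t i = 0; i < G.rows; ++i)
    for (std::size_t j = 0; j < G.cols; ++j)
      err = std::max(err, std::abs(G(i, j) - (i == j ? 1.0 : 0.0)));
  return err;
}

// Largest entry of |U * S * Vt - A|, with S the m x n matrix holding s on its diagonal.
inline double reconstruction_error(const Mat& A, const Mat& U,
                                   const std::vector<double>& s, const Mat& Vt) {
  Mat S(A.rows, A.cols);
  for (std::size_t i = 0; i < std::min(A.rows, A.cols) && i < s.size(); ++i)
    S(i, i) = s[i];
  const Mat R = matmul(matmul(U, S), Vt);  // (m x m)(m x n)(n x n)
  double err = 0.0;
  for (std::size_t i = 0; i < A.rows; ++i)
    for (std::size_t j = 0; j < A.cols; ++j)
      err = std::max(err, std::abs(R(i, j) - A(i, j)));
  return err;
}
```

For a non-square A the two products have different inner extents (m for U times S, n for S times Vt), so an extent or leading dimension that is correct for square inputs can be wrong for 100x70 or 70x100.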

@ndellingwood
Contributor Author

@lucbv correct, this issue is resolved so I'll open a new issue for the different types of failures. There had been some preexisting MKL failures and I hadn't noticed the new stuff come through, thanks for catching that!
