Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KokkosBatched QR PR breaking nightly tests #691

Closed
ndellingwood opened this issue Apr 16, 2020 · 13 comments
Closed

KokkosBatched QR PR breaking nightly tests #691

ndellingwood opened this issue Apr 16, 2020 · 13 comments

Comments

@ndellingwood
Copy link
Contributor

ndellingwood commented Apr 16, 2020

Merge of PR #686 is breaking various nightly kokkos-kernels PR testing.

@kyungjoo-kim if this will take awhile to fix we should revert the commit

Sample error messages:
kokkos-dev-2 tpl spot-check intel-18.0.5-OpenMP-release build failure

/home/jenkins/slave/workspace/KokkosKernels_KokkosDev2_SPOTCHECK_TPLS/kokkos-kernels/src/batched/KokkosBatched_Eigendecomposition_Serial_Internal.hpp(90): error: name followed by "::" must be a class or namespace name
            SerialCopyInternal::invoke(m, m, QZ, qzs0, qzs1);
...

Reproducer:
kokkos-dev-2

  #   Load modules:
        module load sems-env sems-cmake/3.12.2 kokkos-env kokkos-hwloc/1.10.1/base sems-intel/18.0.5

  #   Use generate_makefile line below to call cmake which generates makefile for this build:
        $KokkosKernelsPath/cm_generate_makefile.bash --with-devices=OpenMP --arch=SNB,Volta70 --compiler=/projects/sems/install/rhel7-x86_64/sems/compiler/intel/18.0.5/base/bin/icpc --cxxflags="-O3 -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized " --cxxstandard="11" --ldflags=""  --kokkos-path=$KokkosPath --kokkoskernels-path=$KokkosKernelsPath --with-scalars='' --with-ordinals= --with-offsets= --with-layouts= --with-tpls=mkl    --with-options= --with-cuda-options= --no-examples --with-options=enable_large_mem_tests

white cuda-9.2.88-Cuda_Serial-debug
(cuda-9.2.88-OpenMP_Cuda-release also failing)

The following tests FAILED:
          2 - batched_dla_cuda (Failed)

Reproducer

  #   Load modules:
        module load cmake/3.12.3 cuda/9.2.88 gcc/7.2.0 ibm/xl/16.1.0

  #   Use generate_makefile line below to call cmake which generates makefile for this build:
        $KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-devices=Cuda,Serial --arch=Power8,Pascal60 --compiler=$KOKKOS_PATH/bin/nvcc_wrapper --cxxflags="-g -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized " --cxxstandard="11" --ldflags="" --with-cuda=/home/projects/ppc64le-pwr8-nvidia/cuda/9.2.88 --kokkos-path=$KOKKOS_PATH --kokkoskernels-path=$KOKKOSKERNELS_PATH --with-scalars='' --with-ordinals= --with-offsets= --with-layouts= --with-tpls=    --with-options= --with-cuda-options=enable_lambda,force_uvm --no-examples --with-options=enable_large_mem_tests --debug

blake intel-19.3.199-OpenMP-release

The following tests FAILED:
          2 - batched_dla_openmp (Failed)

Reproducer

  #   Load modules:
        module load cmake/3.12.3 intel/compilers/19.3.199

  #   Use generate_makefile line below to call cmake which generates makefile for this build:
        $KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-devices=OpenMP --arch=SKX --compiler=/home/projects/x86-64/intel/compilers/2019/compilers_and_libraries_2019.3.199/linux/bin/intel64/icpc --cxxflags="-O3 -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized " --cxxstandard="11" --ldflags=""  --kokkos-path=$KOKKOS_PATH --kokkoskernels-path=$KOKKOSKERNELS_PATH --with-scalars='complex_double' --with-ordinals= --with-offsets= --with-layouts= --with-tpls=    --with-options= --with-cuda-options= --no-examples --with-options=enable_large_mem_tests
@kyungjoo-kim
Copy link
Contributor

@ndellingwood I will take a loook at this today.

@kyungjoo-kim
Copy link
Contributor

kyungjoo-kim commented Apr 16, 2020

@ndellingwood

Thanks for providing the reproducers. The three tests shows different failures.

  1. With intel MKL, there is a code path that I tested in the eigen solver. It is work-in-progress and I have some mistakes there. Now kokkoskernels tests TPLs, an error was caught. This is fixed.
  2. On White, it indeeds fails a test and this failure is related with my PR. I am not sure why I didn't get this error before when I use test all sandia script. I will investigate this failure further.
    [ FAILED ] 1 test, listed below:
    [ FAILED ] cuda.batched_scalar_team_vector_qr_double
  3. On Blake, the failure is on my complex simd operations, which is not touched by my PR. If you don't mind, would you confirm that you get the same failures ?
    [ FAILED ] 3 tests, listed below:
    [ FAILED ] openmp.batched_vector_arithmatic_simd_dcomplex3
    [ FAILED ] openmp.batched_vector_arithmatic_simd_dcomplex2
    [ FAILED ] openmp.batched_vector_arithmatic_simd_dcomplex4
    If so, there is something changed in kokkos complex and it triggers these errors. The error comes from a unery operator of "-". I think that there is some easy fix for this. However, could you let me know if there is major changes in kokkos complex ?

@ndellingwood
Copy link
Contributor Author

Thanks for updates @kyungjoo-kim !

The only semi-recent update to Kokkos_Complex involves alignment changes that went in with 3.0.00 (2.9.99): kokkos/kokkos#2259

If it helps on Blake, you can try disabling complex alignment with the cmake option
-DKOKKOS_ENABLE_COMPLEX_ALIGN=OFF

In KokkosKernels, this PR matching a Trilinos PR was merged as part of 3.1,
https://github.com/kokkos/kokkos-kernels/pull/635/files
could the changes in src/batched/KokkosBatched_Vector_SIMD_View.hpp have affected the tests and not shown up if they were disabled in the CMake build system until you enabled them in the PR? I'm not familiar with the batched tests, not sure if the failures you posted touched code in that header

@kyungjoo-kim
Copy link
Contributor

How can I put the cmake option in your reproducer ?

@ndellingwood
Copy link
Contributor Author

How can I put the cmake option in your reproducer ?

Oh, good point it won't work directly with the reproducer instructions, but if you run the cm_generate_makefile script as in the reproducer it will output the CMake lines it uses to configure Kokkos before installing, and then configure lines for kokkos-kernels (requiring the kokkos install first). Copy out the Kokkos CMake configure line and append the cmake option then rerun cmake. That should let you test to see if the complex alignment is triggering the issue.
I don't want to waste your time trying this out unless the simd arith tests were disabled in the CMake build system until your update, the complex changes went into Kokkos awhile ago, and KokkosKernels nightlies test against Kokkos' develop branch so this should have popped up before if the tests were enabled.

@kyungjoo-kim
Copy link
Contributor

kyungjoo-kim commented Apr 16, 2020

I think that this is because of the alignment. When I print out the complex value,

/// this a is printed from input complex scalar
a = 6.813754e-01 -2.122341e-01
/// I store this in an array such as Kokkos::complex a[2]; a[0] = a_scalar; 
/// Then I printed a[0]
a 0.000000e+00 6.813754e-01

So, it looks like that the alignment does not match when I use a stack memory. I don't think that this would be problematic if we use aligned mmory allocations.

BTW, do we have a way to enforce alignment for a stack variable ?

@kyungjoo-kim
Copy link
Contributor

@ndellingwood I think that I can fix this today. I do not know why some errors pop up later after the test is passed. I am very sure that my qr also passes the test yesterday and now it does not pass.

@ndellingwood
Copy link
Contributor Author

do we have a way to enforce alignment for a stack variable ?

Brian added a clever utility that may help:

template<typename InPtr, typename T>
KOKKOS_INLINE_FUNCTION T* alignPtr(InPtr p)
{
//ugly but computationally free and the "right" way to do this in C++
std::uintptr_t ptrVal = reinterpret_cast<std::uintptr_t>(p);
//ptrVal + (align - 1) lands inside the next valid aligned scalar_t,
//and the mask produces the start of that scalar_t.
return reinterpret_cast<T*>((ptrVal + alignof(T) - 1) & (~(alignof(T) - 1)));
}

@ndellingwood
Copy link
Contributor Author

I'm not sure if that utility addresses the type of alignment issue you are seeing though

@kyungjoo-kim
Copy link
Contributor

I take it back. My print statement was wrong....

@kyungjoo-kim
Copy link
Contributor

What I figured out is the error happens when I use aggressive vectorization e.g., ivdep with complex values. This seems to replace some instructions with aligned vector instructions where the stack variable is not aligned. This is not related to kokkos complex alignment as the alignment issue pops out when I use odd number of vector array size e.g., 3. This is a right behavior but I do not know why previously this works. I put a guard to do aggressive vectorization for built-in type only.

@kyungjoo-kim
Copy link
Contributor

@ndellingwood Done.

@ndellingwood
Copy link
Contributor Author

PR #696 addressed the failing nightly tests reported in this issue, though @brian-kelley detected failures with the batched_scalar_team_vector_qr_double test not captured in nightlies, as mentioned in #696 (comment)
A separate issue should be opened to address that and we can determine the gap in nightlies that let it slip through.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants