Fix the problem in QR. #696

Merged 3 commits into kokkos:develop on Apr 16, 2020

Conversation

kyungjoo-kim
Contributor

As seen in #691, three failure cases were reported.

  1. With TPL enabled, the test goes into my test code. This part is work in progress and is better left out of the test. Anyway, I fixed it.

  2. QR fails. It turns out that some random matrices are badly conditioned and the error exceeds the threshold. I made the random matrices diagonally dominant and now evaluate correctness element-wise so that the error does not grow with the problem size (see the sketch right after this list).

  3. Vectorization on Kokkos complex. When testing an odd vector length (e.g., 3), the aggressive vectorization with complex fails. This failure is expected, as the vectorization issues vector instructions without regard for correctness. Somehow it has passed the tests so far and only pops up now. I removed the aggressive vectorization for non-built-in types, i.e., complex (a generic sketch of this kind of type gating appears further below, after the commit message).
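
For item 2, here is a minimal standalone sketch of the idea (illustrative only; this is not the actual kokkos-kernels test code): boost the diagonal of the random matrix so it is diagonally dominant and well conditioned, and compare entries one at a time with a per-entry tolerance instead of a norm that grows with the matrix size.

#include <cmath>
#include <cstdio>
#include <limits>
#include <random>
#include <vector>

int main() {
  const int n = 10;
  std::mt19937 gen(0);
  std::uniform_real_distribution<double> dist(0.0, 1.0);

  // Random matrix with entries in [0,1]; adding n to the diagonal makes
  // |a_ii| larger than the sum of the off-diagonal magnitudes in its row.
  std::vector<double> A(n * n);
  for (auto &a : A) a = dist(gen);
  for (int i = 0; i < n; ++i) A[i * n + i] += n;

  // Element-wise residual check: each reconstructed entry (e.g., from Q*R
  // after a factorization, not computed here) is compared against A(i,j)
  // with a per-entry tolerance, instead of accumulating a norm over all
  // entries that would scale with the problem size.
  const double eps = 1.0e3 * std::numeric_limits<double>::epsilon();
  auto check = [&](int i, int j, double reconstructed) {
    const double diff = std::fabs(reconstructed - A[i * n + j]);
    const double tol  = eps * (1.0 + std::fabs(A[i * n + j]));
    if (diff > tol)
      std::printf("entry (%d,%d) fails: diff %e > tol %e\n", i, j, diff, tol);
  };
  check(0, 0, A[0]);  // trivially passes; a real test would pass Q*R entries
  return 0;
}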

@ndellingwood This fixes all the problems.

  As we use random matrices for testing, sometimes bad ones show up.
  We need to make the matrices easier to solve, and the residual check
  should be done element-wise so that it does not grow with the problem
  size.
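
Regarding item 3 above, here is a generic sketch of gating a vectorized code path on built-in arithmetic types (illustrative only; this is not the actual KokkosBatched implementation):

#include <complex>
#include <type_traits>

// Vectorized path: enabled only for built-in arithmetic types (float, double, ...).
// The pragma requires compiling with OpenMP (e.g., -fopenmp); it is shown only
// to illustrate the forced-vectorization path.
template <typename T>
typename std::enable_if<std::is_arithmetic<T>::value>::type
scale(T *x, const T alpha, const int n) {
#pragma omp simd
  for (int i = 0; i < n; ++i) x[i] *= alpha;
}

// Fallback for non-built-in types such as std::complex: a plain loop with no
// forced vector instructions, so correctness does not depend on the vector length.
template <typename T>
typename std::enable_if<!std::is_arithmetic<T>::value>::type
scale(T *x, const T alpha, const int n) {
  for (int i = 0; i < n; ++i) x[i] *= alpha;
}

int main() {
  double d[4] = {1.0, 2.0, 3.0, 4.0};
  std::complex<double> c[3] = {{1.0, 0.0}, {0.0, 1.0}, {1.0, 1.0}};
  scale(d, 2.0, 4);                        // takes the vectorized path
  scale(c, std::complex<double>(2.0), 3);  // takes the scalar fallback
  return 0;
}
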
@kyungjoo-kim self-assigned this Apr 16, 2020
Contributor

@ndellingwood left a comment

Looks good, thanks @kyungjoo-kim ! Do you have spot-check or test results to post?

@kyungjoo-kim
Contributor Author

kyungjoo-kim commented Apr 16, 2020

I used your reproducers to check that this resolves the problems. I am running the cm_test_all_sandia scripts on white and blake and will post the output.

@ndellingwood merged commit 94f2e5a into kokkos:develop Apr 16, 2020
@kyungjoo-kim
Contributor Author

kyungjoo-kim commented Apr 17, 2020

I ran the cm_test_all_sandia script on kokkos-dev-2. Almost all compilers are okay, except for a few failures that are not related to this PR. I am not sure whether this is a known issue, but I am reporting it here. The failure is a segfault in common_openmp and common_serial with Intel 17. I don't think it is deterministic, and I am not sure whether it is compiler specific (maybe it just happens with Intel 17). Sometimes it does not fail at all, sometimes only common_openmp fails, and sometimes both common_openmp and common_serial fail. However, all the failures are caught as a segfault in common_sort_crsgraph.

10: [ RUN      ] serial.common_sort_crsgraph
10/10 Test #10: common_serial ....................***Exception: SegFault  0.38 sec

Has anyone experienced the same error?

#######################################################
PASSED TESTS
#######################################################
cuda-10.0-Cuda_Pthread-release build_time=194 run_time=124
cuda-10.0-Cuda_Serial-release build_time=202 run_time=127
cuda-10.1-Cuda_Pthread-release build_time=197 run_time=124
cuda-10.1-Cuda_Serial-release build_time=203 run_time=127
cuda-9.2-Cuda_Pthread-release build_time=194 run_time=134
cuda-9.2-Cuda_Serial-release build_time=202 run_time=138
intel-15.0.2-OpenMP-release build_time=117 run_time=35
intel-15.0.2-OpenMP_Serial-release build_time=146 run_time=97
intel-15.0.2-Pthread-release build_time=92 run_time=56
intel-15.0.2-Pthread_Serial-release build_time=117 run_time=110
intel-15.0.2-Serial-release build_time=86 run_time=56
intel-16.0.1-OpenMP-release build_time=133 run_time=36
intel-16.0.1-OpenMP_Serial-release build_time=170 run_time=107
intel-16.0.1-Pthread-release build_time=100 run_time=65
intel-16.0.1-Pthread_Serial-release build_time=128 run_time=129
intel-16.0.1-Serial-release build_time=97 run_time=65
intel-17.0.1-Pthread-release build_time=111 run_time=58
intel-17.0.1-Pthread_Serial-release build_time=143 run_time=115
intel-17.0.1-Serial-release build_time=107 run_time=58
intel-18.0.5-OpenMP-release build_time=175 run_time=37
intel-18.0.5-OpenMP_Serial-release build_time=224 run_time=98
intel-18.0.5-Pthread-release build_time=125 run_time=59
intel-18.0.5-Pthread_Serial-release build_time=170 run_time=119
intel-18.0.5-Serial-release build_time=113 run_time=57
intel-19.0.5-OpenMP-release build_time=191 run_time=36
intel-19.0.5-OpenMP_Serial-release build_time=229 run_time=98
intel-19.0.5-Pthread-release build_time=202 run_time=57
intel-19.0.5-Pthread_Serial-release build_time=232 run_time=119
intel-19.0.5-Serial-release build_time=190 run_time=59
#######################################################
FAILED TESTS
#######################################################
intel-17.0.1-OpenMP-release (test failed)
#######################################################
[kyukim @kokkos-dev-2] test-all > ../../../kyukim/scripts/cm_test_all_sandia cuda intel 
Running on machine: kokkos-dev-2
KokkosKernels Repository Status:  3f245f02bbdd032ecb02b9be8dc8d8dc5304e5bf Merge branch 'qr-develop' of https://github.com/kyungjoo-kim/kokkos-kernels into qr-develop

Kokkos Repository Status:  7c1ab6cc7066d6b7efecdfd8cfc54482cae65caf Merge pull request #2957 from aprokop/install_trilinos_version

...
5/5 Test #5: common_openmp ....................***Exception: SegFault  0.21 sec
...
The following tests FAILED:
	  5 - common_openmp (SEGFAULT)
Errors while running CTest
  FAILED intel-17.0.1-OpenMP_Serial-release
The following tests FAILED:
	  5 - common_openmp (SEGFAULT)
	 10 - common_serial (SEGFAULT)
Errors while running CTest

@ndellingwood
Contributor

Sometimes it does not fail at all, sometimes only common_openmp fails, and sometimes both common_openmp and common_serial fail. However, all the failures are caught as a segfault in common_sort_crsgraph.

@brian-kelley are you able to take a look and try reproducing on kokkos-dev-2? @kyungjoo-kim is seeing intermittent seg faults with intel/17.0.1 in common_sort_crsgraph added in PR #663

@brian-kelley
Contributor

@ndellingwood @kyungjoo-kim Yes, I was able to replicate this (KokkosKernels_common_openmp crashes consistently for me). The CRS sorting stuff was merged in #663 about a month ago, but I only ran the spot-checks for that so Intel 17 was not used (so I suspect this is a compiler issue). I can still try to debug it though, there might be an actual bug or a workaround.

@brian-kelley
Contributor

It was actually crashing inside std::sort, when sorting a std::pair with a lambda as a custom comparator. I replaced the pair with a struct and operator<, and now it doesn't crash. I'll put in a PR with the workaround but it doesn't block this from being merged.
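
A minimal sketch of that workaround pattern (the names here are illustrative, not the actual kokkos-kernels code):

#include <algorithm>
#include <cstdio>
#include <vector>

// Sort CRS-style (column, value) entries by column index with a struct that
// defines operator<, instead of std::pair plus a lambda comparator.
struct Entry {
  int col;
  double val;
  bool operator<(const Entry &rhs) const { return col < rhs.col; }
};

int main() {
  std::vector<Entry> row = {{3, 0.5}, {1, 2.0}, {2, -1.0}};
  std::sort(row.begin(), row.end());  // uses Entry::operator<, no lambda needed
  for (const Entry &e : row) std::printf("(%d, %g) ", e.col, e.val);
  std::printf("\n");
  return 0;
}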

@srajama1
Contributor

Let us merge Brian's PR first and make sure QR passes all tests before merging. This is the second iteration, so better to be safe.

@brian-kelley
Contributor

@kyungjoo-kim @srajama1 My change #698 did fix all the Intel 17.0.1 builds on kokkos-dev2, but I'm seeing a couple of new issues:

2: [ RUN      ] cuda.batched_scalar_team_vector_qr_double
2: /ascldap/users/bmkelle/StdSortWorkaround-testing-1587158991/kokkos-kernels/unit_test/../test_common/KokkosKernels_TestUtils.hpp:88: Failure
2: The difference between double(AT1::abs(val1)) and double(AT2::abs(val2)) is 2, which exceeds double(AT3::abs(tol)), where 
2: double(AT1::abs(val1)) evaluates to 2, 
2: double(AT2::abs(val2)) evaluates to 0, and
2: double(AT3::abs(tol)) evaluates to 2.2204460492503131e-13.

This is CUDA 9.2.88 CUDA/OpenMP release, so it has UVM enabled. Maybe just a missing fence in the test? I had a bunch of those that I needed to fix. This test passed in all the kokkos-dev2 CUDA builds so far.
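
For reference, a hedged sketch of the missing-fence pattern suspected here (illustrative only; it assumes a CUDA build with UVM available, and is not the actual unit test):

#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char *argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1024;
    // UVM allocation: both host and device can touch the data.
    Kokkos::View<double *, Kokkos::CudaUVMSpace> x("x", n);
    Kokkos::parallel_for(
        "fill", n, KOKKOS_LAMBDA(const int i) { x(i) = 2.0 * i; });
    // Device kernels launch asynchronously; without this fence the host loop
    // below may read x before the kernel has finished writing it.
    Kokkos::fence();
    int errors = 0;
    for (int i = 0; i < n; ++i)
      if (x(i) != 2.0 * i) ++errors;
    std::printf("errors: %d\n", errors);
  }
  Kokkos::finalize();
  return 0;
}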

The GCC 4.8.4 warnings appear in a bunch of places; here are a couple of them:

/ascldap/users/bmkelle/StdSortWorkaround-fulltest-1587158944/kokkos-kernels/perf_test/sparse/KokkosSparse_sptrsv_superlu.cpp:705:14: error: invalid suffix on literal; C++11 requires a space between literal and identifier [-Werror=literal-suffix]
       #error "Invalid type specified in KOKKOSKERNELS_SCALARS, supported types are "double,complex<double>""
              ^
                 from /ascldap/users/bmkelle/StdSortWorkaround-fulltest-1587158944/Testing/TestAll_2020-04-17_14.29.10/gcc/4.8.4/OpenMP-release/src/impl/generated_specializations_cpp/trsm/Blas3_trsm_eti_DOUBLE_LAYOUTLEFT_EXECSPACE_OPENMP_MEMSPACE_HOSTSPACE.cpp:48:
/ascldap/users/bmkelle/StdSortWorkaround-fulltest-1587158944/kokkos-kernels/src/batched/KokkosBatched_Vector_SIMD.hpp:49:0: error: ignoring #pragma omp simd [-Werror=unknown-pragmas]
 #pragma omp simd 

So I can look into fixing those too.

@srajama1
Contributor

@kyungjoo-kim can take care of the QR test as part of the other PR.

@iyamazaki Can you look at the sptrsv warnings, please?

@srajama1
Contributor

@ndellingwood Why was this merged before the spot-checks? There are still failures (see above).

@iyamazaki
Contributor

I wonder how this is getting compiled. For SpTRSV, I thought that part of the code is still protected by

#if defined(KOKKOSKERNELS_ENABLE_TPL_SUPERLU) && \
    defined(KOKKOSKERNELS_ENABLE_SUPERNODAL_SPTRSV)

and should not be compiled in?

We'll try to address this in another PR (e.g., in PR 680). Please let us know if you find more details related to the SpTRSV warnings. Thank you!!

@ndellingwood
Contributor

@ndellingwood Why was this merged before the spot-checks? There are still failures (see above).

This PR fixed the QR errors reported in #691. When I asked about the spot-check, @kyungjoo-kim reported that the PR fixed the #691 failures in his testing, and it was merged a couple of days ago. I'm not sure why Kyungjoo had not encountered the batched_scalar_team_vector_qr_double failure Brian later reported; I merged based on the report that this PR addressed the QR failures, but that test slipped through. Aside from that test, the other errors reported above are not due to any changes in this PR; they are pre-existing errors from past merges and shouldn't block this PR. Separate issues should be opened to track them.

@ndellingwood
Contributor

This is CUDA 9.2.88 CUDA/OpenMP release, so it has UVM enabled. Maybe just a missing fence in the test? I had a bunch of those that I needed to fix. This test passed in all the kokkos-dev2 CUDA builds so far.

@brian-kelley did the batched_scalar_team_vector_qr_double test fail consistently for you on Ride? The White nightlies testing cuda/9.2 with Cuda_OpenMP began passing again after merge of this PR (all the nightlies reported in #691 resumed passing after the merge). Could you post reproducer instructions in a new issue, including the queue and options passed to cm_test_all_sandia (or other instructions if you encountered this in a different build)?

@srajama1
Contributor

@ndellingwood Your merge appears even before Kyungjoo's spot-check output, which had errors that Brian is looking at. Let us not merge anything before the spot-check unless it is really low impact, like fixing a comment. This set of changes is causing so many stability issues that if someone reviews us at the end of the year, this will show up at the top.

@ndellingwood
Contributor

Your merge appears even before Kyungjoo's spot-check output, which had errors that Brian is looking at

Kyungjoo confirmed this PR fixed the issues reported in #691; it wasn't clear that spot-check results were going to be posted. I'll hold off on merges unless there are spot-check results on future PRs, but what Kyungjoo reported in the later test results wasn't caused by this PR, and merging this PR addressed the failing nightly tests reported in #691. We need to find the gaps in the nightlies that aren't detecting what Brian and Kyungjoo reported here.

@brian-kelley
Contributor

brian-kelley commented Apr 20, 2020

@kyungjoo-kim The batched QR failed in the same way on the second run, so I don't think it's random (or at least, it happens often). Here is how to reproduce on RIDE (White should be exactly the same, but I haven't actually tried it there).

module load cmake/3.12.3 cuda/9.2.88 gcc/7.2.0 ibm/xl/16.1.0
# Run on rhel7F queue (Power8 + Kepler K80)
bsub -Is -q rhel7F bash
export KOKKOS_SRC=...
export KOKKOSKERNELS_SRC=...
$KOKKOSKERNELS_SRC/cm_generate_makefile.bash --with-devices=Cuda,Serial --arch=Power8,Kepler37 --compiler=$KOKKOS_SRC/bin/nvcc_wrapper --cxxflags="-O3 -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized " --cxxstandard="11" --ldflags="" --with-cuda=/home/projects/ppc64le-pwr8-nvidia/cuda/9.2.88 --kokkos-path=$KOKKOS_SRC --kokkoskernels-path=$KOKKOSKERNELS_SRC --with-scalars='double,complex_double' --with-ordinals= --with-offsets= --with-layouts= --with-tpls=    --with-options= --with-cuda-options=force_uvm --no-examples --with-options=enable_large_mem_tests

make -j32

cd unit_test
# make sure CUDA_MANAGED_FORCE_DEVICE_ALLOC and CUDA_LAUNCH_BLOCKING
# are not defined, or are defined to "0"
./KokkosKernels_batched_dla_cuda

EDIT: same failure if $CUDA_LAUNCH_BLOCKING and $CUDA_MANAGED_FORCE_DEVICE_ALLOC both exported to "1", so that's not the issue.

The test that fails is "cuda.batched_scalar_team_vector_qr_double" which shouldn't involve complex at all, but the build does enable double and complex_double as scalars. I'm doing another build with just double enabled, to see if it still happens. That could be the hole in nightlies, not sure.

@brian-kelley
Contributor

@iyamazaki The issue with SPTRSV was just that a string literal in a macro didn't have quotes escaped. I fixed it in #698.
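
For context, a minimal sketch of that kind of fix (the exact wording in #698 may differ): escape the inner quotes so GCC 4.8 sees a single string literal instead of a literal followed by an identifier.

// Before: the string literal ends at `are "` and the following `double...`
// is parsed as a (C++11 user-defined) literal suffix, which triggers
// -Werror=literal-suffix on GCC 4.8:
//   #error "Invalid type specified in KOKKOSKERNELS_SCALARS, supported types are "double,complex<double>""
// After: escaping the inner quotes keeps the whole message in one literal.
#if 0  /* guarded so this translation unit still compiles */
#error "Invalid type specified in KOKKOSKERNELS_SCALARS, supported types are \"double, complex<double>\""
#endif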

@kyungjoo-kim
Contributor Author

@brian-kelley Let me try to reproduce it on white.

@brian-kelley
Contributor

@kyungjoo-kim You can remove the --with-scalars='double,complex_double', since I just checked the double-only build and it was still reproduced.

@kyungjoo-kim
Contributor Author

@brian-kelley White is so busy that I cannot grab a Kepler node. Do you also encounter the error on a Pascal node?

@brian-kelley
Contributor

@kyungjoo-kim I'm not sure; I didn't try on a Pascal node of RIDE. I didn't see the bug happening on my workstation (Skylake + Pascal + force_uvm), though.

@ndellingwood
Contributor

@kyungjoo-kim @brian-kelley the cuda.batched_scalar_team_vector_qr_double had its first failure in the nightlies on kokkos-dev Sat night; it appears to be random, as the same nightly test resumed passing last night.

Here's the failure: https://jenkins-son.sandia.gov/job/KokkosKernels_KokkosDev_CLANG7_CUDA9/47/console

Reproducer:

ssh kokkos-dev

module load sems-env sems-cmake/3.12.2 kokkos-env kokkos-hwloc/1.10.1/base sems-clang/7.0.1 sems-cuda/9.2

$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-devices=Cuda,OpenMP --arch=Kepler35 --compiler=clang++ --cxxflags="-O3 -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized " --cxxstandard="11" --ldflags="" --with-cuda --kokkos-path=$KOKKOS_PATH --kokkoskernels-path=$KOKKOSKERNELS_PATH --with-scalars='' --with-ordinals= --with-offsets= --with-layouts= --with-tpls=    --with-options=disable_deprecated_code --with-cuda-options= --no-examples 

Based on where the test is failing, it seems that cuda/9.2.88, a Cuda_OpenMP build, and a Kepler GPU are the commonalities?

@kyungjoo-kim
Contributor Author

@ndellingwood Thanks. It is good that I may also be able to reproduce it on kokkos-dev-2. I will see whether I can reproduce it there and fix the problem. Sorry for this problem.

@brian-kelley
Contributor

@ndellingwood I also observed it on Cuda_Serial on RIDE, but otherwise yes.

@ndellingwood
Contributor

@kyungjoo-kim it was on kokkos-dev (the older machine, not kokkos-dev-2). It is important to reproduce there with the Kepler GPUs, as it doesn't seem to occur with Volta (or occurs with low enough frequency that we haven't seen it yet).

@kyungjoo-kim
Contributor Author

@ndellingwood Unfortunately, I cannot reproduce the failure. I ran it multiple times and the test passes. I also cannot access the Jenkins failure link. Would you mind posting a more detailed error message from Jenkins?

@kyungjoo-kim
Contributor Author

@brian-kelley How frequently did you see the failure on RIDE? @ndellingwood says that it fails randomly, but you seem to see the error more often. When we test kokkos-kernels, we always test against the kokkos master branch, right? I just want to match the same environment exactly.

@brian-kelley
Contributor

@kyungjoo-kim On RIDE Kepler (rhel7F queue) it happened every run for me. I was testing against kokkos develop branch.

@ndellingwood
Contributor

When we test kokkos-kernels, we always test against the kokkos master branch, right?

@kyungjoo-kim the nightlies test against kokkos' develop branch. As far as the nightlies go, this seems to be random; it has only failed once since last week on kokkos-dev (which tests on Kepler). Hopefully the White rhel7F queue is available, since @brian-kelley sees this every run there, though I think there may be only one Kepler node remaining, so it can get congested. On kokkos-dev, if you test with Brian's reproducer instructions (use --arch=Kepler35), hopefully it shows up consistently there as well?

@ndellingwood
Contributor

ndellingwood commented Apr 20, 2020

Would you mind posting a more detailed error message from Jenkins?

@kyungjoo-kim sure, here's the snip of the test failure:

04:35:09 2: [ RUN      ] cuda.batched_scalar_team_vector_qr_double
04:35:09 2: /home/jenkins/slave/workspace/KokkosKernels_KokkosDev_CLANG7_CUDA9/kokkos-kernels/unit_test/../test_common/KokkosKernels_TestUtils.hpp:88: Failure
04:35:09 2: The difference between double(AT1::abs(val1)) and double(AT2::abs(val2)) is 2, which exceeds double(AT3::abs(tol)), where
04:35:09 2: double(AT1::abs(val1)) evaluates to 2,
04:35:09 2: double(AT2::abs(val2)) evaluates to 0, and
04:35:09 2: double(AT3::abs(tol)) evaluates to 2.2204460492503131e-13.
04:35:09 2: [  FAILED  ] cuda.batched_scalar_team_vector_qr_double (37 ms)
04:35:09 2: [----------] 48 tests from cuda (4911 ms total)
04:35:09 2: 
04:35:09 2: [----------] Global test environment tear-down
04:35:09 2: [==========] 48 tests from 1 test case ran. (4911 ms total)
04:35:09 2: [  PASSED  ] 47 tests.
04:35:09 2: [  FAILED  ] 1 test, listed below:
04:35:09 2: [  FAILED  ] cuda.batched_scalar_team_vector_qr_double
04:35:09 2: 
04:35:09 2:  1 FAILED TEST
04:35:09  2/10 Test  #2: batched_dla_cuda .................***Failed   10.10 sec

Edit: Adding Jenkins link

@brian-kelley
Contributor

@ndellingwood @kyungjoo-kim I actually only ran on RIDE, since its Kepler queue is much less busy than White's.

@ndellingwood
Contributor

Oops, I translated RIDE to White, my mistake.

@kyungjoo-kim
Contributor Author

@brian-kelley Okay. It is tricky. The nightly tests on the White Kepler node do not report the same failure, and kokkos-dev shows random behavior (although I cannot reproduce it myself), but the test on RIDE consistently reproduces the failure. The only way I can fix this problem is to use RIDE, but I do not have access to the machine. Sigh....

Ah... BTW, it only reports a single entry comparison as wrong. The "2" means that there is a sign change. I think this comes from a trivial 1x1 matrix equation. I can take a further look, but at some point I will need help from @brian-kelley or @ndellingwood, who can access RIDE.

@ndellingwood
Contributor

The White nightlies are actually running on the Pascal queue due to the congestion issues; I had to duplicate the Kokkos jobs to run on Pascal as well, because the nightlies on the Kepler queues were aborting.

@kyungjoo-kim
Contributor Author

@ndellingwood Is the failure indeed a single comparison failure, or is that just a copy-paste of one case? My test code tests 1024 samples. If this indeed happens on a single problem, it means that the same code runs fine in most cases and fails on a single sample with a sign flip.

@ndellingwood
Contributor

@kyungjoo-kim I double checked: it is one comparison failure, and the copy+paste is the full failure message.

@kyungjoo-kim
Contributor Author

kyungjoo-kim commented Apr 20, 2020

@ndellingwood Okay. That helps a lot; I can narrow the problem scope considerably.

@brian-kelley Can I ask you to test my branch on RIDE, since RIDE is the machine that consistently reports the failure?

[kyukim @kokkos-dev] master > git remote -v 
origin	https://github.com/kyungjoo-kim/kokkos-kernels.git (fetch)
origin	https://github.com/kyungjoo-kim/kokkos-kernels.git (push)
[kyukim @kokkos-dev] master > git branch 
  master
* qr-develop
commit 03c27cb37c409110d7c49b39b62578e08314bc6c
Author: Kyungjoo Kim <kyukim@kokkos-dev.sandia.gov>
Date:   Mon Apr 20 12:34:47 2020 -0600

    KokkosBatched - remove some meaningless fence

commit df3369c8b6307fb899d6c5ab6e14316f638c9561
Author: Kyungjoo Kim <kyukim@kokkos-dev.sandia.gov>
Date:   Mon Apr 20 12:32:30 2020 -0600

    KokkosBatched - single might be required

@brian-kelley
Contributor

@kyungjoo-kim Yes, I'll let you know what happens.

@brian-kelley
Contributor

@kyungjoo-kim It's not completely done yet, but cuda-9.2.88-Cuda_OpenMP-release and cuda-9.2.88-Cuda_Serial-release have both passed so far, so I think you fixed it. I'll go in and run those unit tests in a loop to make sure.

@kyungjoo-kim
Contributor Author

@brian-kelley Thanks for letting me know. If this fixes the problem, would you include the modifications in your PR?

@brian-kelley
Contributor

@kyungjoo-kim Yes, I'll add it. I'm sure it fixed it because I ran KokkosKernels_batched_dla_cuda 100 times in a loop (RIDE/kepler) and they all passed.

@brian-kelley
Contributor

@srajama1 @ndellingwood I'm just gonna add float and complex_double to the KokkosKernels_KokkosDev2_CUDA10_1 build on SON jenkins, since that build only takes 6 minutes now, and this would have caught trilinos/Trilinos#7206 earlier.

@ndellingwood
Contributor

@brian-kelley sounds good. Based on the status in that issue, the test will keep failing until the kokkos changes go in; does that sound right?

@brian-kelley
Contributor

@ndellingwood That's true, should I just change it back for now and wait for the kokkos fix? I don't want everybody to get spammed with emails.

@ndellingwood
Contributor

should I just change it back for now and wait for the kokkos fix

@brian-kelley how about changing that test back, then creating a duplicate of that test with the float and complex_double coverage (no need to schedule it for a different time since the test is so short), but only adding you and me to the email list for now. We can add others to the list once it begins passing.

@brian-kelley
Contributor

@ndellingwood Sounds good.

@brian-kelley
Contributor

@ndellingwood OK, that build is set up and replicated the errors in trilinos/Trilinos#7206, so we'll know when the Kokkos issue gets fixed.
