Fix the problem in QR. #696
Conversation
As we use random matrices for testing, sometimes bad ones show up. We need to make the matrices easier to solve, and the residual check should be done element-wise so that the failure criterion does not grow with the problem size.
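A minimal sketch of the two changes described above (hypothetical helper names, not the actual KokkosKernels test code): generate a strictly diagonally dominant random matrix, and check residuals entry by entry against a fixed tolerance rather than through a norm that accumulates with the problem size.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <vector>

// Hypothetical sketch: a random n x n matrix stored row-major, with the
// diagonal set so each row is strictly diagonally dominant. Such matrices
// are well conditioned, so a QR test on them will not trip the tolerance
// merely because the random draw happened to be near-singular.
std::vector<double> make_diag_dominant(int n, unsigned seed) {
  std::srand(seed);
  std::vector<double> A(n * n);
  for (int i = 0; i < n; ++i) {
    double off = 0.0;
    for (int j = 0; j < n; ++j) {
      A[i * n + j] = 2.0 * (std::rand() / double(RAND_MAX)) - 1.0;  // in [-1, 1]
      if (j != i) off += std::fabs(A[i * n + j]);
    }
    A[i * n + i] = off + 1.0;  // diagonal strictly exceeds the off-diagonal row sum
  }
  return A;
}

// Element-wise residual check: every entry is compared against the same
// fixed tolerance, so the pass/fail criterion does not scale with n
// (a norm-based check sums n*n contributions and grows with problem size).
bool check_elementwise(const std::vector<double>& resid, double tol) {
  for (double r : resid)
    if (std::fabs(r) > tol) return false;
  return true;
}
```

Setting the diagonal to the off-diagonal row sum plus one guarantees strict dominance regardless of the sign the random draw gave the diagonal entry.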
Looks good, thanks @kyungjoo-kim ! Do you have spot-check or test results to post?
I used your reproducers to check that this resolves the problems. I am running the cm_test_all_sandia scripts on white and blake and will post the output.
I ran the test-all-sandia script on kokkos-dev-2. Almost all compilers are okay except for a few failures that are not related to this PR. I am not sure whether this is a known issue, but I will report it here. The failure is a segfault in
Has anyone experienced the same error?
@brian-kelley are you able to take a look and try reproducing on kokkos-dev-2? @kyungjoo-kim is seeing intermittent seg faults with intel/17.0.1 in
@ndellingwood @kyungjoo-kim Yes, I was able to replicate this (KokkosKernels_common_openmp crashes consistently for me). The CRS sorting stuff was merged in #663 about a month ago, but I only ran the spot-checks for that, so Intel 17 was not used (so I suspect this is a compiler issue). I can still try to debug it though; there might be an actual bug or a workaround.
It was actually crashing inside std::sort when sorting a std::pair with a lambda as a custom comparator. I replaced the pair with a struct and
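The workaround described above can be sketched like this (an illustrative struct, not the exact one from #698): give the element type its own operator< so std::sort needs no lambda comparator at all, which sidesteps the intel/17.0.1 crash.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the CRS (key, value) entries being sorted.
// With operator< defined on the struct, std::sort can be called without a
// custom lambda comparator -- the pattern that crashed under intel/17.0.1.
struct Entry {
  int col;     // sort key, e.g. a CRS column index
  double val;  // payload carried along with the key
  bool operator<(const Entry& rhs) const { return col < rhs.col; }
};
```

With this in place the sort is simply `std::sort(entries.begin(), entries.end());`, no comparator argument required.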
Let us merge Brian's PR first and make sure QR passes all tests before merging. This is the second iteration, so better to be safe.
@kyungjoo-kim @srajama1 My change #698 did fix all the Intel 17.0.1 builds on kokkos-dev2, but I'm seeing a couple new issues:
This is CUDA 9.2.88 CUDA/OpenMP release, so it has UVM enabled. Maybe just a missing fence in the test? I had a bunch of those that I needed to fix. This test passed in all the kokkos-dev2 CUDA builds so far. The warnings for GCC 4.8.4 are in a bunch of places, here are a couple of them:
So I can look into fixing those too.
@kyungjoo-kim can take care of the QR test as part of the other PR. @iyamazaki Can you look at the sptrsv warnings please?
@ndellingwood Why is this merged before spot checks? There are still failures (see above).
I wonder how this is compiled. For SpTRSV, I thought that part of the code is still protected with #if defined(KOKKOSKERNELS_ENABLE_TPL_SUPERLU) && and should not be compiled in? We'll try to address this in another PR (e.g., in PR 680). Please let us know if you figure out more details related to the SpTRSV warnings. Thank you!!
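The guard @iyamazaki mentions works like the following sketch (the function name is hypothetical): when KOKKOSKERNELS_ENABLE_TPL_SUPERLU is not defined at configure time, the SuperLU-specific path should not be compiled at all, so it should produce neither symbols nor warnings.

```cpp
#include <cassert>

// Hypothetical illustration of TPL-guarded compilation: the SuperLU-specific
// code only exists when the macro is defined at configure time; otherwise a
// stub (or nothing at all) is compiled in its place.
#if defined(KOKKOSKERNELS_ENABLE_TPL_SUPERLU)
int sptrsv_superlu_path() { return 1; }  // real TPL-backed path would live here
#else
int sptrsv_superlu_path() { return 0; }  // stub: TPL disabled, nothing from SuperLU compiled
#endif
```

If warnings still appear from that file in a non-TPL build, some of the code must be sitting outside the guard.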
This PR fixed the QR errors reported in #691; @kyungjoo-kim reported that it fixed the #691 failures in his testing when I asked about the spot-check, and it was merged a couple of days ago. I'm not sure why Kyungjoo had not encountered the
@brian-kelley did the
@ndellingwood Your merge appears before even Kyungjoo's spot-check output, which had errors that Brian is looking at. Let us not merge anything before a spot-check unless it is really low impact, like fixing a comment. This set of changes is causing way too many stability issues; if someone reviews us at the end of the year, this will show up at the top.
Kyungjoo confirmed this PR fixed the issues reported in #691; it wasn't clear that spot-check results were going to be posted. I'll hold off on merges unless there are spot-check results on future PRs, but what Kyungjoo reported in the later test results wasn't caused by this PR, and merging this PR addressed the failing nightly tests reported in #691. We need to find the gaps in the nightlies that aren't detecting what Brian and Kyungjoo reported here.
@kyungjoo-kim The batched QR test failed in the same way on the second run, so I don't think it's random (or at least, it happens often). Here is how to reproduce on RIDE (White should be exactly the same, but I haven't actually tried it there).
EDIT: same failure if $CUDA_LAUNCH_BLOCKING and $CUDA_MANAGED_FORCE_DEVICE_ALLOC are both exported as "1", so that's not the issue. The test that fails is "cuda.batched_scalar_team_vector_qr_double", which shouldn't involve complex at all, but the build does enable double and complex_double as scalars. I'm doing another build with just double enabled, to see if it still happens. That could be the hole in the nightlies; not sure.
@iyamazaki The issue with SpTRSV was just that a string literal in a macro didn't have its quotes escaped. I fixed it in #698.
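The bug class Brian describes can be shown with a hypothetical message (not the actual KokkosKernels macro): inner quotes in a string literal must be escaped with backslashes, otherwise the literal terminates early at the inner quote and -Wall/-Werror builds warn or fail at every expansion site of the macro.

```cpp
#include <cassert>
#include <cstring>

// Escaped inner quotes: the literal is one well-formed string.
#define SPTRSV_MSG_OK "sptrsv: unsupported \"mode\" argument"
// Unescaped inner quotes would end the literal at the second '"' and leave
// a stray token behind, e.g. (does not compile):
//   #define SPTRSV_MSG_BAD "sptrsv: unsupported "mode" argument"
```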
@brian-kelley Let me try to reproduce it on white.
@kyungjoo-kim You can remove the
@brian-kelley White is so busy that I cannot grab a Kepler node. Do you also encounter the error on a Pascal node?
@kyungjoo-kim I'm not sure; I didn't try on a Pascal node of RIDE. I didn't see the bug happening on my workstation (skylake + pascal + force_uvm), though.
@kyungjoo-kim @brian-kelley Here's the failure: https://jenkins-son.sandia.gov/job/KokkosKernels_KokkosDev_CLANG7_CUDA9/47/console
Reproducer: ssh kokkos-dev
module load sems-env sems-cmake/3.12.2 kokkos-env kokkos-hwloc/1.10.1/base sems-clang/7.0.1 sems-cuda/9.2
$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-devices=Cuda,OpenMP --arch=Kepler35 --compiler=clang++ --cxxflags="-O3 -Wall -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized " --cxxstandard="11" --ldflags="" --with-cuda --kokkos-path=$KOKKOS_PATH --kokkoskernels-path=$KOKKOSKERNELS_PATH --with-scalars='' --with-ordinals= --with-offsets= --with-layouts= --with-tpls= --with-options=disable_deprecated_code --with-cuda-options= --no-examples
Based on where the test is failing, it seems that cuda/9.2.88, a Cuda_OpenMP build, and a Kepler GPU are the commonalities?
@ndellingwood Thanks. It is good that I can also try this on kokkos-dev-2. I will see whether I can reproduce it there and fix the problem. Sorry for this problem.
@ndellingwood I also observed it on Cuda_Serial on RIDE, but otherwise yes.
@kyungjoo-kim it was on kokkos-dev (the older machine, not kokkos-dev-2); it is important to reproduce there with the Kepler GPUs, as it doesn't seem to occur with Volta (or occurs with low enough frequency that we haven't seen it yet).
@ndellingwood Unfortunately, I cannot reproduce the failure. I ran it multiple times and the test passes. I also cannot access the Jenkins failure link. Would you mind posting a more detailed error message from Jenkins?
@brian-kelley How frequently did you see the failure on RIDE? @ndellingwood says that it fails randomly, but you seem to see the error more often. When we test kokkoskernels, we always test it against the kokkos master branch, right? I just want to match the same environment exactly.
@kyungjoo-kim On RIDE Kepler (rhel7F queue) it happened every run for me. I was testing against the kokkos develop branch.
@kyungjoo-kim nightlies test against kokkos' develop branch. As far as the nightlies go, this seems to be random; it has only failed once since last week on kokkos-dev (which tests on Kepler). Hopefully the White rhel7F queue is available, since @brian-kelley sees this every run there, though I think there may only be 1 Kepler node remaining, so it can get congested. On kokkos-dev, if you test with Brian's reproducer instructions (use
@kyungjoo-kim sure, here's the snip of the test failure:
Edit: Adding Jenkins link
@ndellingwood @kyungjoo-kim I actually only ran on RIDE, since its Kepler queue is much less busy than White's.
Oops, I translated RIDE to White, my mistake.
@brian-kelley Okay. It is tricky. The nightly tests on the White Kepler node do not report the same failure, and kokkos-dev reports random behavior (although I cannot reproduce it), but the test on RIDE consistently reproduces the failure. The only way I can fix this problem is to use RIDE, but I do not have access to that machine. Sigh... Ah, BTW, it only reports that a single entry comparison is wrong. The "2" means that there is a sign change. I think this comes from a trivial 1x1 matrix equation. I can take a look further, but at some point I will need help from @brian-kelley or @ndellingwood, who can access RIDE.
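The sign ambiguity Kyungjoo suspects is easy to see in the 1x1 case: for a scalar matrix [a], both (Q, R) = (+1, a) and (Q, R) = (-1, -a) are valid QR factorizations, so an entry-by-entry comparison against one reference can flag a correct result as a failure. A sign-insensitive checker might look like this (a hypothetical helper, not the test's actual criterion):

```cpp
#include <cassert>
#include <cmath>

// For A = [a] (1x1), any factorization with Q orthogonal (q*q == 1) and
// q*r == a is valid; q = +1 and q = -1 are both admissible, differing only
// by a sign flip in R. This check accepts either sign choice.
bool is_valid_qr_1x1(double a, double q, double r) {
  return std::fabs(q * q - 1.0) < 1e-14 &&  // Q is orthogonal
         std::fabs(q * r - a) < 1e-14;      // Q * R reproduces A
}
```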
The White nightlies are actually running on the Pascal queue due to the congestion issues; I had to also duplicate the Kokkos jobs to run on Pascal because the nightlies on the Kepler queues were aborting.
@ndellingwood Is the failure indeed one failed comparison, or is that your copy-paste of one case? My test code tests 1024 samples. If this indeed happens on a single problem, it means that the same code runs fine in most other cases and fails on a single sample with a sign flip.
@kyungjoo-kim I double checked; it is one comparison failure, and the copy+paste is the full failure message.
@ndellingwood Okay. That helps a lot, and I can reduce the problem scope considerably. @brian-kelley Can I ask you to test my branch on RIDE, as RIDE is the machine that consistently reports the failure?
@kyungjoo-kim Yes, I'll let you know what happens.
@kyungjoo-kim It's not completely done yet, but
@brian-kelley Thanks for letting me know. If this fixes the problem, would you include the modifications in your PR?
@kyungjoo-kim Yes, I'll add it. I'm sure it fixed it, because I ran KokkosKernels_batched_dla_cuda 100 times in a loop (RIDE/Kepler) and they all passed.
@srajama1 @ndellingwood I'm just gonna add float and complex_double to the
@brian-kelley sounds good. Based on the status in that issue, that means the test will begin failing until the kokkos changes go in; does that sound right?
@ndellingwood That's true. Should I just change it back for now and wait for the kokkos fix? I don't want everybody to get spammed with emails.
@brian-kelley how about changing that test back, then creating a duplicate of that test with the float and complex_double coverage (no need to schedule it for a different time since the test is so short), but only adding you and me to the email list for now? We can add others to the list once it begins passing.
@ndellingwood Sounds good.
@ndellingwood OK, that build is set up and it replicated the errors in trilinos/Trilinos#7206, so we'll know when the Kokkos issue gets fixed.
As seen in #691, three failure cases were reported.
With a TPL enabled, the build goes into my test code. That part is work in progress, and it is better not to include it in the test. Anyway, I fixed it.
QR fails. It turns out that some random matrices are badly conditioned and exceed the error threshold. I made the random matrices diagonally dominant and now evaluate correctness element-wise, so that the error does not grow proportionally with the problem size.
Vectorization of kokkos complex. When testing an odd vector length (e.g., 3), the aggressive vectorization with complex fails. It is expected to fail, as the aggressive vectorization does not guarantee correctness yet still issues vector instructions. Somehow it passed the tests until now, and the failure only pops up here. I removed the aggressive vectorization for non-built-in types, i.e., complex.
@ndellingwood This fixes all the problems.