Several MueLu_ParameterListXXX tests failing in all CUDA builds on 'white', 'ride', 'waterman', and 'vortex' starting 2019-11-23 and other non-CUDA builds starting 2019-12-14 #6361
Comments
What I don't understand is why the Trilinos cuda-9.2 PR build is not also showing these tests as failing. This is something we should look into.
@bartlettroscoe I have a fix being tested locally for the following type of error:
We essentially need to ignore this complexity output because the aggregation process is now non-deterministic on GPU, which leads to slight variations in operator complexities. However, I strongly suspect that the last error:
is of a different nature and unrelated to my changes.
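The fix described above — ignoring the non-deterministic operator-complexity output when diffing against a gold file — could be sketched roughly as follows. This is a hypothetical illustration, not MueLu's actual test harness; all names are made up.

```python
import re

# Lines reporting quantities that are non-deterministic on GPU
# (e.g. operator complexity) should not be compared against the gold file.
NONDETERMINISTIC = re.compile(r"[Oo]perator complexity")

def filtered(lines):
    """Keep only lines whose values should be reproducible across runs."""
    return [ln for ln in lines if not NONDETERMINISTIC.search(ln)]

def outputs_match(actual_lines, gold_lines):
    """Compare test output to the gold file, ignoring non-deterministic lines."""
    return filtered(actual_lines) == filtered(gold_lines)
```

With this filter, two runs that differ only in the reported operator complexity compare equal, while any other diff still fails the test.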
@lucbv, is it possible for the solver to throw:
and still have the test pass? We could search the test output on CDash to see if that output has been seen in passing versions of this test.
…kkos, see issue trilinos#6361 After the Kokkos refactor of aggregation, non-deterministic aggregates are formed. This means that checking operator complexity against a gold file is wrong. This commit implements logic to ignore OperatorComplexity for Kokkos runs of ParameterListInterpreter.
…omplexity Automatically Merged using Trilinos Pull Request AutoTester PR Title: MueLu: ignoring OperatorComplexity in ParameterListInterpreter for kokkos, see issue #6361 PR Author: lucbv
…s:develop' (2a3751d). * trilinos-develop: MueLu: ignoring OperatorComplexity in ParameterListInterpreter for kokkos, see issue trilinos#6361 ML: Default fix ML: More experimental coarsening ML: enabling more experimental maxwell stuff ML: enabling more experimental maxwell stuff ML: enabling more experimental maxwell stuff Xpetra: Re-enabled fast TwoMatrixAdd path
…s:develop' (2a3751d). * trilinos-develop: (50 commits) TSQR: Fix minor build error with CUDA TSQR::Matrix: Simplify nonmember functions TSQR::Combine*::apply_inner now takes MatView instead of a pointer TSQR::Combine*::factor_inner now takes MatView instead of a pointer TSQR::Combine: Remove unneeded factor_first overload TSQR::Combine*::factor_pair now takes MatView instead of a pointer MueLu: ignoring OperatorComplexity in ParameterListInterpreter for kokkos, see issue trilinos#6361 Ifpack2: spelling Ifpack2: Fixing test to only run in parallel Ifpack2: Fixing test Ifpack2: OverlappingRowMatrix cleanup Ifpack2: Adding unit test for 'reduced' matvec for use in s-step methods TSQR::Combine*::factor_first now takes MatView instead of a pointer TSQR: Remove fill_matrix itself TSQR::DistTsqr: Remove uses of fill_matrix TSQR::SequentialCholeskyTsqr: Remove uses of fill_matrix TSQR::SequentialTsqr: Remove uses of fill_matrix TSQR: Remove copy_matrix itself TSQR: Remove all uses of copy_matrix TSQR: Remove more uses of copy_matrix ...
CC: @trilinos/muelu, @srajama1 FYI: As shown in this query (click the "Show Matching Output" link at the top to see the errors) we are still seeing many failures of the tests:
in the builds:
showing errors like:
and
and
and
and
and
and many others like this. In fact, between the dates [2019-10-01, 2019-12-17], there were 377 of these failures across these various builds. Looking at this query, you can see the last time one of these tests failed that did not match the regex.

Something seems to have changed recently that has added many new failures in the 'cee-rhel6', 'sems-rhel6', and 'sems-rhel7' non-CUDA builds. These tests seem to be among the most fragile of all the Trilinos tests if you look back at the last 1.5 years of effort trying to clean up these Trilinos builds and keep them clean. I wonder if this is not a good approach for writing a portable test suite?
Just to be clear, as shown in this query there were 198 failures of these tests across all of these 'cee-rhel6', 'sems-rhel6', 'sems-rhel7', 'waterman', and 'ride' builds in the recent date range [2019-12-01, 2019-12-17], which is well after PR #6364 was merged on 2019-11-26.
@bartlettroscoe I will at least take a look at the failures in the non-CUDA builds, as the test can reasonably be expected to pass for them.
FYI: I changed to |
CC: @trilinos/framework FYI: As shown in this query, the last two post-push CI builds showed the test |
FYI: And the failing test |
@bartlettroscoe looking at the failure it seems that the test machines are randomly adding line breaks to outputs. This is what is leading to the failures we are seeing.
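If stray line breaks really are the only difference, one defensive fix is to normalize whitespace on both sides before comparing. A minimal sketch, assuming the comparison is a plain text diff (illustrative code, not the actual test harness):

```python
def normalize(text: str) -> str:
    """Collapse all runs of whitespace, including stray newlines, into
    single spaces so a line broken mid-token still matches the gold file."""
    return " ".join(text.split())

def outputs_match(actual: str, gold: str) -> bool:
    """Compare outputs after whitespace normalization."""
    return normalize(actual) == normalize(gold)
```

The trade-off is that genuine whitespace differences (e.g. changed table layout) would no longer be caught.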
I agree, it doesn't look like a MueLu bug:
Looking at the recent changes in MueLu I am actually suspecting that PR #6432 is a potential offender for the non-CUDA errors...
@lucbv, tests that read and write files are fragile in general. Is it possible to refactor these tests so that they write into an
@bartlettroscoe that type of refactor is going to be non-trivial and potentially quite long.
Yes, I believe #6432 is to blame. I'll have another look at this next year.
CC: @trilinos/framework
@cgcgcg, then you may want to revert PR #6432 until this can get fixed, because it looks like this has taken out PR testing, as shown in this query which shows this test failing 7 times in testing of 6 different PRs. Not speaking for other developers, but I have a PR that I would like to get merged before 2020, so it might be good to allow this to happen before then by backing out #6432. And then someone should seriously consider refactoring how these
FYI: As shown in this query and this query, failures with the test appear intermittent. Therefore, perhaps people will get lucky and this test may pass in some PR testing iterations, allowing the PRs to get merged.
@bartlettroscoe wrote:
Valid point, but as @lucbv points out, this would require diverting resources. @trilinos/muelu will discuss this in the new year.
@trilinos/framework FYI: As shown in:
you can see that this test is failing randomly in these two builds 25% and 15% of the time, respectively. That explains how the original PR #6432 that introduced this defect was able to be merged to 'develop': the last of the 4 PR testing iterations that ran for PR #6432 had this test passing. This test killed at least one of the PR testing iterations in PR #6452, as shown here.

The Trilinos PR tester is designed to ignore randomly failing tests. And once one of those tests gets onto the 'develop' branch, if that test has a moderately high probability of failing, then it results in mass failures of PR testing iterations.
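A back-of-envelope calculation using the ~25% per-run failure rate and the 4 PR-testing iterations cited above shows why such a test both slips through its own PR testing and then blocks unrelated PRs:

```python
p_fail = 0.25  # observed per-run failure rate of the flaky test

# Chance the offending PR sees at least one fully passing iteration
# within 4 PR-testing attempts, allowing it to merge:
p_eventually_merges = 1 - p_fail ** 4  # 1 - 0.25^4 ~= 0.996

# Once on 'develop', each unrelated PR's single testing iteration is
# killed by the flaky test with the same per-run probability:
p_blocks_other_pr = p_fail  # 0.25
```

So the flaky test almost certainly merges, yet afterwards roughly one in four PR testing iterations fails because of it.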
@bartlettroscoe Feel free to revert. This should take care of the random component.
@bartlettroscoe I have an updated version of PR #6432 that introduced the random failures: PR #6531.
FYI: As shown in this query, lots more failures of the tests:
in the builds on 'vortex':
showing diff errors (click the "Show Matching Output" link at top right) like:
These seem to just be very fragile tests. Can we just turn these tests off in all of the ATDM Trilinos builds? If you look at this query and this query, you can see that these tests pop up in a lot of bug reports. They either need to be rewritten to be more robust or they need to be disabled in the ATDM Trilinos builds. (They can stay in all of the other builds like the PR builds.) Okay?
@bartlettroscoe The @trilinos/muelu team will have a look and make a decision about the tests. Please don't disable them yet.
@jhux2, okay, let me know how that goes.
@bartlettroscoe We have been working on this issue, and we believe that this should be fixed now. Let's monitor this for a week or so before we declare victory?
Okay. FYI: @rmmilewi will be working on a bot that will automatically update issues like this with the status of the related tests (see #3778). That will eliminate the manual work of following up on GitHub issues like this.
FYI: The tests:
are no longer running in ATDM Trilinos 'complex' builds because GEMMA does not use MueLu. Therefore, these tests are not even running in the builds:
@bartlettroscoe Can we close this?
Results yesterday shown here showed all of these tests passing except for timeouts of the test:
We have seen problems with timeouts in other tests in these builds, such as those reported in #6804, #6801, and #6799 (which is likely a problem with running multiple kernels on the same GPU that will hopefully be addressed by #6840). Excluding timeouts, as shown in this query there have been no failing tests. Therefore, since this issue #6361 is about failures and not timeouts, we can close this issue. (I will open a new issue for the timeouts.)
Closing as complete. |
CC: @trilinos/muelu, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52, @lucbv
Next Action Status
Description
As shown in this query the tests:
MueLu_ParameterListInterpreterTpetra_MPI_1
MueLu_ParameterListInterpreterTpetra_MPI_4
MueLu_ParameterListInterpreterTpetraHeavy_MPI_4
in the builds:
Trilinos-atdm-waterman_cuda-9.2_fpic_static_opt
Trilinos-atdm-waterman_cuda-9.2_shared_opt
Trilinos-atdm-waterman-cuda-9.2-debug
Trilinos-atdm-waterman-cuda-9.2-opt
Trilinos-atdm-waterman-cuda-9.2-rdc-release-debug
Trilinos-atdm-waterman-cuda-9.2-release-debug
Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug
Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug
Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug
Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release
Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug
on the machines 'white', 'ride', and 'waterman' started failing every day starting 2019-11-23.
As shown in this query, all of the failing tests show at least one of the following error outputs.
Like here showing:
and here showing:
and here showing:
But note that these tests pass in the 'sems-rhel7-cuda-9.2' builds. Therefore, there is something different about these Power/GPU machines that is triggering these failures.
The new commits that were pulled into these builds on 2019-11-26, when these failures started, are shown, for example, here.
From looking over that set of commits, it seems likely this was triggered by the commits in the PR #6326 from @lucbv (topic branch
MueLu_aggregation_kokkos_fixes
).
Current Status on CDash
NOTE: Click "Previous" to see status for previous testing day.
Steps to Reproduce
One should be able to reproduce this failure on the machine as described in:
More specifically, the commands given for the system 'white' (SON) and 'ride' (SRN) are provided at:
The exact commands to reproduce this issue should be: