
TpetraCore_BlockCrsMatrix_MPI_4 failing in ATDM cuda builds #4257

Closed
fryeguy52 opened this issue Jan 24, 2019 · 15 comments
Assignees
kyungjoo-kim
Labels
  • ATDM Sev: Blocker (Problems that make Trilinos unfit to be adopted by one or more ATDM APPs)
  • client: ATDM (Any issue primarily impacting the ATDM project)
  • PA: Data Services (Issues that fall under the Trilinos Data Services Product Area)
  • pkg: Tpetra
  • type: bug (The primary issue is a bug in Trilinos code or tests)

Comments

@fryeguy52
Contributor

fryeguy52 commented Jan 24, 2019

CC: @trilinos/tpetra, @kddevin (Trilinos Data Services Product Lead), @bartlettroscoe, @fryeguy52

Next Action Status

With the merge of PR #4307 onto 'develop' on 2/4/2019, the test TpetraCore_BlockCrsMatrix_MPI_4 appears to be passing in all of the ATDM Trilinos builds as of 2/5/2019. Next: Get PR #4326 merged, which re-enables this test in the Trilinos CUDA PR build ...

Description

As shown in this query the test:

  • TpetraCore_BlockCrsMatrix_MPI_4

is failing in the builds:

  • Trilinos-atdm-waterman-cuda-9.2-debug
  • Trilinos-atdm-waterman-cuda-9.2-opt
  • Trilinos-atdm-waterman-cuda-9.2-release-debug
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug

It is failing with the following output:

p=0: *** Caught standard std::exception of type 'std::logic_error' :
 
  /home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug/SRC_AND_BUILD/Trilinos/packages/tpetra/core/src/Tpetra_Experimental_BlockCrsMatrix_def.hpp:2825:
  
  Throw number = 1
  
  Throw test that evaluated to true: numBytesOut != numBytes
  
  unpackRow: numBytesOut = 4 != numBytes = 156.
 [FAILED]  (0.0877 sec) BlockCrsMatrix_double_int_int_Kokkos_Compat_KokkosCudaWrapperNode_write_UnitTest
 Location: /home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug/SRC_AND_BUILD/Trilinos/packages/tpetra/core/test/Block/BlockCrsMatrix.cpp:859
 
[white23:102556] *** An error occurred in MPI_Allreduce
[white23:102556] *** reported by process [231079937,0]
[white23:102556] *** on communicator MPI_COMM_WORLD
[white23:102556] *** MPI_ERR_OTHER: known error not in list
[white23:102556] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[white23:102556] ***    and potentially your MPI job)
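
For reference, the "Throw test" above is a size-consistency check during row unpacking: the number of bytes actually unpacked for a row is compared against the number of bytes the packed buffer claims the row occupies, and a std::logic_error is thrown on mismatch. Below is a minimal standalone sketch of that kind of check (illustrative only; unpackRowSketch and its packed-row layout are hypothetical and are not Tpetra's actual API):

// Illustrative sketch only: not Tpetra's actual unpackRow, just the kind of
// size-consistency check that produces the logic_error quoted above.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <vector>

// Hypothetical layout: each packed row starts with a 4-byte entry count,
// followed by (index, value) pairs.  'numBytes' is the size the sender
// reported for this row.
std::size_t unpackRowSketch (const char imports[], const std::size_t numBytes)
{
  std::int32_t numEnt = 0;
  std::memcpy (&numEnt, imports, sizeof (numEnt));
  std::size_t numBytesOut = sizeof (numEnt);

  // ... unpack numEnt (index, value) pairs here (details omitted) ...
  numBytesOut += static_cast<std::size_t> (numEnt) *
    (sizeof (std::int32_t) + sizeof (double));

  // Consistency check: if the incoming buffer holds zeros (e.g., stale host
  // data), numEnt reads as 0, numBytesOut stays at 4, and this throws.
  if (numBytesOut != numBytes) {
    std::ostringstream os;
    os << "unpackRow: numBytesOut = " << numBytesOut
       << " != numBytes = " << numBytes << ".";
    throw std::logic_error (os.str ());
  }
  return numBytesOut;
}

int main ()
{
  const std::vector<char> imports (156, 0);  // an all-zero 156-byte packed row
  try {
    unpackRowSketch (imports.data (), imports.size ());
  }
  catch (const std::logic_error& e) {
    std::cerr << "Caught std::logic_error: " << e.what () << std::endl;
  }
  return 0;
}

A tiny numBytesOut (4) versus the expected numBytes (156) is what one would see if the unpack routine reads an entry count of zero out of the incoming buffer.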

@kyungjoo-kim can you see if one of these commits may have caused this?

47f9cbe:  Tpetra - fix failing test
Author: Kyungjoo Kim (-EXP) <kyukim@bread.sandia.gov>
Date:   Tue Jan 22 11:24:43 2019 -0700

M	packages/tpetra/core/src/Tpetra_Experimental_BlockCrsMatrix_def.hpp

3e26a55:  Tpetra - fix warning error from mismatched virtual functions
Author: Kyungjoo Kim (-EXP) <kyukim@bread.sandia.gov>
Date:   Mon Jan 21 11:48:32 2019 -0700

M	packages/tpetra/core/src/Tpetra_Experimental_BlockCrsMatrix_decl.hpp
M	packages/tpetra/core/src/Tpetra_Experimental_BlockCrsMatrix_def.hpp

Current Status on CDash

The current status of these tests/builds for the current testing day can be found here

Steps to Reproduce

One should be able to reproduce this failure on ride or white as described in:

More specifically, the commands given for ride or white are provided at:

The exact commands to reproduce this issue should be:

$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release
$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Tpetra=ON \
 $TRILINOS_DIR
$ make NP=16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16
@fryeguy52 fryeguy52 added the type: bug, pkg: Tpetra, client: ATDM, ATDM Sev: Blocker, and PA: Data Services labels on Jan 24, 2019
@kyungjoo-kim
Contributor

@fryeguy52 I will look at this problem today.

@kyungjoo-kim kyungjoo-kim self-assigned this Jan 24, 2019
@kyungjoo-kim
Contributor

@fryeguy52 I followed the instructions to reproduce the error, but I could not reproduce it on white. Could you double-check whether you can reproduce it? I am using:

[kyukim @white11] master > git remote -v 
origin	https://github.com/trilinos/Trilinos.git (fetch)
origin	https://github.com/trilinos/Trilinos.git (push)
[kyukim @white11] master > git branch 
* develop
  master
[kyukim @white11] master > git log 
commit 01fb63caf88db53491f12afe0497c9d8f2cde09f
Merge: 4c4ecbf 3f3b1cc
Author: Mark Hoemmen <mhoemmen@users.noreply.github.com>
Date:   Mon Jan 21 13:12:15 2019 -0700

    Merge pull request #4224 from trilinos/Fix-4220
    
    MiniTensor: Attempt to fix #4220

This is the output from white:

[kyukim @white11] atdm >  bsub -x -Is -q rhel7F -n 16 ctest -j16
***Forced exclusive execution
Job <43094> is submitted to queue <rhel7F>.
<<Waiting for dispatch ...>>
<<Starting on white22>>
Test project /ascldap/users/kyukim/Work/lib/trilinos/build/white/atdm
        Start   1: TpetraCore_Behavior_Default_MPI_4
        Start   2: TpetraCore_Behavior_Named_MPI_4
        Start   3: TpetraCore_Behavior_Off_MPI_4
        Start   4: TpetraCore_Behavior_On_MPI_4
  1/194 Test   #1: TpetraCore_Behavior_Default_MPI_4 ...........................................................   Passed    0.91 sec
        Start   5: TpetraCore_gemv_MPI_1
        Start   6: TpetraCore_gemm_m_eq_1_MPI_1
        Start   7: TpetraCore_gemm_m_eq_2_MPI_1
        Start   8: TpetraCore_gemm_m_eq_5_MPI_1
  2/194 Test   #2: TpetraCore_Behavior_Named_MPI_4 .............................................................   Passed    0.92 sec
        Start   9: TpetraCore_gemm_m_eq_13_MPI_1
        Start  11: TpetraCore_BlockMultiVector2_MPI_1
        Start  14: TpetraCore_BlockView_MPI_1
        Start  15: TpetraCore_BlockOps_MPI_1
  3/194 Test   #3: TpetraCore_Behavior_Off_MPI_4 ...............................................................   Passed    0.93 sec
        Start  10: TpetraCore_BlockMultiVector_MPI_4
  4/194 Test   #4: TpetraCore_Behavior_On_MPI_4 ................................................................   Passed    0.96 sec
        Start  12: TpetraCore_BlockCrsMatrix_MPI_4
  5/194 Test  #15: TpetraCore_BlockOps_MPI_1 ...................................................................   Passed    1.39 sec
        Start  16: TpetraCore_BlockExpNamespace_MPI_1
  6/194 Test  #14: TpetraCore_BlockView_MPI_1 ..................................................................   Passed    2.53 sec
        Start  31: TpetraCore_Core_initialize_where_tpetra_initializes_kokkos_MPI_1
  7/194 Test  #11: TpetraCore_BlockMultiVector2_MPI_1 ..........................................................   Passed    2.83 sec
        Start  32: TpetraCore_Core_ScopeGuard_where_tpetra_initializes_kokkos_MPI_1
  8/194 Test  #10: TpetraCore_BlockMultiVector_MPI_4 ...........................................................   Passed    2.82 sec
        Start  13: TpetraCore_BlockMap_MPI_4
  9/194 Test  #16: TpetraCore_BlockExpNamespace_MPI_1 ..........................................................   Passed    1.63 sec
        Start  33: TpetraCore_Core_initialize_where_user_initializes_kokkos_MPI_1
 10/194 Test  #31: TpetraCore_Core_initialize_where_tpetra_initializes_kokkos_MPI_1 ............................   Passed    0.81 sec
        Start  34: TpetraCore_Core_ScopeGuard_where_user_initializes_kokkos_MPI_1
 11/194 Test  #33: TpetraCore_Core_initialize_where_user_initializes_kokkos_MPI_1 ..............................   Passed    1.17 sec
        Start  39: TpetraCore_issue_434_already_initialized_MPI_1
 12/194 Test  #12: TpetraCore_BlockCrsMatrix_MPI_4 .............................................................   Passed    4.19 sec
...
100% tests passed, 0 tests failed out of 194

Subproject Time Summary:
Tpetra    = 1492.44 sec*proc (194 tests)

Total Test time (real) =  94.37 sec

@fryeguy52
Contributor Author

@kyungjoo-kim Thanks for looking into this. I will try to reproduce it and watch what it does in tonight's testing.

@bartlettroscoe
Member

@kyungjoo-kim and @fryeguy52,

I just logged onto 'white' quickly and pulled Trilinos 'develop' as of commit 3ef91e9:

3ef91e9 "Merge Pull Request #4253 from trilinos/Trilinos/Fix-4234"
Author: trilinos-autotester <trilinos-autotester@trilinos.org>
Date:   Thu Jan 24 08:15:36 2019 -0700 (7 hours ago)

and following the instructions here I ran:

$ bsub -x -I -q rhel7F -n 16 \
  ./checkin-test-atdm.sh cuda-9.2-gnu-7.2.0-release-debug --enable-packages=TpetraCore --local-do-all 

and it returned:

FAILED (NOT READY TO PUSH): Trilinos: white26

Thu Jan 24 15:06:51 MST 2019

Enabled Packages: TpetraCore

Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED_PT_OPENMP => Test case MPI_RELEASE_DEBUG_SHARED_PT_OPENMP was not run! => Does not affect push readiness! (-1.00 min)
1) cuda-9.2-gnu-7.2.0-release-debug => FAILED: passed=193,notpassed=1 => Not ready to push! (16.65 min)


REQUESTED ACTIONS: FAILED

The detailed test results showed:

$ grep -A 100 "failed out of" cuda-9.2-gnu-7.2.0-release-debug/ctest.out 
99% tests passed, 1 tests failed out of 194

Subproject Time Summary:
Tpetra    = 1094.70 sec*proc (194 tests)

Total Test time (real) = 139.04 sec

The following tests FAILED:
         12 - TpetraCore_BlockCrsMatrix_MPI_4 (Failed)
Errors while running CTest

@kyungjoo-kim
Contributor

I pulled again and tested, but I cannot reproduce the error with commit 3ea64d1.

@bartlettroscoe
Member

@kyungjoo-kim said:

I pulled again and tested, but I cannot reproduce the error with commit 3ea64d1.

Let's wait and see if @fryeguy52 can reproduce this on 'white' in his own account and go from there.

@kddevin
Contributor

kddevin commented Jan 29, 2019

Note that #4293 disabled this test; we'll need to re-enable it when this work is complete.
TpetraCore_BlockCrsMatrix_MPI_4_DISABLE

@bartlettroscoe
Member

bartlettroscoe commented Jan 29, 2019

@kddevin said:

Note that #4293 disabled this test; we'll need to re-enable it when this work is complete.
TpetraCore_BlockCrsMatrix_MPI_4_DISABLE

PR #4293 only disables that test for the CUDA PR build, not the ATDM Trilinos builds. (There is no relationship between these two sets of builds and that is on purpose.)

The question is whether this failing test needs to be fixed before the ATDM APPs get an updated version of Trilinos. Right now it is listed as ATDM Sev: Blocker. (My guess is that EMPIRE is not being impacted by this, because we would have heard about it.) Is this a real defect in Tpetra, or just a problem with the test?

@kyungjoo-kim
Contributor

From the failed test message,

  unpackRow: numBytesOut = 4 != numBytes = 156.
 [FAILED]  (0.106 sec) BlockCrsMatrix_double_int_int_Kokkos_Compat_KokkosCudaWrapperNode_write_UnitTest
 Location: /home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-debug/SRC_AND_BUILD/Trilinos/packages/tpetra/core/test/Block/BlockCrsMatrix.cpp:859
 
[waterman3:115046] *** An error occurred in MPI_Allreduce
[waterman3:115046] *** reported by process [3797417985,0]
[waterman3:115046] *** on communicator MPI_COMM_WORLD
[waterman3:115046] *** MPI_ERR_OTHER: known error not in list
[waterman3:115046] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[waterman3:115046] ***    and potentially your MPI job)

I first examined the packing and unpacking routines for a mistake in the DualView sync, since the input array to the unpack routine, imports, is all zeros. Then I saw the Allreduce error. I am not sure which error triggers which: it is possible that MPI_Allreduce fails and that corrupts the importer, but the other direction is also possible, i.e., something is not synced from the device and that causes the MPI error. In another BlockCrs test, creating a map (which is unrelated to the BlockCrs code itself; it just happens inside the BlockCrs test) also fails with an MPI Allreduce error.
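
To make the suspected failure mode concrete, the sketch below shows the Kokkos::DualView modify/sync protocol that has to run before a host-side consumer (MPI, or an unpack routine reading imports) touches data that was last written on the device. This is only an illustration assuming a recent Kokkos with the modify_device()/sync_host() convenience calls; it is not the Tpetra pack/unpack code itself.

// Minimal sketch of the Kokkos::DualView modify/sync protocol.  Assumes a
// recent Kokkos; not Tpetra code, only an illustration of the suspected
// failure mode (device writes read on the host without a sync).
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>
#include <cstdio>

int main (int argc, char* argv[])
{
  Kokkos::initialize (argc, argv);
  {
    Kokkos::DualView<int*> buf ("buf", 4);
    const int n = static_cast<int> (buf.extent (0));

    // Fill the buffer on the device and record that the device copy changed.
    auto d = buf.view_device ();
    Kokkos::parallel_for ("fill", n, KOKKOS_LAMBDA (const int i) {
      d(i) = 42 + i;
    });
    buf.modify_device ();

    // Without this sync, a host-side reader (e.g., code handing the raw host
    // pointer to MPI, or an unpack routine) sees stale host data (typically
    // all zeros), consistent with the "numBytesOut = 4 != numBytes = 156"
    // failure above.
    buf.sync_host ();

    auto h = buf.view_host ();
    for (int i = 0; i < n; ++i) {
      std::printf ("buf(%d) = %d\n", i, h(i));
    }
  }
  Kokkos::finalize ();
  return 0;
}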

@kyungjoo-kim
Contributor

@bartlettroscoe PR #4307 will fix the BlockCrs unit test failures. After the PR is merged, would you please re-enable the test?

@bartlettroscoe
Member

@kyungjoo-kim said:

@bartlettroscoe PR #4307 will fix the BlockCrs unit test failures. After the PR is merged, would you please re-enable the test?

Thanks for the fix!

Someone will need to revert PR #4293 after we confirm that this test is fixed in the ATDM builds (where the test was never disabled; PR #4293 only disabled it in the Trilinos PR build controlled by the @trilinos/framework team).

@mhoemmen
Contributor

mhoemmen commented Feb 5, 2019

@kyungjoo-kim Can we re-enable that test now?

@bartlettroscoe
Member

With the merge of PR #4307 onto 'develop' on 2/4/2019, the test TpetraCore_BlockCrsMatrix_MPI_4 appears to be passing in all of the ATDM Trilinos builds as of 2/5/2019. See the table below.

I will leave this open until the CUDA PR testing gets this test enabled again by reverting PR #4293.


Tests with issue trackers Passed: twip=6 (Testing day 2019-02-05)

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
|------|------------|-----------|--------|---------|-----------------------|-----------------------|-------------------|---------------|
| waterman | Trilinos-atdm-waterman-cuda-9.2-debug | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 12 | 17 | #4257 |
| waterman | Trilinos-atdm-waterman-cuda-9.2-opt | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 12 | 17 | #4257 |
| waterman | Trilinos-atdm-waterman-cuda-9.2-release-debug | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 12 | 15 | #4257 |
| white | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 12 | 20 | #4257 |
| white | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 10 | 17 | #4257 |
| white | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 11 | 18 | #4257 |

@bartlettroscoe
Member

FYI: I created the revert PR #4326 to re-enable this test in the Trilinos CUDA PR build. Just need someone to approve this PR and get it merged. Then we can close this issue.

@mhoemmen
Contributor

mhoemmen commented Feb 5, 2019

Thanks Ross! I just approved the PR.
