
Random segfault in MergeEntries kernel #783

Closed
cgcgcg opened this issue Aug 12, 2020 · 39 comments

@cgcgcg
Contributor

cgcgcg commented Aug 12, 2020

EMPIRE has recently been experiencing random "Warp Illegal Address" segfaults with CUDA (on vortex and lassen). We narrowed the location down to

MergeEntriesFunctor<size_type, ordinal_type, alno_row_view_t_, blno_row_view_t_, clno_row_view_t_, clno_nnz_view_t_>
mergeEntries(nrows, a_rowmap, b_rowmap, c_rowmap_upperbound, c_rowmap, c_entries_uncompressed, ab_perm, a_pos, b_pos);
Kokkos::parallel_for("KokkosSparse::SpAdd:Symbolic::InputNotSorted::MergeEntries", range_type(0, nrows), mergeEntries);

For debugging, we run these two lines in a loop, adding a global MPI collective at the end of each iteration. After a certain number of these calls go through, a segfault occurs; 100 iterations are typically enough to reliably hit it.

It seems that the kernel

{
  size_type CrowStart = Crowptrs(i);
  size_type CrowEnd = Crowptrs(i + 1);
  size_type ArowStart = Arowptrs(i);
  size_type ArowNum = Arowptrs(i + 1) - ArowStart;
  size_type BrowStart = Browptrs(i);
  ordinal_type CFit = 0; //counting through merged C indices (within row)
  for(size_type Cit = CrowStart; Cit < CrowEnd; Cit++)
  {
    size_type permVal = ABperm(Cit);
    if(permVal < ArowNum)
    {
      //Entry belongs to A
      ordinal_type Aindex = permVal;
      //The Aindex'th entry in row i of A will be added into the CFit'th entry in C
      Apos(ArowStart + Aindex) = CFit;
    }
    else
    {
      //Entry belongs to B
      ordinal_type Bindex = permVal - ArowNum;
      //The Bindex'th entry in row i of B will be added into the CFit'th entry in C
      Bpos(BrowStart + Bindex) = CFit;
    }
    //if NOT merging uncompressed entries Cit and Cit + 1, increment compressed index CFit
    bool mergingWithNext = Cit < CrowEnd - 1 && Ccolinds(Cit) == Ccolinds(Cit + 1);
    if(!mergingWithNext)
      CFit++;
  }
  //at end of the row, know how many entries are in merged C
  Crowcounts(i) = CFit;
  if(i == nrows - 1)
    Crowcounts(nrows) = 0;
}

reads several arrays and writes several arrays, but no array is read-write. In particular, the arrays used for determining access locations are read-only.

Adding a print statement to the functor that fires when BrowStart + Bindex is outside the row never prints anything, and with it in place all kernel calls go through without a segfault.

@cgcgcg
Contributor Author

cgcgcg commented Aug 12, 2020

@bathmatt @nmhamster @kddevin

@jhux2

jhux2 commented Aug 12, 2020

@srajama1 @brian-kelley

@cgcgcg
Contributor Author

cgcgcg commented Aug 12, 2020

Here is what I'm running with now, with added fence, memory_fence, and prefetch calls. The observed behavior is still the same.

{
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD,&rank);

        MergeEntriesFunctor<size_type, ordinal_type, alno_row_view_t_, blno_row_view_t_, clno_row_view_t_, clno_nnz_view_t_>
          mergeEntries(nrows, a_rowmap, b_rowmap, c_rowmap_upperbound, c_rowmap, c_entries_uncompressed, ab_perm, a_pos, b_pos);

        Kokkos::DefaultExecutionSpace space;
        size_t bytes;

        auto ptr_a_rowmap = a_rowmap.data();
        bytes = a_rowmap.span() * sizeof(typename alno_row_view_t_::value_type);
        CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_a_rowmap,
                                            bytes,
                                            space.cuda_device(),
                                            space.cuda_stream()));

        auto ptr_b_rowmap = b_rowmap.data();
        bytes = b_rowmap.span() * sizeof(typename blno_row_view_t_::value_type);
        CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_b_rowmap,
                                            bytes,
                                            space.cuda_device(),
                                            space.cuda_stream()));

        auto ptr_c_rowmap_upperbound = c_rowmap_upperbound.data();
        bytes = c_rowmap_upperbound.span() * sizeof(typename clno_row_view_t_::value_type);
        CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_c_rowmap_upperbound,
                                            bytes,
                                            space.cuda_device(),
                                            space.cuda_stream()));

        auto ptr_c_rowmap = c_rowmap.data();
        bytes = c_rowmap.span() * sizeof(typename clno_row_view_t_::value_type);
        CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_c_rowmap,
                                            bytes,
                                            space.cuda_device(),
                                            space.cuda_stream()));

        auto ptr_c_entries_uncompressed = c_entries_uncompressed.data();
        bytes = c_entries_uncompressed.span() * sizeof(typename clno_nnz_view_t_::value_type);
        CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_c_entries_uncompressed,
                                            bytes,
                                            space.cuda_device(),
                                            space.cuda_stream()));

        auto ptr_ab_perm = ab_perm.data();
        bytes = ab_perm.span() * sizeof(typename clno_nnz_view_t_::value_type);
        CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_ab_perm,
                                            bytes,
                                            space.cuda_device(),
                                            space.cuda_stream()));

        auto ptr_a_pos = a_pos.data();
        bytes = a_pos.span() * sizeof(typename clno_nnz_view_t_::value_type);
        CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_a_pos,
                                            bytes,
                                            space.cuda_device(),
                                            space.cuda_stream()));

        auto ptr_b_pos = b_pos.data();
        bytes = b_pos.span() * sizeof(typename clno_nnz_view_t_::value_type);
        CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_b_pos,
                                            bytes,
                                            space.cuda_device(),
                                            space.cuda_stream()));
        
        for (int k = 0; k<100; k++) {
          if (rank == 0)
            std::cout << "iteration " << k << std::endl;
          Kokkos::fence();
          Kokkos::memory_fence();
          Kokkos::parallel_for("KokkosSparse::SpAdd:Symbolic::InputNotSorted::MergeEntries", range_type(0, nrows), mergeEntries);
          Kokkos::fence();
          Kokkos::memory_fence();
          MPI_Barrier(MPI_COMM_WORLD);
        }
      }

@brian-kelley
Contributor

@cgcgcg Still the same, as in it succeeds a few times before failing?

@crtrott
Member

crtrott commented Aug 12, 2020

Wow lol ... that is unexpected. I thought if you added that much crud it would fail reliably or never. So it still doesn't fail on the first iteration?

@cgcgcg
Contributor Author

cgcgcg commented Aug 12, 2020

Yes, on one run on iteration 2, on the next run on iteration 10.

@brian-kelley
Contributor

🤯

@crtrott
Member

crtrott commented Aug 12, 2020

OK, next thing: read the pointers of all the views and make sure that they do not overlap (I mean, I would assume that nothing is aliasing, but who knows).

@jhux2

jhux2 commented Aug 12, 2020

So in theory, then, this should fail as a standalone test, no?

@cgcgcg
Contributor Author

cgcgcg commented Aug 13, 2020

Looking at the EMPIRE code, it seems that this is coming out of the second SpAdd. We have never seen it happening in the first. (And I'm doing the looping only for the second one.)

@crtrott
Member

crtrott commented Aug 13, 2020

in theory it either fails on the first iteration or never ...

Clearly theory is wrong ...

(assuming no aliasing data ranges, or out of bounds writes)

@jhux2

jhux2 commented Aug 13, 2020

> (assuming no aliasing data ranges, or out of bounds writes)

@cgcgcg Can you turn on array bounds checking just in that file?

@jhux2

jhux2 commented Aug 13, 2020


Oh wait, you can't -- it's all of Kokkos, nvm.
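Rebuilding all of Kokkos with its global bounds-check option turned on is possible, though. Assuming a CMake-based build, reconfiguring would look roughly like:

```shell
# Reconfigure with view bounds checking enabled (debug builds only --
# this adds a check to every View access). Option names per Kokkos'
# CMake keywords.
cmake .. \
  -DKokkos_ENABLE_DEBUG=ON \
  -DKokkos_ENABLE_DEBUG_BOUNDS_CHECK=ON
```

Out-of-bounds View accesses then abort with the offending index instead of silently corrupting memory.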

@cgcgcg
Contributor Author

cgcgcg commented Aug 13, 2020

The arrays are non-overlapping, checked on a rank that segfaulted.

@srajama1
Contributor

Are EMPIRE's two spadd calls related? Meaning, do they pass the same handle between the two, is one matrix the result of the previous add, or is there any other connection between the two?

@crtrott
Member

crtrott commented Aug 13, 2020

Even so: there can't possibly be anything else still running, since there are global fences issued. Nothing should be able to get past those, which means the only thing running is that one parallel_for kernel.

@srajama1
Contributor

I am worried something is reused. For example, the handle should not be reused, as we might have kernel-specific things there.

@cgcgcg
Contributor Author

cgcgcg commented Aug 13, 2020

@srajama1
Contributor

OK, that looks right!

@srajama1
Contributor

Just remembered these very old issues:

Old links:
kokkos/kokkos#628
kokkos/kokkos#1789

Do we use any virtual base classes? Pretty sure Tpetra uses them, starting with DistObject. Could this be a problem, some vpointer not working?

@e10harvey
Contributor

Is there any pattern to the addresses that are segfaulting? Can we map the addresses that segfault to the nearest SpAdd argument and chase that code path down into SpAdd for clues? Another idea is to initialize all memory with the memory's location or 0xdeadbeef, if possible.

@cgcgcg
Contributor Author

cgcgcg commented Aug 13, 2020

@e10harvey The way we are calling the kernel, the memory addresses of the arrays shouldn't change, right?

@cgcgcg
Contributor Author

cgcgcg commented Aug 13, 2020

@srajama1 It looks to me like everything going into the functor is Views and an integer, though. The functor itself isn't derived from anything either. Could this still affect us?

@srajama1
Contributor

@cgcgcg I never understood this level of the internals properly. I posted this issue because of two/three potential issues.

  1. All the *rowptrsT and ABperm are const. There was a comment on the linked issue with const. I am not sure if it will fail the first time or not.
  2. There is this comment "It is not allowed to pass as an argument to a global function an object of a class derived from virtual base classes." Not sure where the views are actually held.
  3. There is this comment "When the closure is small Kokkos passes the closure by value to the global dispatch function. When the closure is large it is copied to CUDA constant memory, which is likely to have similar problems."

#3 is the part that most worries me. Are we trying to copy too much too fast before some cleanup comes in? This is especially out of my comfort zone. @crtrott Any way to eliminate item 3? A sleep as the first line of the k=0...99 loop? If there is an option in Kokkos where we turn off this copy to constant memory, we could try it as well.

@jhux2

jhux2 commented Aug 13, 2020

@srajama1 So "When the closure is small" is referring to the size of the data captured?

@brian-kelley
Contributor

@jhux2 Yes for a lambda, or just sizeof(functor)

@jhux2

jhux2 commented Aug 13, 2020

@brian-kelley Thanks for the clarification. If data size were the problem, shouldn't running on fewer ranks increase the chance of this happening? @cgcgcg reported that running on fewer ranks made the error harder to manifest. (This was prior to him putting a loop around MergeEntries.)

@brian-kelley
Contributor

@jhux2 The functors just contain integers and Views, which are just the pointer/size/span metadata and are fixed size. I could print out sizeof(MergeEntriesFunctor<...>) to give a specific number, to compare to the threshold for <<< >>> arg vs. constant cache.

@crtrott
Member

crtrott commented Aug 13, 2020

What is the nvprof output? Is it reporting use of local or constant dispatch (part of the kernel name nvprof returns)? If you want to force the functor to be passed as an argument to the global function, you can do this:

 Kokkos::parallel_for(
      "Label",
      Kokkos::Experimental::require(
          Kokkos::RangePolicy<ExecSpace>(0, N),
          Kokkos::Experimental::WorkItemProperty::HintLightWeight),
      functor);

Also, there isn't any inheritance going on, right? So no base class stuff.

@brian-kelley
Contributor

@crtrott Do I need to have built with -G for that?

@crtrott
Member

crtrott commented Aug 13, 2020

no

@e10harvey
Contributor

> @e10harvey The way we are calling the kernel, the memory addresses of the arrays shouldn't change, right?

I am not familiar with how the memory is allocated and what other processes are running when the code segfaults. Regardless of whether the addresses are changing or not, there could be a pattern as to which offsets into a given argument result in segfaults -- which could provide further insight into where to investigate.

@crtrott
Member

crtrott commented Aug 13, 2020

@e10harvey the memory addresses of the arrays can't change. There is no mechanism which would do that.

@srajama1
Contributor

@brian-kelley @cgcgcg This is something worth eliminating.

@jhux2

jhux2 commented Aug 13, 2020

> @brian-kelley @cgcgcg This is something worth eliminating.

Which "this"?

@srajama1
Contributor

I meant item no 3 in my list above.

@crtrott
Member

crtrott commented Aug 13, 2020

One thing you can try: dump the arrays after an error. Instead of Kokkos::fence(), do this:

  auto e = cudaDeviceSynchronize();
  if(e != cudaSuccess) {
    std::cout << name << " error (" << cudaGetErrorName(e) << "): " << cudaGetErrorString(e) << std::endl;
    dump_all_my_arrays_of_the_functor(functor, file);
    // Exit or throw or whatever
  }

@crtrott
Member

crtrott commented Aug 13, 2020

Any updates?

@brian-kelley
Contributor

Closing this; tracked down to a compiler or runtime bug that only appears with CUDA 10.1. The STK MPI issue was not related. The new version of spadd (develop branch of KokkosKernels and Trilinos) isn't affected.
