
Random segfault in MergeEntries kernel #783

Closed
cgcgcg opened this issue Aug 12, 2020 · 39 comments

@cgcgcg
Contributor

cgcgcg commented Aug 12, 2020

EMPIRE has recently been experiencing random "Warp Illegal Address" segfaults with CUDA (on vortex and lassen). We narrowed the location down to

MergeEntriesFunctor<size_type, ordinal_type, alno_row_view_t_, blno_row_view_t_, clno_row_view_t_, clno_nnz_view_t_>
mergeEntries(nrows, a_rowmap, b_rowmap, c_rowmap_upperbound, c_rowmap, c_entries_uncompressed, ab_perm, a_pos, b_pos);
Kokkos::parallel_for("KokkosSparse::SpAdd:Symbolic::InputNotSorted::MergeEntries", range_type(0, nrows), mergeEntries);

For debugging, we run these two lines in a loop, adding a global MPI collective at the end of each iteration. After a certain number of these calls go through, a segfault occurs; 100 iterations are typically enough to reliably hit it.

It seems that the kernel

{
  size_type CrowStart = Crowptrs(i);
  size_type CrowEnd = Crowptrs(i + 1);
  size_type ArowStart = Arowptrs(i);
  size_type ArowNum = Arowptrs(i + 1) - ArowStart;
  size_type BrowStart = Browptrs(i);
  ordinal_type CFit = 0; //counting through merged C indices (within row)
  for(size_type Cit = CrowStart; Cit < CrowEnd; Cit++)
  {
    size_type permVal = ABperm(Cit);
    if(permVal < ArowNum)
    {
      //Entry belongs to A
      ordinal_type Aindex = permVal;
      //The Aindex'th entry in row i of A will be added into the CFit'th entry in C
      Apos(ArowStart + Aindex) = CFit;
    }
    else
    {
      //Entry belongs to B
      ordinal_type Bindex = permVal - ArowNum;
      //The Bindex'th entry in row i of B will be added into the CFit'th entry in C
      Bpos(BrowStart + Bindex) = CFit;
    }
    //if NOT merging uncompressed entries Cit and Cit + 1, increment compressed index CFit
    bool mergingWithNext = Cit < CrowEnd - 1 && Ccolinds(Cit) == Ccolinds(Cit + 1);
    if(!mergingWithNext)
      CFit++;
  }
  //at end of the row, know how many entries are in merged C
  Crowcounts(i) = CFit;
  if(i == nrows - 1)
    Crowcounts(nrows) = 0;
}

reads several arrays and writes several arrays, but no array is read-write. In particular, the arrays used for determining access locations are read-only.

Adding a print statement to the functor that fires when BrowStart + Bindex is outside the row never prints anything, and with it in place all kernel calls go through without a segfault.

@cgcgcg
Contributor Author

cgcgcg commented Aug 12, 2020

@bathmatt @nmhamster @kddevin

@jhux2

jhux2 commented Aug 12, 2020

@srajama1 @brian-kelley

@cgcgcg
Contributor Author

cgcgcg commented Aug 12, 2020

Here is what I'm running with now, with added fence, memory_fence, and prefetch calls. The observed behavior is still the same.

{
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD,&rank);

        MergeEntriesFunctor<size_type, ordinal_type, alno_row_view_t_, blno_row_view_t_, clno_row_view_t_, clno_nnz_view_t_>
          mergeEntries(nrows, a_rowmap, b_rowmap, c_rowmap_upperbound, c_rowmap, c_entries_uncompressed, ab_perm, a_pos, b_pos);

        Kokkos::DefaultExecutionSpace space;
        size_t bytes;

        auto ptr_a_rowmap = a_rowmap.data();
        bytes = a_rowmap.span() * sizeof(typename alno_row_view_t_::value_type);
        CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_a_rowmap,
                                            bytes,
                                            space.cuda_device(),
                                            space.cuda_stream()));

        auto ptr_b_rowmap = b_rowmap.data();
        bytes = b_rowmap.span() * sizeof(typename blno_row_view_t_::value_type);
        CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_b_rowmap,
                                            bytes,
                                            space.cuda_device(),
                                            space.cuda_stream()));

        auto ptr_c_rowmap_upperbound = c_rowmap_upperbound.data();
        bytes = c_rowmap_upperbound.span() * sizeof(typename clno_row_view_t_::value_type);
        CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_c_rowmap_upperbound,
                                            bytes,
                                            space.cuda_device(),
                                            space.cuda_stream()));

        auto ptr_c_rowmap = c_rowmap.data();
        bytes = c_rowmap.span() * sizeof(typename clno_row_view_t_::value_type);
        CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_c_rowmap,
                                            bytes,
                                            space.cuda_device(),
                                            space.cuda_stream()));

        auto ptr_c_entries_uncompressed = c_entries_uncompressed.data();
        bytes = c_entries_uncompressed.span() * sizeof(typename clno_nnz_view_t_::value_type);
        CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_c_entries_uncompressed,
                                            bytes,
                                            space.cuda_device(),
                                            space.cuda_stream()));

        auto ptr_ab_perm = ab_perm.data();
        bytes = ab_perm.span() * sizeof(typename clno_nnz_view_t_::value_type);
        CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_ab_perm,
                                            bytes,
                                            space.cuda_device(),
                                            space.cuda_stream()));

        auto ptr_a_pos = a_pos.data();
        bytes = a_pos.span() * sizeof(typename clno_nnz_view_t_::value_type);
        CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_a_pos,
                                            bytes,
                                            space.cuda_device(),
                                            space.cuda_stream()));

        auto ptr_b_pos = b_pos.data();
        bytes = b_pos.span() * sizeof(typename clno_nnz_view_t_::value_type);
        CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_b_pos,
                                            bytes,
                                            space.cuda_device(),
                                            space.cuda_stream()));
        
        for (int k = 0; k<100; k++) {
          if (rank == 0)
            std::cout << "iteration " << k << std::endl;
          Kokkos::fence();
          Kokkos::memory_fence();
          Kokkos::parallel_for("KokkosSparse::SpAdd:Symbolic::InputNotSorted::MergeEntries", range_type(0, nrows), mergeEntries);
          Kokkos::fence();
          Kokkos::memory_fence();
          MPI_Barrier(MPI_COMM_WORLD);
        }
      }

@brian-kelley
Contributor

@cgcgcg Still the same, as in it succeeds a few times before failing?

@crtrott
Member

crtrott commented Aug 12, 2020

Wow lol ... that is unexpected. I thought if you added that much crud it would fail reliably or never. So it still doesn't fail on the first iteration?

@cgcgcg
Contributor Author

cgcgcg commented Aug 12, 2020

Yes, on one run on iteration 2, on the next run on iteration 10.

@brian-kelley
Contributor

🤯

@crtrott
Member

crtrott commented Aug 12, 2020

OK, next thing: read the pointers of all the views and make sure that they do not overlap (I mean, I would assume that nothing is aliasing, but who knows).

@jhux2

jhux2 commented Aug 12, 2020

So in theory, then, this should fail as a standalone test, no?

@cgcgcg
Contributor Author

cgcgcg commented Aug 13, 2020

Looking at the EMPIRE code, it seems that this is coming out of the second SpAdd. We have never seen it happening in the first. (And I'm doing the looping only for the second one.)

@crtrott
Member

crtrott commented Aug 13, 2020

in theory it either fails on the first iteration or never ...

Clearly theory is wrong ...

(assuming no aliasing data ranges, or out of bounds writes)

@jhux2

jhux2 commented Aug 13, 2020

> (assuming no aliasing data ranges, or out of bounds writes)

@cgcgcg Can you turn on array bounds checking just in that file?

@jhux2

jhux2 commented Aug 13, 2020


Oh wait, you can't -- it's all of Kokkos, nvm.
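Rebuilding all of Kokkos with its global bounds-check option turned on is possible, though. Assuming a CMake-based build, reconfiguring would look roughly like:

```shell
# Reconfigure with view bounds checking enabled (debug builds only --
# this adds a check to every View access). Option names per Kokkos'
# CMake keywords.
cmake .. \
  -DKokkos_ENABLE_DEBUG=ON \
  -DKokkos_ENABLE_DEBUG_BOUNDS_CHECK=ON
```

Out-of-bounds View accesses then abort with the offending index instead of silently corrupting memory.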

@cgcgcg
Contributor Author

cgcgcg commented Aug 13, 2020

The arrays are non-overlapping, checked on a rank that segfaulted.

@srajama1
Contributor

Are EMPIRE's two spadd calls related? Meaning, do they pass the same handle between the two, is one matrix the result of the previous add, or is there any other connection between the two?

@crtrott
Member

crtrott commented Aug 13, 2020

Even so: there can't possibly be anything else still running, since there are global fences issued. Nothing should be able to get past those, which means the only thing running is that one parallel_for kernel.

@srajama1
Contributor

I am worried something is reused. For example, the handle should not be reused, as we might have kernel-specific things there.

@cgcgcg
Contributor Author

cgcgcg commented Aug 13, 2020

@srajama1
Contributor

OK, that looks right!

@srajama1
Contributor

Just remembered these very old issues:

Old links:
kokkos/kokkos#628
kokkos/kokkos#1789

Do we use any virtual base classes? Pretty sure Tpetra uses them, starting with DistObject. Could this be a problem, some vpointer not working?

@e10harvey
Contributor

Is there any pattern to the addresses that are segfaulting? Can we map the addresses that segfault to the nearest SpAdd argument and chase that code path down into SpAdd for clues? Another idea is to initialize all memory with the memory's location or 0xdeadbeef, if possible.

@cgcgcg
Contributor Author

cgcgcg commented Aug 13, 2020

@e10harvey The way we are calling the kernel, the memory addresses of the arrays shouldn't change, right?

@cgcgcg
Contributor Author

cgcgcg commented Aug 13, 2020

@srajama1 It looks to me like everything going into the functor is Views and an integer, though. The functor itself isn't derived from anything either. Could this still affect us?

@srajama1
Contributor

@cgcgcg I never understood this level of the internals properly. I posted this issue because of two/three potential issues.

  1. All the *rowptrsT and ABperm are const. There was a comment on the linked issue with const. I am not sure if it will fail the first time or not.
  2. There is this comment "It is not allowed to pass as an argument to a global function an object of a class derived from virtual base classes." Not sure where the views are actually held.
  3. There is this comment "When the closure is small Kokkos passes the closure by value to the global dispatch function. When the closure is large it is copied to CUDA constant memory, which is likely to have similar problems."

#3 is the part that most worries me. Are we trying to copy too much too fast before some cleanup comes in? This is especially out of my comfort zone. @crtrott Any way to eliminate item 3? A sleep as the first line of the k=0...99 loop? If there is an option in Kokkos where we turn off this copy to constant memory, we could try it as well.

@jhux2

jhux2 commented Aug 13, 2020

@srajama1 So "When the closure is small" is referring to the size of the data captured?

@brian-kelley
Contributor

@jhux2 Yes for a lambda, or just sizeof(functor)

@jhux2

jhux2 commented Aug 13, 2020

@brian-kelley Thanks for the clarification. If data size were the problem, shouldn't running on fewer ranks increase the chance of this happening? @cgcgcg reported that running on fewer ranks made the error harder to manifest. (This was prior to him putting a loop around MergeEntries.)

@brian-kelley
Contributor

@jhux2 The functors just contain integers and Views, which are just the pointer/size/span metadata and are fixed size. I could print out sizeof(MergeEntriesFunctor<...>) to give a specific number, to compare to the threshold for <<< >>> arg vs. constant cache.

@crtrott
Member

crtrott commented Aug 13, 2020

What is the nvprof output? Is it reporting use of local or constant dispatch (part of the kernel name nvprof returns)? If you want to force the functor to be passed as an argument to the global function, you can do this:

 Kokkos::parallel_for(
      "Label",
      Kokkos::Experimental::require(
          Kokkos::RangePolicy<ExecSpace>(0, N),
          Kokkos::Experimental::WorkItemProperty::HintLightWeight),
      functor);

Also, there isn't any inheritance going on, right? So no base class stuff.

@brian-kelley
Contributor

@crtrott Do I need to have built with -G for that?

@crtrott
Member

crtrott commented Aug 13, 2020

no

@e10harvey
Contributor

> @e10harvey The way we are calling the kernel, the memory addresses of the arrays shouldn't change, right?

I am not familiar with how the memory is allocated and what other processes are running when the code segfaults. Regardless of whether the addresses are changing or not, there could be a pattern as to which offsets into a given argument result in segfaults -- which could provide further insight into where to investigate.

@crtrott
Member

crtrott commented Aug 13, 2020

@e10harvey the memory addresses of the arrays can't change. There is no mechanism which would do that.

@srajama1
Contributor

@brian-kelley @cgcgcg This is something worth eliminating.

@jhux2

jhux2 commented Aug 13, 2020

> @brian-kelley @cgcgcg This is something worth eliminating.

Which "this"?

@srajama1
Contributor

I meant item no 3 in my list above.

@crtrott
Member

crtrott commented Aug 13, 2020

One thing you can try: dump the arrays after an error. Instead of Kokkos::fence(), do this:

  auto e = cudaDeviceSynchronize();
  if(e != cudaSuccess) {
    std::cout << name << " error (" << cudaGetErrorName(e) << "): " << cudaGetErrorString(e) << std::endl;
    dump_all_my_arrays_of_the_functor(functor, file);
    // Exit or throw or whatever
  }

@crtrott
Member

crtrott commented Aug 13, 2020

Any updates?

@brian-kelley
Contributor

Closing this; tracked down to a compiler or runtime bug that only appears with CUDA 10.1. The STK MPI issue was not related. The new version of spadd (develop branch of KokkosKernels and Trilinos) isn't affected.
