Random segfault in MergeEntries kernel #783
Here is what I'm running with now, with the prefetches and fences added:

```cpp
{
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MergeEntriesFunctor<size_type, ordinal_type, alno_row_view_t_, blno_row_view_t_,
                      clno_row_view_t_, clno_nnz_view_t_>
      mergeEntries(nrows, a_rowmap, b_rowmap, c_rowmap_upperbound, c_rowmap,
                   c_entries_uncompressed, ab_perm, a_pos, b_pos);
  Kokkos::DefaultExecutionSpace space;
  size_t bytes;

  auto ptr_a_rowmap = a_rowmap.data();
  bytes = a_rowmap.span() * sizeof(typename alno_row_view_t_::value_type);
  CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_a_rowmap, bytes,
                                      space.cuda_device(), space.cuda_stream()));

  auto ptr_b_rowmap = b_rowmap.data();
  bytes = b_rowmap.span() * sizeof(typename blno_row_view_t_::value_type);
  CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_b_rowmap, bytes,
                                      space.cuda_device(), space.cuda_stream()));

  auto ptr_c_rowmap_upperbound = c_rowmap_upperbound.data();
  bytes = c_rowmap_upperbound.span() * sizeof(typename clno_row_view_t_::value_type);
  CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_c_rowmap_upperbound, bytes,
                                      space.cuda_device(), space.cuda_stream()));

  auto ptr_c_rowmap = c_rowmap.data();
  bytes = c_rowmap.span() * sizeof(typename clno_row_view_t_::value_type);
  CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_c_rowmap, bytes,
                                      space.cuda_device(), space.cuda_stream()));

  auto ptr_c_entries_uncompressed = c_entries_uncompressed.data();
  bytes = c_entries_uncompressed.span() * sizeof(typename clno_nnz_view_t_::value_type);
  CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_c_entries_uncompressed, bytes,
                                      space.cuda_device(), space.cuda_stream()));

  auto ptr_ab_perm = ab_perm.data();
  bytes = ab_perm.span() * sizeof(typename clno_nnz_view_t_::value_type);
  CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_ab_perm, bytes,
                                      space.cuda_device(), space.cuda_stream()));

  auto ptr_a_pos = a_pos.data();
  bytes = a_pos.span() * sizeof(typename clno_nnz_view_t_::value_type);
  CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_a_pos, bytes,
                                      space.cuda_device(), space.cuda_stream()));

  auto ptr_b_pos = b_pos.data();
  bytes = b_pos.span() * sizeof(typename clno_nnz_view_t_::value_type);
  CUDA_SAFE_CALL(cudaMemPrefetchAsync(ptr_b_pos, bytes,
                                      space.cuda_device(), space.cuda_stream()));

  for (int k = 0; k < 100; k++) {
    if (rank == 0)
      std::cout << "iteration " << k << std::endl;
    Kokkos::fence();
    Kokkos::memory_fence();
    Kokkos::parallel_for(
        "KokkosSparse::SpAdd:Symbolic::InputNotSorted::MergeEntries",
        range_type(0, nrows), mergeEntries);
    Kokkos::fence();
    Kokkos::memory_fence();
    MPI_Barrier(MPI_COMM_WORLD);
  }
}
```
@cgcgcg Still the same, as in it succeeds a few times before failing?
Wow lol ... that is unexpected. I thought if you added that much crud it would fail reliably or never. So it still doesn't fail on the first iteration?
Yes, on one run on iteration 2, on the next run on iteration 10.
🤯
ok, next thing: read all the ptrs of all the views, and make sure that they do not overlap (I mean, I would assume that nothing is aliasing, but who knows).
So in theory, then, this should fail as a standalone test, no?
Looking at the EMPIRE code, it seems that this is coming out of the second SpAdd. We have never seen it happening in the first. (And I'm doing the looping only for the second one.)
In theory it either fails on the first iteration or never ... Clearly theory is wrong ... (assuming no aliasing data ranges, or out-of-bounds writes)
@cgcgcg Can you turn on array bounds checking just in that file?
Oh wait, you can't -- it's all of Kokkos, nvm.
The arrays are non-overlapping, checked on a rank that segfaulted.
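For reference, the aliasing check asked for above amounts to pairwise interval-overlap tests on the raw pointers. A minimal sketch (the helper name `ranges_overlap` is mine, not the code actually used in the debugging session):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Two byte ranges [p, p+n) overlap iff each one starts before the
// other one ends. Applied pairwise to every View's data()/span()
// this rules out aliasing between the kernel's arguments.
bool ranges_overlap(const void* p1, std::size_t n1,
                    const void* p2, std::size_t n2) {
  auto a = reinterpret_cast<std::uintptr_t>(p1);
  auto b = reinterpret_cast<std::uintptr_t>(p2);
  return a < b + n2 && b < a + n1;
}
```

Note that adjacent allocations that merely touch (one ends exactly where the next begins) do not count as overlapping.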
Is EMPIRE's use of the two spadds related? Meaning: do they pass the same handle between the two, is one matrix the result of the previous add, is there any other connection between the two?
Even so: there can't possibly be anything else running still, since there are global fences issued. Nothing should be able to get past those, which means the only thing running is that one parallel_for kernel.
I am worried something is reused. E.g. the handle should not be reused, as we might have kernel-specific things there.
ok, that looks right!
Just remembered this very old issue (old links). Do we use any virtual base classes? Pretty sure Tpetra uses them, starting with DistObject? Could this be a problem, some vpointer not working?
Is there any pattern to the addresses that are segfaulting? Can we map the addresses that segfault to the nearest
@e10harvey The way we are calling the kernel, the memory addresses of the arrays shouldn't change, right?
@srajama1 It looks to me like everything going into the functor is views and an integer, though. The functor itself isn't derived either. Could this still affect us?
@cgcgcg I never understood this level of the internals properly. I posted this issue because of two/three potential issues.
#3 is the part that most worries me: are we trying to copy too much too fast before some cleanup comes in? This is especially out of my comfort zone. @crtrott Any way to eliminate item 3? A sleep as the first line of the k=0...99 loop? If there is an option in Kokkos where we turn off this copy to constant memory, we could try it as well.
@srajama1 So "When the closure is small" is referring to the size of the data captured?
@jhux2 Yes for a lambda, or just sizeof(functor)
@brian-kelley Thanks for the clarification. If data size were the problem, shouldn't running on fewer ranks increase the chance of this happening? @cgcgcg reported that running on fewer ranks made the error harder to manifest. (This was prior to him putting a loop around MergeEntries.)
@jhux2 The functors just contain integers and Views, which are just the pointer/size/span metadata and are fixed size. I could print out sizeof(MergeEntriesFunctor<...>) to give a specific number, to compare to the threshold for constant-memory dispatch.
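For a rough sense of scale, that size check can be sketched with stand-in types (the struct layouts here are assumptions for illustration, not Kokkos's actual View header, which varies with layout and rank; the 4096-byte figure is CUDA's usual kernel-argument limit, and the exact threshold Kokkos uses may differ by version):

```cpp
#include <cstddef>

// Stand-in for the per-View footprint captured by a functor:
// a data pointer plus size/stride metadata.
struct ViewStub {
  void*       data;
  std::size_t extent;
  std::size_t stride;
};

// Rough stand-in for MergeEntriesFunctor: one integer plus eight Views.
struct MergeEntriesStub {
  long     nrows;
  ViewStub views[8];
};

// CUDA passes kernel arguments by value up to this many bytes; above
// its internal limit Kokkos instead launches the functor through
// constant (or global) memory.
constexpr std::size_t kCudaKernelArgLimit = 4096;

constexpr bool fits_in_kernel_args(std::size_t functor_bytes) {
  return functor_bytes <= kCudaKernelArgLimit;
}
```

With eight pointer-plus-metadata Views and an integer, such a functor sits around a couple hundred bytes, well under the argument limit.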
What is the nvprof output? Is it reporting a local or constant dispatch (it's part of the kernel name nvprof returns)? If you want to force it, you can pass this as an argument to the global function:

```cpp
Kokkos::parallel_for(
    "Label",
    Kokkos::Experimental::require(
        Kokkos::RangePolicy<ExecSpace>(0, N),
        Kokkos::Experimental::WorkItemProperty::HintLightWeight),
    functor);
```

Also, there isn't any inheritance going on, or? So no base class stuff.
@crtrott Do I need to have built with -G for that?
no
I am not familiar with how the memory is allocated and what other processes are running when the code segfaults. Regardless of whether the addresses are changing or not, there could be a pattern as to which offsets into a given argument result in segfaults -- which could provide further insight into where to investigate.
@e10harvey the memory addresses of the arrays can't change. There is no mechanism which would do that.
@brian-kelley @cgcgcg This is something worth eliminating.
Which "this"?
I meant item no. 3 in my list above.
One thing you can try: dump the arrays after an error. Instead of Kokkos::fence() do this:
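The snippet that followed that suggestion is not shown. One plausible reading of it is to synchronize the device directly and check the returned CUDA error before dumping the views; a sketch of that pattern (the helper name `checked_fence` and the call site are mine, not from the thread):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical replacement for Kokkos::fence(): synchronize the device
// and surface any asynchronous CUDA error, so the host can dump the
// views while their contents are still available, before aborting.
inline void checked_fence(const char* where) {
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    std::fprintf(stderr, "CUDA error at %s: %s\n",
                 where, cudaGetErrorString(err));
    // ... dump a_rowmap, b_rowmap, ab_perm, etc. here ...
    std::abort();
  }
}
```

Note that after a "Warp Illegal Address" error the CUDA context is typically corrupted, so the dump itself must read only host-accessible (e.g. managed) memory.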
Any updates?
Closing this, tracked down to a compiler or runtime bug that only appears in CUDA 10.1. STK MPI issue was not related. New version of spadd (develop branch of KokkosKernels and Trilinos) isn't affected. |
EMPIRE has recently been experiencing random "Warp Illegal Address" segfaults on CUDA (vortex and lassen). We narrowed the location down to
kokkos-kernels/src/sparse/KokkosSparse_spadd.hpp
Lines 417 to 419 in 07a60bc
For debugging, we run these two lines in a loop, adding a global MPI collective at the end. After a certain number of these calls go through, a segfault occurs; 100 iterations are typically enough to reliably reach the segfault.
It seems that the kernel
kokkos-kernels/src/sparse/KokkosSparse_spadd.hpp
Lines 283 to 316 in 07a60bc
reads several arrays and writes several arrays, but no array is read-write. In particular, the arrays used for determining access locations are read-only.
Adding a print statement to the functor that fires when `BrowStart + Bindex` is out of the row never prints, and all kernel calls then go through without a segfault.