Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tpetra: Invalid access of k_rowPtrs_ inside CrsGraph #8286

Closed
tasmith4 opened this issue Oct 30, 2020 · 19 comments
Closed

Tpetra: Invalid access of k_rowPtrs_ inside CrsGraph #8286

tasmith4 opened this issue Oct 30, 2020 · 19 comments
Labels
pkg: Tpetra type: bug The primary issue is a bug in Trilinos code or tests UVM removal

Comments

@tasmith4
Copy link
Contributor

Bug Report

@trilinos/tpetra

Description

When running the CrsGraph unit tests, Kokkos errors with "attempt to access inaccessible memory space" from line 1590 in TpetraCore_CrsGraph_def.hpp. This is an attempt to index into this->k_rowPtrs_ which is a view templated on device_type.

Steps to Reproduce

  1. Configure Tpetra for Cuda with -DKokkos_ENABLE_CUDA_UVM=OFF and -DTpetra_ENABLE_CUDA_UVM=OFF
  2. Build
  3. Run ctest -R "TpetraCore_CrsGraph_UnitTests0_MPI_4"
@tasmith4 tasmith4 added type: bug The primary issue is a bug in Trilinos code or tests pkg: Tpetra UVM removal labels Oct 30, 2020
@tasmith4
Copy link
Contributor Author

@trilinos/tpetra the solutions that come to mind are:

  1. Make k_rowPtrs_ a HostMirror.
  2. Use kernel launches to handle the logic for k_rowPtrs_.
  3. Copy back to host to do things with k_rowPtrs_

What are the performance/usability implications for the simplest solution of making it a HostMirror? Are there other preferable solutions that I'm missing?

@brian-kelley
Copy link
Contributor

@tasmith4 This function doesn't need to read from the whole view, justk_rowPtrs_(myRow) and k_rowPtrs_(myRow + 1) for packed 1D storage, or k_numRowEntries_(myRow) for unpacked 1D (the condition is just k_numAllocPerRow_.extent (0) is 0 for packed, otherwise unpacked). So I think the best option would be to just copy the entries you need, for packed:

auto myRowOffsets = Kokkos::subview(k_rowPtrs_, Kokkos::make_pair(myRow, myRow + 2));
auto myRowOffsetsHost = Kokkos::create_mirror_view_and_copy(myRowOffsets);
//offset1D is then myRowOffsetsHost(0), row length is myRowOffsetsHost(1) - offset1D

and for unpacked it would be similar but the subview would just be for the single index myRow, not a length-2 range.

@tasmith4
Copy link
Contributor Author

@brian-kelley definitely seems like a better solution than copying the whole array -- but why does it need to live on device in the first place?

@brian-kelley
Copy link
Contributor

@tasmith4 Ah I figured it out, it's never supposed to be accessed on device. Its type is this in CrsGraph_decl:

2251     typedef typename Kokkos::View<size_t*, Kokkos::LayoutLeft, device_type>::HostMirror num_row_entries_type;

So you will never run into this issue like you are with k_rowPtrs_.

@brian-kelley
Copy link
Contributor

brian-kelley commented Oct 30, 2020

Hmm, its type might as well be HostSpace rather than a HostMirror, since there's no reason for it to be UVM in a UVM build.

@tasmith4
Copy link
Contributor Author

Yeah I wasn't sure what the most general solution is since HostMirror isn't always HostSpace. Maybe we have a special node that wants the host side to be a particular LogicalMemorySpace? Not sure how much we should care about that

@brian-kelley
Copy link
Contributor

brian-kelley commented Oct 30, 2020

Tbh I think k_numRowEntries_ was always intended to be HostSpace, not HostMirror. For example, Mark left this comment, which definitely seems to think he meant it to not be CudaUVMSpace (i.e. MirrorView of CudaUVMSpace)

2030         // NOTE (mfh 08 May 2017) This is a host View, so it does not assume UVM.
2031         this->k_numRowEntries_(rowInfo.localRow) = newNumEnt;

@tasmith4
Copy link
Contributor Author

No I agree with that. I'm talking about k_rowPtrs_

@brian-kelley
Copy link
Contributor

k_rowPtrs_ definitely needs to live primarily on device, since that views the same memory as lclGraph_.row_map which is used in every performance-critical device thing.

@tasmith4
Copy link
Contributor Author

so k_rowPtrs_ is actually device pointers? Seems then that it doesn't make sense to copy them to host.

@brian-kelley
Copy link
Contributor

brian-kelley commented Oct 30, 2020

Once a matrix is fill-complete and in packed storage, I don't think any performance sensitive code should be using getRowInfo(row). This is OK as grepping through all of Trilinos, it seems to only be used inside CrsGraph/CrsMatrix for insertion into unpacked1D storage, which means it's not needing to access k_rowPtrs_.

Edit: the one exception seems to be transformLocalValues/transformGlobalValues. But these are not used at all in Trilinos either.

@cgcgcg
Copy link
Contributor

cgcgcg commented Oct 30, 2020

Just FYI, we have encountered this error in Galeri and the Tpetra matrix reader when insertGlobal is used. @lucbv is writing a deepcopy for matrices to work around this issue in the short term..

@tasmith4
Copy link
Contributor Author

@cgcgcg thanks for the heads-up.

@brian-kelley that wasn't my point. Seems to me that a pointer to device memory is invalid on host? But honestly I am a little confused about whether k_rowPtrs_ is indices into another view, or actual pointers to memory like I would expect from that name.

@brian-kelley
Copy link
Contributor

brian-kelley commented Oct 30, 2020

That array in CRS format can be called "row pointers" (like in Tpetra), "row map" (in Kokkos/KokkosKernels, which is not related to what Tpetra calls "row map") or "row offsets". These all mean the same thing, but they're not literally pointers to memory addresses. They are integer indices into lclGraph.entries, starting at 0 and going up to numLocalEntries.

I think "row pointers" is the best name for it in Tpetra because of the ambiguity with "row map".

@tasmith4
Copy link
Contributor Author

ok that makes sense. I think you're right then, when we need one of the row pointers we can just copy the individual one back.

@kddevin
Copy link
Contributor

kddevin commented Oct 30, 2020

Tbh I think k_numRowEntries_ was always intended to be HostSpace, not HostMirror. For example, Mark left this comment, which definitely seems to think he meant it to not be CudaUVMSpace (i.e. MirrorView of CudaUVMSpace)

2030         // NOTE (mfh 08 May 2017) This is a host View, so it does not assume UVM.
2031         this->k_numRowEntries_(rowInfo.localRow) = newNumEnt;

@kyungjoo-kim said that use of HostMirror was not good; it should just be Host.

@tasmith4
Copy link
Contributor Author

So actually I just found Tpetra::Details::getEntryOnHost -- seems like we can just replace this->k_rowPtrs_(index) with Tpetra::Details::getEntryOnHost(this->k_rowPtrs_, index) -- are there any downsides to this?

@brian-kelley
Copy link
Contributor

@tasmith4 You don't just need k_rowPtrs_(index), you also need k_rowPtrs_(index + 1) because the difference between them is the row length. Using a length-2 subview lets you get both values in one transfer.

@kddevin
Copy link
Contributor

kddevin commented Jun 29, 2021

Fixed through UVM removal refactoring

@kddevin kddevin closed this as completed Jun 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pkg: Tpetra type: bug The primary issue is a bug in Trilinos code or tests UVM removal
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants