
[REVIEW] mdspan integration. #437

Merged
merged 15 commits into from
Feb 22, 2022
Conversation

trivialfis
Member

@trivialfis trivialfis commented Jan 5, 2022

This is an early version of the integration: it doesn't have padding support yet, and the file is placed under the test directory. I'm opening a PR so that we can continue the design discussion with actual code to reference.

For mdspan, this currently uses a fork of mine, but the required changes are trivial, so it's integrated directly as a CMake module. Padding will add some more complexity, but I think we can implement it as a custom layout.

For mdarray, due to the operator(), the copy constructor, and CUDA stream support, I copied the reference implementation with some substantial changes (the license still needs to be added).

  • For the operator(), I have created a device_reference that's a simplified version of thrust::device_reference but supports the CUDA stream.
  • For supporting CUDA stream, I have removed most of the constructors to make sure there's a valid stream being passed into mdarray. Also, a new container policy is required to store the stream.
  • For the extent template parameter, I have changed the signature to match mdspan.
  • To integrate rmm allocator, I made a thin wrapper over rmm::device_uvector with a custom policy.
  • For the container policy, the reference implementation reuses container_policy as mdspan::accessor_policy. This is done with multiple overloads of the access methods, including reference access(container const& c) and reference access(pointer ptr), where the latter is a host/device function. I have split it into 2 policies, since the container policy might carry state that is unwanted in kernels, such as the CUDA stream.

Lastly, I have added host and device variants of both mdarray and mdspan.
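The proxy pattern behind the device_reference described above can be sketched in plain, host-only C++. This is a hypothetical illustration, not the PR's actual code: the real class stores an rmm::cuda_stream_view and moves bytes with cudaMemcpyAsync, while here plain host memory stands in for device memory so the sketch stays runnable.

```cpp
#include <cassert>

// Hypothetical, host-only sketch of the proxy-reference pattern behind
// device_reference: reads and writes go through explicit copy operations
// rather than a direct T& dereference.
template <typename T>
class device_reference {
 public:
  // The real constructor also takes an rmm::cuda_stream_view.
  explicit device_reference(T* ptr) : ptr_(ptr) {}

  // Read path: in the real implementation this is a device-to-host copy
  // ordered on the stored stream, followed by a stream sync.
  operator T() const { return *ptr_; }

  // Write path: in the real implementation, a host-to-device copy ordered
  // on the stored stream.
  device_reference& operator=(T const& v) {
    *ptr_ = v;
    return *this;
  }

 private:
  T* ptr_;  // conceptually points into device memory
};
```

The owning container's operator() would construct and return such a proxy instead of a raw T&, which is why a valid stream must be available at construction time.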

@cjnolet @divyegala @achirkin

Related:

Todos:

  • Finalize the design doc.
  • Add host/device accessor policy.
  • Add view type.
  • Change default accessor.
  • Add padding support.
  • Upstream changes to mdspan.
  • Add more tests.

@trivialfis trivialfis added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Jan 5, 2022
@trivialfis trivialfis requested review from a team as code owners January 5, 2022 13:32
@trivialfis trivialfis added feature request New feature or request non-breaking Non-breaking change labels Jan 6, 2022
@ajschmidt8
Member

Please consider using GitHub's Draft PR feature instead of WIP tags in the future. Draft PRs have the benefit of preventing notifications to codeowners until PRs are marked Ready for Review. CI will still run on Draft PRs.

Some useful information about Draft PRs:

Member

@ajschmidt8 ajschmidt8 left a comment

Approving ops-codeowner file changes

@trivialfis trivialfis marked this pull request as draft January 11, 2022 04:23
@trivialfis
Member Author

Thank you for sharing. Converted the PR to draft.

@jrhemstad

IMO, it's less important to exactly mirror mdarray. mdspan is the important vocabulary type to use at interface boundaries.

Specifically, I have created a device_ref that's a simplified version of thrust::device_reference but supports the CUDA stream.

Do we really think it is necessary/a good idea to have a host-side accessor to device memory that automatically copies D2H?

/**
 * @brief Indexing operator; use it sparingly since it triggers a device<->host copy.
 */
template <typename... IndexType>
auto operator()(IndexType&&... indices)

@rg20 rg20 Jan 12, 2022


Does it make sense to emit a compiler warning when this function is used for device arrays?

Member Author

@trivialfis trivialfis Jan 13, 2022


We can limit it to debug builds if necessary. I think it's fine, though; this is similar to thrust::device_vector.

@trivialfis
Member Author

@jrhemstad

IMO, it's less important to exactly mirror mdarray. mdspan is the important vocabulary type to use at interface boundaries.

Thanks for the suggestions.

Do we really think it is necessary/a good idea to have a host-side accessor to device memory that automatically copies D2H?

For now, we just want to mimic the interface of mdarray as much as possible to avoid reinventing the wheel. The feature is not that important beyond debugging purposes.

cpp/include/raft/linalg/mdarray.h (outdated; resolved)
@trivialfis
Member Author

trivialfis commented Jan 13, 2022

Sorry for the force push; I moved the file into raft/ and rebased onto the latest branch.

@trivialfis
Member Author

trivialfis commented Jan 13, 2022

Note to myself:

Build 11.0,driver-450,centos7,3.7 (Jan 12, 2022 10:06:24 PM)

Error: Internal Compiler Error (codegen): "there was an error in verifying the lgenfe output!"

11.2,driver-495,ubuntu18.04,3.8

05:06:31 i8** null
05:06:31 i8* %call14 = call i32 null(void (i8*)* null, i8* null, i8** null), !dbg !968
05:06:31 : parse Invalid record (Producer: 'LLVM7.0.1' Reader: 'LLVM 7.0.1')

11.5,driver-495,ubuntu20.04,3.8

21:09:53 ../test/mdarray.cu(39): error: identifier "__assert_fail" is undefined

Comment on lines 333 to 389
template <typename... IndexType>
auto operator()(IndexType&&... indices)


You could add an overload of operator() that takes a stream as its first argument in order to provide stream-ordered access.

This should work without causing an ambiguous overload because cuda_stream_view is strongly typed.

  template <typename... IndexType>
  auto operator()(rmm::cuda_stream_view stream, IndexType&&... indices)

Member Author


Thank you for the suggestion. Currently, the mdarray class leaves the CUDA stream to the container policy. Adding an extra stream parameter might conflict with the stream in the policy, and I'm not entirely sure whether we need an extra stream for the access operator.

Contributor

Then maybe we should add view() overloads that accept a new container policy as an argument, so that it would be easy to construct different views per stream.

auto view() const noexcept { return const_view_type(c_.data(), map_, cp_.view()); }

[[nodiscard]] constexpr auto size() const noexcept -> size_type { return this->view().size(); }

@rg20 rg20 Jan 13, 2022

A resize functionality would be helpful, though it would be tricky to promise std::vector::resize-like semantics: when resizing to a larger size, preserving the values at the lower indices is not trivial.

Member Author

We need to be concerned that the resize method might violate the shape of the mdarray.


Do you mean resizing a 3-dimensional array with a 2-dimensional map? If so, that can be made a compilation error, right?

Member

@rg20 are you suggesting resize would allocate a larger / smaller chunk of device memory (like std::vector::resize, which changes the data shape) or are you suggesting a numpy reshape where we change the shape but the amount of memory stays the same?

Won't changing the amount of memory allocated always have an impact on the resulting shape?


I am talking more of std::vector::resize. Consider an example where I have an mdarray as a member variable but I would only initialize it much later (I can do this because there is a default constructor). Typically, we would just do resize whenever we need (for std::vector or rmm::device_uvector).

We can still manage with the existing implementation because we can just create a new 'mdarray' and just move it. My suggestion is more of making it look like other containers.

Member

I'm guessing we could accomplish this with dynamic extents?


Ah, I did not realize that the extents in the current implementation are compile-time, and so is the C++ standard's specification.

We could only do this using dynamic extents. Is there a particular reason (other than the standard specification) the extents have to be compile time?


Is there a particular reason (other than the standard specification) the extents have to be compile time?

The spec is reason enough. It is that way because there are optimization benefits to extents being statically known when possible, e.g., for loop unrolling, vectorization, modulo optimization, etc.
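The trade-off described above can be sketched in a few lines of C++. This is an illustrative model only (the sentinel mirrors std::dynamic_extent; it is not the mdspan reference implementation): a static extent is baked into the type and costs no storage, while a dynamic extent must be stored and read at runtime.

```cpp
#include <cassert>
#include <cstddef>

// Sentinel mirroring std::dynamic_extent (illustrative stand-in).
inline constexpr std::size_t dynamic_extent = static_cast<std::size_t>(-1);

template <std::size_t Extent>
struct extent_holder {
  // Static extent: no storage needed, the size is a compile-time constant,
  // so loop bounds over it can be unrolled or vectorized.
  static constexpr std::size_t size() { return Extent; }
};

template <>
struct extent_holder<dynamic_extent> {
  std::size_t value;  // dynamic extent: must be carried at runtime
  constexpr std::size_t size() const { return value; }
};
```

This is why resizing only makes sense when the extent is dynamic: a static extent is part of the type itself.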


Looks like the extent itself can be dynamic and is supported by the spec and this implementation. So when the extent is dynamic, we could allow resizing.

Member

@rg20, I do agree with your general idea here. There are many times we allocate an empty array and resize it later, especially in the case of sparse formats, where we know we need 3 arrays but don't yet know the number of nonzeros.

@rg20

rg20 commented Jan 26, 2022

@rg20 @trivialfis

Proper resizing will be more useful for the sparse case, since those will always be vectors. One reason this is important is to ensure the sparse format always has a consistent and valid state: its 3 mdarray instance variables remain valid arrays, starting with a size of 0 and resizing once the number of nonzeros is known (by manipulating those instances directly instead of creating new instances). Even though the device memory underneath might change, the mdarray itself hides that from us and always presents usable memory up to the product of the extents.

Agreed!

I also think that a reshape functionality would be very useful. There are many cases where we may even reshape into a vector so we can perform some vector-wise operation on all the elements before reshaping back into a 2d matrix. If it's too hard to add reshaping to the mdspan without upstreaming the changes, we could also just use a helper function for this for now since the mdspan is so lightweight:

auto reshaped = raft::reshape(mdspan_obj, new_extents, stream);

Agreed! With reshape, we just need a runtime check if the total size is the same or not.
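The helper suggested above can be sketched concisely: since an mdspan is a non-owning view, "reshape" only has to rebuild the extents over the same pointer after the runtime check that the total element count is preserved. The types and the reshape name below are illustrative stand-ins, not RAFT's actual API.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Minimal 2-D extents and view types standing in for mdspan (illustrative).
struct extents2d {
  std::size_t rows, cols;
  constexpr std::size_t size() const { return rows * cols; }
};

template <typename T>
struct matrix_view {
  T* data;
  extents2d ext;
};

// Reshape: same memory, new mapping, guarded by a runtime size check.
template <typename T>
matrix_view<T> reshape(matrix_view<T> in, extents2d new_ext) {
  if (in.ext.size() != new_ext.size()) {
    throw std::invalid_argument("reshape: total element count must be preserved");
  }
  return matrix_view<T>{in.data, new_ext};
}
```

Because no data moves, the helper is cheap; only padded layouts (where the reshaped strides may not be expressible) would need more care.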

@trivialfis
Member Author

trivialfis commented Jan 26, 2022

I will write a document for the requested features after this PR is finished. The PR in its current state should be sufficient to replace raw pointers and rmm::device_uvector. Sorry for keeping the PR's scope smaller than desired. To me, array indexing and slicing are difficult and error-prone (that's why I want to eliminate the use of raw pointers), so I need a clear description of things like what to do when reshaping a padded array before implementing them. My apologies for the lack of efficiency.

detail::matrix_extent extents{n_rows, n_cols};
return host_matrix_view<ElementType>{ptr, extents};
}
/**
Member

Can we provide an option with a default here to allow for both c- and f-contiguous layout?

Member Author

Added the optional layout parameter, along with a test showing how to check whether an input is c/f-contiguous. Hope that helps.

Also, due to the implementation of submdspan, the returned slice has default_accessor instead of our host/device-tagged accessor. I will investigate further; for now, it is a minor inconvenience.

Member

This looks great! Though, I might suggest we expose a version of the layout to the public API so that users don't have to call into the detail namespace. If we can, it would also be nice to rename the public version to "layout::colmajor" and "layout::rowmajor" (maybe just a simple typedef or "using" would be enough?) to make it easier on the users.

Member Author

@cjnolet There are additional details in here.

I might suggest we expose a version of the layout to the public API so that users don't have to call into the detail namespace

Agreed, I will expose some more things from mdspan ref impl. This will replace stdex with raft. (no detail in here so I'm not sure if I understand you correctly).

If we can, it would also be nice to rename the public version to "layout::colmajor" and "layout::rowmajor"

This might be misleading (correct me if I'm wrong). According to the design of stdex::mdspan, stdex::layout_right is c-contiguous, which implies both row-major and contiguous. An array can be row-major but not contiguous. From that perspective, I think it might be better to stick to the mdspan naming.
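The distinction drawn above is easy to show with an index calculation. A 2x2 window into a 4x4 row-major matrix is still row-major, but its row stride (4) is larger than its column count (2), so its elements are not contiguous in memory; mdspan expresses such views with layout_stride, while layout_right additionally guarantees stride == column count. The helper below is purely illustrative.

```cpp
#include <cassert>
#include <cstddef>

// Linear index of element (i, j) in a row-major view with an explicit
// row stride. Contiguous (layout_right) is the special case
// row_stride == number of columns.
constexpr std::size_t strided_index(std::size_t i,
                                    std::size_t j,
                                    std::size_t row_stride) {
  return i * row_stride + j;
}
```

For the 2x2 window, element (1, 0) lands 4 slots into the buffer rather than 2, leaving a gap of untouched elements between rows.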

Member Author

@trivialfis trivialfis Jan 28, 2022

I saw your PR on using it for distance metrics. I think it would be better if we could pass the mdspan all the way down to the kernels, being careful about extent and stride when using it with cuBLAS. Converting a span directly to a pointer is dangerous.

Member

@cjnolet cjnolet Jan 28, 2022

(no detail in here so I'm not sure if I understand you correctly).

I'm referring to detail::stdex::layout_right here.

This might be misleading.

I'm proposing both to be used. Many developers are going to come into this API understanding things like row-/col-major and c/f-contiguity but not understanding the right vs left verbiage. We can cover both cases if we provide both here. We can constrain the API if needed, but if we plan to support non-contiguous layouts in the future, I agree the names could best reflect that.

I think it would be better if we can pass the mdspan all the way down to the kernels and be careful around the extent and stride when using it with cublas.

Thanks for taking a look at the PR! It was intended to initiate discussions like this. RAFT's primitives are pretty extensive at this point, and rewriting all of the kernels up front to suit this new API would be a massive undertaking. I understand the mdspan's underlying indexing can be very flexible, but I'd like to take a little step back and think through ways we can provide this functionality incrementally (within RAFT) without rewriting the entire library. With contiguity assumed (as it is currently), the row- vs col-major cases seem like a pretty simple introduction to using the mdspan as a facade. Shouldn't we be able to constrain these cases by testing the layout (potentially against the size of the pointer)? I think as we expand the kernels to use the mdspan directly for indexing, the API will become more flexible. For now, all of RAFT assumes pointers with contiguous layouts anyway.

Member Author

Added type aliases along with brief documentation.

@cjnolet
Member

cjnolet commented Jan 26, 2022

Sorry for keeping the PR's scope to be smaller than desired. To me array indexing and slicing are difficult and error prone (that's why I want to eliminate the use of raw pointer), I need to have a clear description of things like what to do when reshaping a padded array before implementing them. My apologies for the lack of efficiency.

@trivialfis, your work on the new array and span abstractions is critical both to the success of RAFT's new 22.04 API and to many other projects across the C++ layer of the RAPIDS analytics stack. Since just about all of RAFT's primitives currently take raw pointers, I think we can initially expose what you've got so far in the public API while we continue to iterate and improve its use in the lower-level bits (as an example, the existing pairwise_distances function can go from pairwise_distances(handle, in_ptr, m, n, k, out_ptr, metric); to pairwise_distances(handle, in_mdspan, out_mdspan, metric); and can delegate to the existing function in the detail namespace).

Since the mdarray/mdspan/span are so far along, my suggestion would be to capture some of the more significant remaining features (e.g. padding, resizing, flexible indexing types) in Github issues.
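The delegation idea described in the comment above can be illustrated with a toy routine. Everything here is a hypothetical stand-in (the view type, extents, and scale_inplace are not RAFT's API, and pairwise_distances itself is not reproduced); the point is only the facade shape: the view-based overload unpacks extents and a raw pointer, then calls the existing pointer-based implementation.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative stand-ins for an mdspan-like view (not RAFT's actual types).
struct extents2d { std::size_t rows, cols; };

template <typename T>
struct matrix_view {
  T* data;
  extents2d ext;
};

// Stand-in for an existing detail-namespace, pointer-based primitive.
template <typename T>
void scale_inplace(T* in, std::size_t m, std::size_t n, T factor) {
  for (std::size_t i = 0; i < m * n; ++i) in[i] *= factor;
}

// New public overload: callers pass a view, no loose dimension arguments,
// and the body simply delegates to the pointer-based routine.
template <typename T>
void scale_inplace(matrix_view<T> in, T factor) {
  scale_inplace(in.data, in.ext.rows, in.ext.cols, factor);
}
```

Because the view-based overload is a thin shim, primitives can be migrated to the new API incrementally without rewriting their kernels up front.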

cpp/include/raft/mdarray.hpp (resolved)
BUILD_EXPORT_SET raft-exports
INSTALL_EXPORT_SET raft-exports
CPM_ARGS
GIT_REPOSITORY https://github.com/trivialfis/mdspan
Member

We might also want to consider whether the mdspan fork should be moved into an official rapidsai repository for the time being.


@trivialfis
Member Author

The last remaining thing to do for this PR is to decide how to handle the upstream changes. There are two commits specifically:

trivialfis/mdspan@0193f07: Fix a compiler issue.
trivialfis/mdspan@f18e694: Avoid size_t as index type.

I think the first one should be easy to upstream, not sure about the second one yet.

@cjnolet
Member

cjnolet commented Feb 7, 2022

I think the first one should be easy to upstream, not sure about the second one yet

@trivialfis, I agree and I think this PR doesn't need to rely on the second item. I think we can open issues for the indexing types, padding, and resize/reshape capabilities as follow-on items and get this PR in after the compiler issue has been merged upstream. What do you think?

@trivialfis trivialfis requested review from a team as code owners February 14, 2022 19:28
@ajschmidt8
Member

rerun tests

@cjnolet
Member

cjnolet commented Feb 18, 2022

rerun tests

Member

@cjnolet cjnolet left a comment

LGTM!

@cjnolet
Member

cjnolet commented Feb 22, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 57703c5 into rapidsai:branch-22.04 Feb 22, 2022
Labels
CMake cpp feature request New feature or request non-breaking Non-breaking change

10 participants