universal vector #121
Conversation
Excellent job working around thrust's bug. The improvements here in performance, memory, and code quality are huge! Let's clean it up a little and merge it.
```diff
-VecDH<bool> wholeHalfedgeP(inP_.halfedge_.size(), true);
-VecDH<bool> wholeHalfedgeQ(inQ_.halfedge_.size(), true);
+VecDH<char> wholeHalfedgeP(inP_.halfedge_.size(), true);
+VecDH<char> wholeHalfedgeQ(inQ_.halfedge_.size(), true);
```
I suppose because CUDA doesn't support addressing single bits? Works for me.
No, this is because std::vector<T> is used for the cache, and you know the problem with std::vector<bool>.
```cpp
 * performance.
 * Note that it is *NOT SAFE* to first obtain a host(device) pointer, perform
 * some device(host) modification, and then read the host(device) pointer again
 * (on the same vector). The memory will be inconsistent in that case.
```
👍
I'm also wondering whether it would be good to add a simple wrapper over some thrust functions like for_each that chooses whether to run on the host or the device. As we use unified memory now, it is safe to run them on the host if they can be run on the device.
A wrapper for a dynamic backend could be cool. How would you choose between device and host? Length of vector? Let's save that for a follow-on PR.
For the dynamic backend, the idea is just to use an enum for the backend tag and fall back to a supported backend if the one specified is not available. It seems that thrust uses a custom type for each execution policy, so we can't do this with their API; we have to write some templates to generate code for the different backends and select the appropriate one.
Force-pushed from e0be1d5 to dbbcdbb
One issue with the CUDA backend: it seems that allocation is generating a lot of page faults that require IO. Investigating...
```cpp
#include <iostream>
#include <thrust/universal_vector.h>
#include <thrust/execution_policy.h>
#include "structs.h"
```
do we need structs or iostream in here?
```diff
@@ -18,6 +18,7 @@
 #include <thrust/remove.h>
 #include <thrust/sort.h>
 #include <thrust/unique.h>
+#include <thrust/execution_policy.h>
```
nit: can remove this too
Looking great, thank you!
Use universal vector to implement VecDH. Closes #120. Gives some performance improvement, reduced memory overhead, and allows the GPU to do demand paging, i.e. it will not OOM when GPU memory is insufficient.

Note: thrust::universal_vector<T>::push_back is very slow; I already filed an issue upstream. I work around this problem by implementing a cache that is only used when we push elements to a vector or reserve memory.

... universal_vector header. I haven't yet tried compiling using the thrust in the submodule with an older version of CUDA; not sure if that is going to work.

Benchmark (note that my computer only has 8GB of RAM, and the nTri=8388608 case was using swap and was very slow for the previous version, so the improvement is not actually that high):