-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimize collision detection #235
Conversation
Codecov ReportBase: 92.12% // Head: 92.13% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## master #235 +/- ##
==========================================
+ Coverage 92.12% 92.13% +0.01%
==========================================
Files 32 32
Lines 3429 3421 -8
==========================================
- Hits 3159 3152 -7
+ Misses 270 269 -1
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very interesting, thanks!
src/collider/src/collider.cpp
Outdated
@@ -157,26 +157,28 @@ struct CreateRadixTree { | |||
} | |||
}; | |||
|
|||
template <typename T> | |||
template <typename T, const bool preallocate> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: maybe allocateOnly
?
int traverse2 = RecordCollision(child2, query); | ||
if (traverse2 < 0) return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this no longer useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because RecordCollision
can no longer do early exit due to insufficient memory, it will calculate the number of elements and allocate exactly that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right; I'd forgotten about that. Thanks!
// actually recording collisions | ||
for_each_n( | ||
policy, zip(querriesIn.cbegin(), countAt(0)), querriesIn.size(), | ||
FindCollisions<T, false>({querryTri.ptrDpq(), counts.ptrD(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little surprising to me that running this function twice is faster than running it once simply because of removing the atomics. Would you mind doing a touch more perf comparison? I'd like to know how it compares for CPU and GPU, for meshes with lots of intersections (menger sponge is good) and few (our perf spheres are good).
int maxOverlaps = querriesIn.size() * 4; | ||
SparseIndices querryTri(maxOverlaps); | ||
int nOverlaps = 0; | ||
while (1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's certainly nice to remove this; I use similar logic in LevelSet
; I'm curious if the same applies, or if it becomes a perf problem when the SDF function is computationally heavy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry forgot to address this. I think it should be possible, but will need some tests to see if there is significant performance improvement.
Regarding performance: I measured the last iteration kernel launch for the older version and each kernel launch for the newer version. diff --git a/src/collider/src/collider.cpp b/src/collider/src/collider.cpp
index c13f33d..9f0b816 100644
--- a/src/collider/src/collider.cpp
+++ b/src/collider/src/collider.cpp
@@ -21,6 +21,8 @@
#include <intrin.h>
#endif
+#include <chrono>
+
// Adjustable parameters
constexpr int kInitialLength = 128;
constexpr int kLengthMultiple = 4;
@@ -277,18 +279,45 @@ SparseIndices Collider::Collisions(const VecDH<T>& querriesIn) const {
auto policy = autoPolicy(querriesIn.size());
// compute the number of collisions to determine the size for allocation and
// offset, this avoids the need for atomic
+ auto start = std::chrono::high_resolution_clock::now();
for_each_n(policy, zip(querriesIn.cbegin(), countAt(0)), querriesIn.size(),
FindCollisions<T, true>(
{thrust::pair<int*, int*>(nullptr, nullptr), counts.ptrD(),
nodeBBox_.ptrD(), internalChildren_.ptrD()}));
+ auto allocateEnd = std::chrono::high_resolution_clock::now();
// compute start index for each query and total count
- exclusive_scan(policy, counts.begin(), counts.end(), counts.begin());
+ exclusive_scan(ExecutionPolicy::Par, counts.begin(), counts.end(),
+ counts.begin());
SparseIndices querryTri(counts.back());
+ auto scanEnd = std::chrono::high_resolution_clock::now();
// actually recording collisions
for_each_n(
policy, zip(querriesIn.cbegin(), countAt(0)), querriesIn.size(),
FindCollisions<T, false>({querryTri.ptrDpq(), counts.ptrD(),
nodeBBox_.ptrD(), internalChildren_.ptrD()}));
+ auto opEnd = std::chrono::high_resolution_clock::now();
+
+ std::cout << "First pass:"
+ << std::chrono::duration_cast<std::chrono::microseconds>(
+ allocateEnd - start)
+ .count()
+ << "us" << std::endl;
+ std::cout << "Exclusive scan: "
+ << std::chrono::duration_cast<std::chrono::microseconds>(
+ scanEnd - allocateEnd)
+ .count()
+ << "us" << std::endl;
+ std::cout << "Second pass: "
+ << std::chrono::duration_cast<std::chrono::microseconds>(opEnd -
+ scanEnd)
+ .count()
+ << "us" << std::endl;
+ std::cout << "Total: "
+ << std::chrono::duration_cast<std::chrono::microseconds>(opEnd -
+ start)
+ .count()
+ << "us" << std::endl
+ << std::endl;
return querryTri;
} and diff --git a/src/collider/src/collider.cpp b/src/collider/src/collider.cpp
index 9fd62f7..d9b1399 100644
--- a/src/collider/src/collider.cpp
+++ b/src/collider/src/collider.cpp
@@ -21,6 +21,8 @@
#include <intrin.h>
#endif
+#include <chrono>
+
// Adjustable parameters
constexpr int kInitialLength = 128;
constexpr int kLengthMultiple = 4;
@@ -271,15 +273,20 @@ SparseIndices Collider::Collisions(const VecDH<T>& querriesIn) const {
int maxOverlaps = querriesIn.size() * 4;
SparseIndices querryTri(maxOverlaps);
int nOverlaps = 0;
+
+ long us = 0;
while (1) {
// scalar number of overlaps found
VecDH<int> nOverlapsD(1, 0);
// calculate Bounding Box overlaps
+ auto start = std::chrono::high_resolution_clock::now();
for_each_n(
autoPolicy(querriesIn.size()), zip(querriesIn.cbegin(), countAt(0)),
querriesIn.size(),
FindCollisions<T>({querryTri.ptrDpq(), nOverlapsD.ptrD(), maxOverlaps,
nodeBBox_.ptrD(), internalChildren_.ptrD()}));
+ auto end = std::chrono::high_resolution_clock::now();
+ us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
nOverlaps = nOverlapsD[0];
if (nOverlaps <= maxOverlaps)
break;
@@ -293,6 +300,7 @@ SparseIndices Collider::Collisions(const VecDH<T>& querriesIn) const {
querryTri.Resize(maxOverlaps);
}
}
+ std::cout << "Last pass: " << us << "us" << std::endl;
// remove unused part of array
querryTri.Resize(nOverlaps);
return querryTri; Measurements: Measurement for
TLDR: The newer version is a bit slower for |
Reduces the number of atomic instructions needed and only perform allocation once. This improves the performance for complicated models by ~10%.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the perf comparison; looks great!
int traverse2 = RecordCollision(child2, query); | ||
if (traverse2 < 0) return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right; I'd forgotten about that. Thanks!
collider: improve performance Reduces the number of atomic instructions needed and only perform allocation once. This improves the performance for complicated models by ~10%.
Reduces the number of atomic instructions needed and only perform allocation once. This improves the performance for complicated models by ~10%. The idea is to perform the collision detection in 2 phases: the first one for determining the number of collisions for each query, the second one actually store the collision information. Exclusive scan is used for computing the total count and index offset after the first phase, so we can get unique indices without atomic.
(just thought of this when I am lying on bed, it turns out it really provides some performance improvement)