Use thread_local var's in FCL distanceCallback() #2698
Conversation
Codecov Report

@@ Coverage Diff @@
##           master    #2698    +/-  ##
==========================================
+ Coverage   60.62%   60.64%   +0.02%
==========================================
  Files         402      402
  Lines       29615    29624       +9
==========================================
+ Hits        17952    17962      +10
+ Misses      11663    11662       -1

Continue to review full report at Codecov.
@v4hn I really appreciate the comments and the level of detail you put into them! I stopped using the bio_ik control pipeline to collect benchmarking data. It was too flaky -- too many moving pieces. Instead I dropped a std::chrono high-res clock right in. Results are good! Average duration drops from 8.74 to 4.22 microseconds! I think one of the main culprits is that pair of strings: allocating a pair of strings is probably expensive! Spreadsheet attached and I'll update the PR description again.
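To make the idea concrete, here is a minimal sketch (identifiers are hypothetical, not the actual MoveIt code) of the pattern under discussion: caching a `std::pair` of strings in a `thread_local` so the strings' heap buffers can be reused across calls instead of reallocated every time:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Fresh construction: two string allocations (for long enough names) on every call.
std::pair<std::string, std::string> makePairFresh(const std::string& a, const std::string& b)
{
  return { a, b };
}

// Cached variant: the thread_local pair persists across calls, so assignment
// can reuse the strings' existing capacity instead of reallocating.
// Caveat: every field must be fully overwritten on each call, or stale
// state leaks between invocations.
const std::pair<std::string, std::string>& makePairCached(const std::string& a, const std::string& b)
{
  thread_local std::pair<std::string, std::string> cached;
  cached.first = a;
  cached.second = b;
  return cached;
}
```

Whether this actually wins depends on string lengths (small-string optimization can make the fresh version allocation-free) and on how hot the call site is.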
Hey, saw this one because you referred to my PR ;)
Similar to @v4hn I'd be very cautious regarding this optimization. I'm not sure you kept all the behavior intact, since it's very easy to miss something (see the problems I found below, and there might be more).
All in all, stack variables are very cheap and should nearly always be cheaper than accessing the heap / static variables, since the stack should always be in the cache. Allocating on the stack is also very cheap; it's just incrementing the stack pointer. So in my mind there should be little reason to turn primitive types such as bool or int into thread_local variables. (But who knows, computers are complicated beasts ^^)

For types with costly constructors there may be some argument, but I don't see those here. The std::pair gets completely overwritten on every call anyways, so there shouldn't be any benefit.
Maybe we can find out (e.g. through trial and error) which change exactly makes it faster in your setup?
I can recommend https://github.com/KDAB/hotspot (open source) or Intel VTune (paid, but free for academics and such) to profile the original function in more detail and see which operations are actually costly in this function.
Additionally, I just looked at your measurements: in the "master" case the reported timings have an average of 4.13 µs for the first half (which is the same as or below "this PR"), and in the second half they suddenly go up. That is somewhat suspicious. Maybe something else was running at the same time? Maybe your CPU was throttling? Maybe the collision scene changed? Just for reference again, here is the small benchmark I wrote for the other PR:
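For reference, a rough sketch of the kind of inline timing being discussed (the PR used `std::chrono::high_resolution_clock`; `steady_clock` is the usual choice for interval timing since it cannot jump backwards). This harness is illustrative only, not the benchmark referenced above:

```cpp
#include <cassert>
#include <chrono>

// Times fn() over `iterations` runs and returns the mean duration in
// microseconds. A warm-up pass and a pinned CPU frequency would make the
// numbers less noisy (throttling was one suspected cause of the drift in
// the second half of the measurements).
template <typename Fn>
double meanMicroseconds(Fn&& fn, int iterations)
{
  using clock = std::chrono::steady_clock;
  const auto start = clock::now();
  for (int i = 0; i < iterations; ++i)
    fn();
  const auto stop = clock::now();
  const double total_ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
  return total_ns / iterations / 1000.0;
}
```

A sampling profiler like hotspot or VTune still gives far more insight than a single aggregate number, since it attributes time to individual operations inside the function.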
Thanks @xqms. I get it -- it will be tough to convince you guys these changes are necessary, and maybe I went overboard. Will try to benchmark in more detail using one of the methods you guys have described. Hopefully we all agree that it's not good practice to declare variables in functions that could be used in realtime control applications. Thus, something needs to be improved here.
No worries! It's an interesting topic. I hope I didn't go overboard with my review ;)
I'm curious, why? Local variables on the stack have zero allocation time (the stack pointer is usually incremented once on function entry and this needs to be done anyway). So I'd see no problem with local variables in general. But maybe I'm missing something. If there is something the constructor of the variable is allocating on the heap and there is the possibility to re-use that space in the next call, then I guess there is some potential for improvement. I see that only for the …
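The "re-use that space in the next call" idea can be illustrated with `std::string`: on common standard-library implementations (libstdc++, libc++), `clear()` keeps the existing allocation, so refilling a persistent buffer avoids touching the heap again. Note this is observed implementation behavior, not a standard guarantee:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if the string kept at least its old capacity after clear().
// The standard only guarantees capacity() >= size(); retaining the
// allocation across clear() is what common implementations do in practice.
bool capacityRetainedAfterClear(std::size_t n)
{
  std::string s(n, 'x');
  const std::size_t before = s.capacity();
  s.clear();  // size becomes 0; capacity is typically unchanged
  return s.capacity() >= before;
}
```

This is exactly why a persistent (e.g. thread_local) heap-owning variable can beat a fresh local one, and why it buys nothing for plain `bool` or `double`.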
This function is huge! I think you will get more rewarding results by analyzing the whole operation and looking for things that could be handled more efficiently on a higher level. Besides `hotspot`, which I really recommend (I used it to figure out the thread-local cache of FCL collision objects further down that file), you can also try an instrumenting profiler like perfetto https://perfetto.dev/ to get a time-resolved view of what's happening. From my memory there are some points worth looking at:
Specifically in this function I also see something towards the end … Also, you haven't stated your high-level goal: if you want "pre-impact" collision checking (a boolean check whether distance < threshold) in a real-time setup, maybe we need a special variant of that callback function that exits early? Or maybe you can already achieve that with the flags?
I'm grateful for your comments! There is no "going overboard" w.r.t. patches on runtime optimization. 👍
No, I agree. As I said above, individual values will often just end up in registers and not even on the stack...

That's not just a potential improvement. Avoiding allocations is probably what @AndyZe means anyway.
Well, I've now benchmarked this repeatedly with code that is very similar to what @xqms shared (attached). So, benchmarking results for 10,000 iterations. This is distance checking for a Panda arm that is in collision with a box.

master: 1.81 ms/iteration

The POD variables (double, bool) did not seem to make any difference, whether they were thread-local or not. So you guys were right in that regard.

BTW @xqms, you may be interested to know that … runs much faster than … It is counterintuitive that a GLOBAL check runs faster than a single check.
Perhaps we can get this merged soon as a small, incremental improvement?
You can simply download the AppImage.
Thanks for the tip @simonschmeisser! hotspot is a nice tool. I started issue #2714 where we can really dive into the analysis. I suspect it's going to be a long discussion and I don't really want this PR to get dragged into the muck.
The .clear() function is documented here: https://github.com/ros-planning/moveit/blob/master/moveit_core/collision_detection/include/moveit/collision_detection/collision_common.h Everything that the .clear() call does is already covered by the following lines.
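The redundancy argument can be sketched with a hypothetical stand-in for the result struct (illustrative only, not the real `fcl::DistanceResult`): an explicit `clear()` is harmless but redundant when the code that follows unconditionally overwrites every field:

```cpp
#include <cassert>
#include <limits>
#include <string>

// Hypothetical result type for illustration only.
struct DistanceResultLike
{
  double min_distance = std::numeric_limits<double>::max();
  std::string nearest_object;

  void clear()
  {
    min_distance = std::numeric_limits<double>::max();
    nearest_object.clear();
  }
};

// If the callback unconditionally assigns both fields, calling clear()
// beforehand does not change the outcome. The trade-off: clear() documents
// intent and protects against a future field being added and missed.
void fillResult(DistanceResultLike& res, double d, const std::string& name)
{
  res.min_distance = d;
  res.nearest_object = name;
}
```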
Interesting! I think one reason could be that FCL can apply more early exits to the …
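One hedged way to picture the early-exit intuition (a generic sketch, not FCL's actual broadphase traversal): a query that only asks "is anything closer than d?" can stop at the first qualifying candidate, while an exact minimum-distance query has to examine every candidate pair:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Thresholded query: may return after the first qualifying pair.
bool anyCloserThan(const std::vector<double>& distances, double threshold)
{
  for (double d : distances)
    if (d < threshold)
      return true;  // early exit: remaining pairs never examined
  return false;
}

// Exact query: must scan everything to find the true minimum.
double exactMinimum(const std::vector<double>& distances)
{
  return *std::min_element(distances.begin(), distances.end());
}
```

A "global" check that can discard whole subtrees of a bounding-volume hierarchy this way may well beat a query that is forced to compute one exact distance.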
Reading through this change I think it is a nice improvement. We have tested this quite a bit on a customer project and it seems to work as intended.
@v4hn I requested a review from you because you requested changes in your previous review. I believe your comments have been addressed, and Andy now correctly cleans up the reused memory before using it.
I didn't check the semantic code changes. Formally, there are still some stack-allocatable data types declared as thread_local.

Sorry, I did not spend more time on this yet and dismissed my stale review (you could have done that too, @tylerjw).
@rhaschke Thanks! Run time after your changes dropped a tiny bit. I think it's because clearing the fcl::DistanceResultd wasn't necessary anymore. Runtime change: 0.113 ms/iteration --> 0.110 ms/iteration.
Let's take a minute to recap. For my test case, this PR drops runtime from 1.667 ms/iteration to 0.110 ms/iteration. That's a 93% drop. Wow! It makes me happy to think of all the saved CPU cycles. Thanks everybody.
Generally, I approve. See AndyZe#2 for some more improvements.
Please address the open issues, i.e. add a comment or a dist_result.clear() call.
@AndyZe Do you remember your collision_detection::CollisionRequest settings? Here is what I'm using:

# collision_detection::CollisionRequest
distance: True            # Compute proximity distance (slow)
cost: False               # Compute collision cost
contacts: True            # Compute contacts
max_contacts: 1           # Maximum number of contacts
max_contacts_per_pair: 1  # Maximum number of contacts per pair of bodies
max_cost_sources: 1       # Defines how many of the top cost sources should be returned
verbose: False            # Report information about collision
How many boxes were in the scene? I'm testing regression across MoveIt versions, specifically on collision checks, so this would be a good benchmark to validate the optimisation of this PR.
I remember it was just one box in the scene. Not that you need to copy what I did. I don't remember the other parameters.
While reviewing #2682 I noticed there are some variables declared inside the FCL distanceCallback. That's pretty bad, performance-wise, for a function that should be fast.
This simply makes these variables thread_local so they don't have to be allocated with every callback. I see a 52% reduction in distanceCallback duration. The spreadsheet is attached and you can see how it was tested in the commit history (std::chrono::high_resolution_clock). Mean duration drops from 8.75 microseconds to 4.23 microseconds. Worst-case duration drops from 43 us to 21 us. I can give PickNik'rs guidance on how to reproduce the test setup.
print_of_distance_callback.ods
I think there's lots of opportunity to do this type of improvement in other places within the collision-checking code.