Provides a simple way to selectively turn on/off data processors #46
Conversation
To make sure I understand properly: the issue you have is that, because a packet is buffered by multiple processors, there's an additional delay per processor type? If you did that test and you see this result, sounds good, I can take a look. I might argue plugins would be easier than this mask work, but I don't have an issue with the mask. Plugins would just let us fix this problem once, so it wouldn't be an issue or need updates every time a new processor was added. Just load what you need.
I think what is actually happening is cache thrashing, and it is a result of how the data are delivered from the LiDAR and subsequently constructed in the driver. This is a legacy that carries back to the ROS1 driver, and it really needs to be fixed. To do the investigation and make the root-cause fix, I need this PR in so that I can turn things on and off and work productively. I also may need to get Ouster involved, not sure yet. Here are some details.

The core data structures in the driver are, as you would expect (let's assume I'm in 1024x10 mode), 64 x 1024 (rows x cols). As would be natural, in the Image Processor, 64 x 1024 images are being constructed and stored in memory in row-major order. However, the LiDAR is delivering the data in column-major order. With every pixel that is being populated in the image there is a cache miss, and for every loop iteration 4 images are being populated ... 4 cache misses.

This becomes most evident if you look at an organized point cloud from the driver. While the shape of the PointCloud is 64 x 1024, the data are not stored as you would expect. The first column is actually stored in the first 64 "cells" of the first row, then the next 64 cells of the first row are the second column, etc. It makes no sense and it is killing performance. It would make more sense if the data were organized as 1024 x 64; then we could at least do one matrix transpose to reshape the data properly, but that is not how the data are being structured. While a transpose would cost us a data copy (of the whole matrix), it is better than what is currently happening.

All of that said, this PR basically provides a tool for me to do that much harder, deeper work. It also allows for more cleanly adding new data processors, which we want. As you know, I want to push out the unit vectors and distance image, which I think should be managed in a separate data processor. This gives me the framework to do that.

So, to get back to your questions directly: it certainly doesn't help that multiple processors loop over the data for each packet, but I really don't think that is the root cause. I think it exacerbates it, because all of the looping over the pixels fights with the CPU cache machinery rather than exploiting it. This is my hypothesis.
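For illustration only, here is a minimal sketch of the access pattern being described (the sizes and names are assumptions, not the driver's actual code): the sensor hands us one column -- all 64 rings for a single azimuth step -- at a time, while the destination images are stored row-major, so consecutive writes land a full row apart.

```cpp
#include <vector>

// Hypothetical sizes for an OS1-64 in 1024x10 mode; not the driver's real types.
constexpr int H = 64;    // rings (image rows)
constexpr int W = 1024;  // azimuth steps (image columns)

// Row-major image buffers, as the Image Processor stores them today.
std::vector<float> range_img(H * W);
std::vector<float> intensity_img(H * W);

// The LiDAR delivers data column-by-column: for one azimuth step `col`,
// we receive all H ring measurements at once.
void on_column(int col, const float * rings_range, const float * rings_intensity)
{
  for (int row = 0; row < H; ++row) {
    // Consecutive iterations write elements W floats apart, so nearly every
    // store touches a different cache line -- and this happens once per
    // destination image, multiplying the misses.
    range_img[row * W + col] = rings_range[row];
    intensity_img[row * W + col] = rings_intensity[row];
  }
}
```

Writing into column-ordered staging buffers (or transposing once per scan) would turn those scattered stores into sequential ones.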
One way to see what is going on is to look at the timestamp image. You'll notice the gradient of the image goes from violet to red from top to bottom. Recall, the pixel data in this image are time. This would make one think that the LiDAR scans from top to bottom, exposing an entire ring of the receiver at a time. It does not do that. It exposes a column at a time as it sweeps in its circle. What we would actually expect in this image is the gradient changing from left to right. The facts are: when you deconstruct the point cloud, it is populating an array that is shaped as we would expect, 64 x 1024 (for example), but populating it by column, transposing the columns, and laying them into the rows of the matrix. It is bizarre.
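As a concrete illustration of that layout (a hypothetical helper, assuming a 64 x 1024 cloud and the column-first packing described above), this is roughly how one would recover which measurement actually sits at a given (row, col) of the organized cloud as it is stored today:

```cpp
#include <cstddef>
#include <utility>

constexpr std::size_t H = 64;    // rings
constexpr std::size_t W = 1024;  // azimuth steps

// The cloud advertises shape H x W (row-major), but the points are packed
// column-first: azimuth 0's 64 rings occupy flat indices [0, 64),
// azimuth 1's occupy [64, 128), and so on.
constexpr std::size_t stored_flat_index(std::size_t ring, std::size_t azimuth)
{
  return azimuth * H + ring;
}

// Given the (row, col) a consumer *thinks* it is reading from the H x W
// row-major cloud, return the {ring, azimuth} of the measurement that is
// actually stored there.
std::pair<std::size_t, std::size_t> actual_ring_azimuth(std::size_t row, std::size_t col)
{
  const std::size_t flat = row * W + col;
  return {flat % H, flat / H};
}
```

With this mapping, cells (0, 0) through (0, 63) of the first row come back as azimuth 0, rings 0 through 63 -- i.e., the first column -- which matches the behavior described above.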
This is correct. The only running processes in all tests were the Ouster driver and the consuming application.
From your first message, my understanding is then, for the image population: if we were to update the lambda to store the data that way and then copy it over to the images, would that largely help? Either way, I don't have a problem with this PR; I was trying to understand. The last question I have is whether you think a mask is the way to go rather than implementing the plugins? That would be a "once and for all" fix to this problem by just including the processor plugins you want at runtime. It would kill two birds with one stone. You've clearly gone through some effort in this PR, so I'd be OK reviewing / merging this implementation if you prefer. Though I imagine in some ...
I think there are several ways it can be addressed, and we will certainly have to measure so that we can properly evaluate. However, all things being equal, since we are batching a scan, it would be nice if at the lowest layers of the library (packets or byte buffers) we delivered the data to the processors in row-major order. Then the current and new processors would not all have to deal with this artifact. To be clear, this would be a breaking change for anyone who is doing lower-level work with the Point Cloud (for example). I define lower-level as not just treating it as a set of Cartesian points, but actually relying on how the data are shaped and stored today in the array storage that backs the Point Cloud.
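As a rough sketch of that "fix it once, at the lowest layer" idea (names are hypothetical; this is not a proposal for the exact API), the batched scan could be reordered into row-major once, so every processor downstream just sees a normal 64 x 1024 layout:

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t H = 64;    // rings
constexpr std::size_t W = 1024;  // azimuth steps

// `staged` holds one full scan in the order the sensor delivers it:
// W columns of H values each. One transpose per scan replaces the strided
// writes that every processor currently performs on every pixel.
std::vector<float> to_row_major(const std::vector<float> & staged)
{
  std::vector<float> out(H * W);
  for (std::size_t row = 0; row < H; ++row) {
    for (std::size_t col = 0; col < W; ++col) {
      // Sequential writes into `out`; the strided reads from `staged`
      // are paid once here instead of in every processor.
      out[row * W + col] = staged[col * H + row];
    }
  }
  return out;
}
```

The copy isn't free, but it is paid once per scan rather than on every pixel of every image and cloud being built.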
I also like the idea of plugins, but it is not without its own trade-offs, several of which are enumerated in the earlier comments of this PR; those are the source of my current hesitation with plugins. Bottom line: I like the idea of plugins, but maybe this simpler mask approach can get put to use today so that we can realize the benefits of selectively turning the data processors on/off. Then we take a more holistic look at the architecture, to include revisiting this topic. That is my gut feel on this today.
For reference, plugins add no additional complications at launch / components / lifecycle. They don't interact with the system at that level. You essentially just have a plugin loader object that is given the string name of a plugin; it finds the library (if it exists) and makes an instance if possible. From a programming perspective, it's really not much different than creating a set of base-class pointers to derived classes and putting those pointers into the existing container. But like I said, this is fine. If we add a new processor, though, things will start getting more and more messy with the masks. I'd also be OK merging this now, and then if we find we want other processors, we can revisit at that point and implement plugins. I'd be willing to do that myself.
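For reference, the pluginlib pattern being described looks roughly like this (a sketch only: the `ros2_ouster::DataProcessor` base class, header path, and plugin names are assumptions, not the driver's actual interfaces):

```cpp
#include <memory>
#include <string>
#include <vector>

#include "pluginlib/class_loader.hpp"
#include "ros2_ouster/data_processor.hpp"  // hypothetical base-class header

// Load the processors named in a parameter (e.g. a string list) at runtime.
std::vector<std::shared_ptr<ros2_ouster::DataProcessor>> load_processors(
  const std::vector<std::string> & plugin_names)
{
  // The loader is given a package name and the fully qualified base type;
  // it locates the shared library for each plugin name, if one exists.
  pluginlib::ClassLoader<ros2_ouster::DataProcessor> loader(
    "ros2_ouster", "ros2_ouster::DataProcessor");

  std::vector<std::shared_ptr<ros2_ouster::DataProcessor>> processors;
  for (const auto & name : plugin_names) {
    // Instantiates the plugin if possible; throws if it cannot be loaded.
    processors.push_back(loader.createSharedInstance(name));
  }
  return processors;
}
```

In that world, "turning a processor off" is simply not listing its plugin name, which is the "once and for all" version of what the mask does.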
I'm OK with this. Merge whenever you're ready.
@SteveMacenski and I have spoken a bit about how to make the data processors runtime selectable. We have contemplated several methods, including converting the data processors to plugins (using pluginlib), static polymorphism like CRTP, etc. However, Steve mentioned concerns about the safety of runtime-loadable code, and also a desire to urge the community to contribute back any useful processors they may have -- i.e., if you want a new data processor in the driver, push it to us and we will compile it in. In that spirit, I think this PR can provide this capability and gives us some really nice side benefits. The primary feature is that, through a parameter, we can selectively choose which data processors are active at runtime.
This PR introduces the `os1_proc_mask` parameter. Its value is a string (that looks a lot like a bit mask that programmers are likely to be familiar with) that encodes which processors to activate when the driver starts. By default, we start all of the known data processors that are available today (no behavior change for current users); setting the parameter that way looks like the YAML sketch below. The available processors can be combined in any way; valid strings include (but are not limited to) `IMG|PCL`, `IMG|PCL|IMU`, `PCL`, etc.
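A minimal parameter-file sketch of what that looks like (the node name and the exact mask string here are assumptions for illustration, not taken from the repo):

```yaml
# Hypothetical params.yaml entry; node name and mask value are assumed.
ouster_driver:
  ros__parameters:
    os1_proc_mask: "IMG|PCL|IMU"   # or e.g. "PCL" to run only the point cloud processor
```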
I have a secondary motivation for this PR, and it is performance. In #41 we improved latency and jitter through the system by publishing the PointCloud as a `std::unique_ptr` rather than a reference, which eliminates at least one expensive copy in the middleware. Running my same performance tests for consuming the PointCloud data and setting `os1_proc_mask` to `PCL` (i.e., the only active data processor is the PointCloud processor), we can reduce end-to-end latency (from the time the driver sees the data to the time an application receives the data ... so time in the driver and ROS2 middleware stack) by roughly 56%. Here are some plots (the syntax on these plots is the same as in #41: the blue trace is the latency and jitter of the LiDAR getting the data to the driver, and the red trace is the latency and jitter between the driver and the consuming application):
And a quantile plot of the same:
Here is what the end-to-end numbers look like (this is where this PR gives us a gain):
As reported in #41, the improved median latency was 23.706 ms and the MAD was 0.252 ms. By turning off all data processors except the point cloud, we now have a median end-to-end latency of 10.319 ms with a MAD of 0.183 ms. This is a ~56% reduction in latency through the driver and middleware stack.
There are more gains to be had in this driver and this PR provides a foundational piece to get us there.