Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provides a simple way to selectively turn on/off data processors #46

Merged
merged 2 commits into from
May 27, 2020
Merged

Provides a simple way to selectively turn on/off data processors #46

merged 2 commits into from
May 27, 2020

Conversation

tpanzarella
Copy link
Member

@SteveMacenski and I have spoken a bit about how to make the data processors runtime selectable. We have contemplated several methods to include converting the data processors to plugins (using pluginlib), static polymorphic methods like CRTP, etc. However, Steve mentioned safety of runtime loadable code and also a desire to urge the community to contribute back any useful processors they may have -- i.e., if you want a new data processor in the driver, push it to us and we will compile it in. In that spirit, I think this PR can provide this capability to us and gives us some really nice side benefits. The primary feature is that we can selectively, through a parameter, choose which data processors are active at runtime.

This PR introduces the os1_proc_mask parameter. Its value is a string (that looks a lot like a bit mask that programmers are likely to be familiar with) that encodes which processors to activate when the driver starts. By default, we start all of the known data processors that are available today (no behavior change for current users). Setting the parameter in that way looks like (in YAML):

os1_proc_mask: IMG|PCL|IMU|SCAN

The available processors can be combined in any way. Valid strings include (but are not limited to): IMG|PCL, IMG|PCL|IMU, PCL, etc.

I have a secondary motivation for this PR and it is performance. In #41 we improved latency and jitter through the system by publishing the PointCloud as a std::unique_ptr rather than a reference which eliminates at least one expensive copy in the middleware. Running my same performance tests for consuming the PointCloud data and setting os1_proc_mask to PCL (i.e., the only active data processor is the PointCloud processor) we can reduce end-to-end latency (from the time the driver sees the data to the time an application receives the data ... so time in the driver and ROS2 middleware stack) by roughly 56%. Here are some plots.

(The syntax on these plots is the same as in #41 the blue trace is the latency and jitter of the LiDAR getting the data to the driver and the red trace is the latency and jitter between the driver and the consuming application):

test-case-7_1024x10_raw_latency

And a quantile plot of the same:

test-case-7_1024x10_q_latency

Here is what the end-to-end numbers look like (this is where this PR gives us a gain):

test-case-7_1024x10_e2e_raw_latency

test-case-7_1024x10_e2e_q_latency

As was reported in #41 the improved median latency was 23.706 ms, and the MAD was 0.252 ms. By turning off all data processors except the point cloud we now have a median end-to-end latency of 10.319 ms with a MAD of 0.183 ms. This is a ~56% reduction in latency through the driver and middleware stack.

There are more gains to be had in this driver and this PR provides a foundational piece to get us there.

@tpanzarella tpanzarella requested a review from SteveMacenski May 22, 2020 21:03
@SteveMacenski
Copy link
Member

SteveMacenski commented May 22, 2020

To make sure I understand properly, the issue you have is because a packet is being buffered by multiple processors, there's an additional delay for processor type A to publish a full cloud because processor type B might called beforehand and is unused? Just verifying that this isn't a result of delays only in publishing, which would indicate a problem with the get_subscriber_count for only publishing if someone's listening. If you ran this test with anyone subscribed to the other topics (IMU, Images, etc) then those would be published and add latency. The fair comparison would be the current state of things making sure that it was launched with no rviz or other tools that might be subscribed or making connections to the other processor types.

If you did that and you see this result, sounds good I can take a look. I might argue plugins would be easier than this mask work, but I don't have an issue with the mask. It would just let us fix this problem once and then not be an issue or needing updates everytime a new processor was added. Just load what you need.

@tpanzarella tpanzarella changed the title Provides a simply way to selectively turn on/off data processors Provides a simple way to selectively turn on/off data processors May 22, 2020
@tpanzarella
Copy link
Member Author

I think what is actually happening is cache thrashing and it is as a result of how the data are delivered from the LiDAR and subsequently constructed in the driver. This is a legacy that carries back to the ROS1 driver. And it really needs to be fixed. To do the investigation and make the root cause fix, I need this PR in so that I can turn things on and off and work productively. I also may need to get Ouster involved, not sure yet.

Here are some details.... the core data structures in the driver are, as you would expect (let's assume I'm in 1024x10 mode) 64 x 1024 (rows x cols). As would be natural, in the Image Processor, 64 x 1024 images are being constructed and stored in memory in row-major-order. However, the LiDAR is delivering the data in column-major-order. With every pixel that is being populated in the image, there is a cache miss and for every loop iteration 4 images are being populated ... 4 cache misses. This becomes most evident if you look at an organized point cloud from the driver. While the shape of the PointCloud is 64 x 1024 the data are not stored as you would expect. The first column is actually stored in the first 64 "cells" of the first row. Then the next 64 cells of the first row are the second column, etc. It makes no sense and it is killing performance. It would make more sense if the data were organized as 1024 x 64, then we could at least just do one matrix transpose to reshape the data properly, but that is not how the data are being structured. It really makes no sense. While a transpose would cost us a data copy (of the whole matrix) it is better than what is currently happening.

All of that said, this PR basically provides a tool for me to do that much harder deeper work. It also allows for more cleanly adding new data processors. Which we want. As you know, I want to push out the unit vectors and distance image, which I think should be managed in a separate data processor. This gives me the framework to do that.

So, to get back to your questions again, directly, it certainly doesn't help that multiple processors loop over the data for each packet, but I really don't think that is the root cause. I think it exacerbates it because all of the looping over the pixels fights with the CPU cache machinery rather that exploiting it. This is my hypothesis.

@tpanzarella
Copy link
Member Author

One way to see what is going on is to look at the t field of the point cloud as a time image. Here is one I just recorded:

ouster_time_image

You'll notice the gradient of the image goes from violet to red from top-to-bottom. Recall, the pixel data in this image is time. This would make one think that the LiDAR scans from top-to-bottom exposing an entire ring of the receiver at a time. It does not do that. It exposes a column at a time as it sweeps in its circle. What we would actually expect in this image is the gradient changing from left-to-right.

The facts are, when you deconstruct the point cloud, it is populating an array that is shaped as we would expect 64 x 1024 (for example) but populating it by column, transposing the columns, and laying them into the rows of the matrix. It is bizarre.

@tpanzarella
Copy link
Member Author

The fair comparison would be the current state of things making sure that it was launched with no rviz or other tools that might be subscribed or making connections to the other processor types.

This is correct. The only running processes in all tests were the Ouster driver and the consuming application (perf_node) that is used to record the raw timings. Additionally, all unnecessary disk I/O for persisting the timings for off-line analysis is deferred until after all data points have been collected.

@SteveMacenski
Copy link
Member

From your first message, so my understanding is then for the image population, if we were to update the lambda to store it that way and then copy it over to the images, would that largely help?

Either way, I don't have a problem with this PR, I was trying to understand. The last question I have is whether you think that a mask is the way to go rather than implementing the plugins? That would be a "once and for all" fix to this problem by just including the processor plugins you want at runtime. It would kill 2 birds with 1 stone. You've clearly gone through some effort in this PR so I'd be OK reviewing / merging this implementation if you prefer. Though I imagine in some N weeks/months, we'll be revisiting the plugins conversation. I'm hoping to get some cycles in the near future to implement this since its come up a few times now.

@tpanzarella
Copy link
Member Author

From your first message, so my understanding is then for the image population, if we were to update the lambda to store it that way and then copy it over to the images, would that largely help?

I think there are several ways it can be addressed and we will certainly have to measure so that we can properly evaluate. However, all things being equal, since we are batching a scan, it would be nice if at the lowest layers of the library (packets or byte buffers) we deliver the data to the processors in row-major order. Then the current and new processors would not all have to deal with this artifact.

To be clear, this would be a breaking change for anyone who is doing more lower-level work with the Point Cloud (for example). I define lower-level as, not just treating it as a set of Cartesian Points but actually relying on how the shape of the data are stored today in the array storage that backs the Point Cloud.

The last question I have is whether you think that a mask is the way to go rather than implementing the plugins?

I also like the idea of plugins but it is not without its own trade-offs. Several of which are enumerated in the earlier comments of this PR. My current hesitation with plugins, and pluginlib specifically, is the risk that taking that approach my preclude some of the other frameworks we are trying to get integrated into this driver already. For example, the challenge with Lifecycle Nodes implemented as Components. It is my understanding now that things like intra-process comms can be turned on at the topic-level, not just at the Node level, so, we may be able to get around that -- I have not verified. But we still don't have a good story for Lifecycle Nodes + Components + Launch integration (for example -- i.e., to do that, we either need to provide or users have to write their own Lifecycle Manager to drive the FSM). We now throw in an additional layer of plugins for the data processors and things start to get maybe over-complicated, harder to debug, and potentially not even work together (not sure).

Bottom-line, I like the idea of plugins but maybe this simpler mask approach can get put to use today so that we can realize the benefits of selectively turning on/off the data processors. Then, we take a more holistic look at the architecture to include revisiting this topic. That is my gut feel on this today.

@SteveMacenski
Copy link
Member

For reference plugins add no additional complications at launch / components / lifecycle. They don't interact with the system at that level. You essentially just have a plugin loader object that is given a string name of a plugin and it finds the library (if exists) and makes an instance if possible. From a programming perspective, its really not much different than creating set of base pointers to a derived class and putting that pointer into the std::map.

But like I said this is fine. If we add a new processor though, that will start getting more and more messy with the masks. I'd also be OK merging this now and then if we find we want other processors, we can revisit at that point and implement plugins. I'd be willing to do that myself.

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
Copy link
Member

@SteveMacenski SteveMacenski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with this. Merge as you're ready.

@tpanzarella tpanzarella merged commit 666bb89 into ros-drivers:eloquent-devel May 27, 2020
@tpanzarella tpanzarella mentioned this pull request Jun 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants