In-memory data layout #52
I can't easily see the notebook; the GitHub rendering for Python notebooks is terrible. But I'll take your word for it. The buffering lambda for structuring data was taken from the client software Ouster provides. Have you done a quick test to see if just structuring the data differently will help performance (e.g., just change the image format to be HxW rather than WxH so the buffering isn't missing caches, if that's the issue) before we go full-blown into a redesign? The image processors are probably the simplest place to start, since an image is one long vector of data (which a PC2 is as well, but less directly). You could also try just publishing one long ordered vector so there are absolutely no misses. It would be garbage to visualize, but it should give you a direct 1:1 comparison of the ideal best case vs. the current case, to see if that is the cause of any performance issues (e.g., 1xH*W).
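To make the layout question above concrete, here is a minimal sketch of the two interpretations of the same flat buffer (`W`, `H`, and the helper names are illustrative assumptions, not driver code):

```cpp
// Minimal sketch of the two layouts for the same flat buffer. The point:
// whichever layout the producer uses, the consumer's inner loop should walk
// the contiguous dimension, or nearly every access lands in a new cache line.
#include <cstddef>

constexpr std::size_t W = 1024;  // azimuth samples per revolution (1024x10 mode)
constexpr std::size_t H = 64;    // beams ("rings")

// H x W layout: a fixed beam's samples are contiguous in memory.
inline std::size_t idx_hxw(std::size_t beam, std::size_t azimuth) {
  return beam * W + azimuth;
}

// W x H layout: a fixed azimuth column's beams are contiguous in memory.
inline std::size_t idx_wxh(std::size_t beam, std::size_t azimuth) {
  return azimuth * H + beam;
}
```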
Maybe try this link instead. I think it is important to digest the notebook, as it really elucidates the issue. I want to make sure you are in agreement with me here before moving forward. Without the details presented in the notebook, it will be hard to convey the nature of the problem. Let me know if you get a chance to read it before I comment on your other suggestions above.
I understood what you meant when you sent this to me in an email. I can glance over that notebook, but I'm not a Python developer, so the important details I'd want to understand aren't immediately clear to me (e.g., I think the only place that needs to change is this https://github.com/SteveMacenski/ros2_ouster_drivers/blob/foxy-devel/ros2_ouster/include/ros2_ouster/OS1/OS1_util.hpp#L118 batch iterator function). I think you just want to embed the information differently so it is transposed. So I'm not sure we need to be too much on the same page, since the change should be small and the result should be evident if it lowers resource utilization. I believe the data is coming in per encoder tick fire (1/1024th azimuth), so it's just how we batch up that data when it comes in that needs to change. Unless what you're proposing is that the information coming over the wire is coming in as full-revolution beams, rather than over-time beams, which I would be suspect of. Does that make sense? It should be straightforward to change that, to quickly test whether manipulating the batching up of data will fix your issue.
I assumed, from the narrative stating the LiDAR mode (1024x10) and from general knowledge of the LiDAR (it always sends back 64 "rings" of data), that this would not be ambiguous even without knowledge of Python or numpy. Sorry. In this case
OK. There is not that much Python in there, though. There is a lot of English narrative and supporting graphics as well, which I took quite a bit of time to articulate. The Python is there so we can look at data inline with the narrative. It has been my experience that people like to see actual executable code to support a "report".
The narrative (English, not Python) goes on to explain that it is not just transposed data. The metadata on the point cloud claims
Now, on the last point, I am not sure if ROS specifies that the
Fair enough. I explicitly used the word "peek" in the narrative for this reason. However, I'd argue that understanding the Ouster LiDAR data, seeing this time image data, and knowing (since it was stated in the analysis) that the LiDAR was running in
That is not how it works. Yes, it is a flash LiDAR, but it only exposes a column at a time. So, in the case I used in the notebook of running in
I will take a look at it and provide an implementation that solves the problem I am bringing up here. Once I have it, we can review it, if you are OK with that. I won't be able to look at this again until (at best) the end of next week.
Let me read over this again in more detail tomorrow and come back with more tangible thoughts. I finally just got ROS / Python 3 playing well together on my machine, so I can actually reproduce that notebook now and play around to get an intuition. Before now, my machine has been basically bricked for ROS / Python 3 due to 18.04 -> 20.04 issues (this week until now, I've just been debugging and writing documentation in the meantime 😉).
I still don't understand why they're all not Doing
Shows me that your (64, 1024) is 64 columns and 1024 rows, which is the same as how we use the data in ROS. So the

Later you say in the document
Which is suspect to me, since the shapes we just looked at seem to indicate that the first column is indeed the first column in this numpy array. I can't explain the contents of the array (those numbers seem all wrong to me), but the shape seems to indicate that it is correct. Moving forward in your analysis
I agree that if that is indeed a measure of time, it's very strange. It should be a smooth gradient from 0 -> 1024, and all of the rings 0 -> 64 should be the same color. This makes me suspect that either what you're plotting isn't actually time (from a mistake on the sensor or in parsing the packets) or the points aren't actually in a structured order.

Let's narrow in on the order, assuming the times are actually correct. You didn't indicate at the start of this analysis what the H5 file is of. If this is a pointcloud message from the output of the driver, then the above could be explained because the order is bogus, but then why are you plotting it as if it is in an image grid? If in

What I would challenge you to do at this point is the same thing with the image processors. Those should be in the correct order, because otherwise the image values would look bogus. So if you instead took the intensity image processor or something and repeated this experiment, it should be in the correct gradient order, assuming your assertion here is that the pointcloud isn't in a structured order due to the batch processing. If your assertion is that the order is weird from the sensor, then the image processors would be constantly reordering and moving points around to get structured image values. It would be valuable to know who's unordering things: the sensor or the batch processor. You could create a time image processor as well, and the values of that directly would make for a very similar experiment.

tl;dr: after diving into this, I see that there is an issue, but I don't think this experiment identified whether the issue is in the driver or in the order that data is sent from the sensor. That should be relatively easy to determine by having the image / pointcloud processors print the azimuth and elevation numbers to screen on reception. If they're in order, then it's the batch processor for the pointcloud. If they're not in order, the batch processor for the pointcloud is just translating the information in the order it receives it.
Or I went down the totally wrong rabbit hole and what we're plotting isn't actually time for some reason, but I think from your
No. Look at your data again. It is rows x cols.
It is time. This is the time image. The units are nanoseconds relative to the first column firing, which is why the first 64 entries are zeros. Then every column after that is relative nanoseconds from 0. Just convert them if you don't believe it. Running at 10 Hz, you will see this is a 100 ms scan. Convert the last column, for example; you will see it is ~100 ms after the first column firing. It is correct.
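As a quick arithmetic check of that claim, here is a sketch using only the stated 10 Hz scan rate (the numbers are derived, not read from a real sensor):

```cpp
// Sanity check of the timing claim above: one revolution at 10 Hz takes
// 1/10 s = 100 ms = 1e8 ns, so the last of 1024 columns should carry a
// relative timestamp near (1023/1024) * 1e8 ns.
#include <cstdio>

int main() {
  const double scan_period_ns = 1e9 / 10.0;                  // 10 Hz
  const double last_column_ns = scan_period_ns * 1023.0 / 1024.0;
  std::printf("scan period: %.1f ms\n", scan_period_ns / 1e6);    // 100.0 ms
  std::printf("last column: %.3f ms after the first\n",
              last_column_ns / 1e6);                              // ~99.9 ms
  return 0;
}
```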
Two things are happening. First, I am reshaping from 64 x 1024 to 1024 x 64. Then the

Anyway, I have a fix already in place in my tree. It works correctly. I just want to take some performance metrics on it. I think the hit will be minimal. I'll write up the technical details in the PR. Sorry, it's almost 7 PM here on Friday night and I'm trying to get dinner together for my kids, so I cannot elaborate more at the moment.
Ah, got you - sorry. Yes, 64x1024.
Ordering the points has a performance hit? That's not great. Is that because the data is coming in wrong or the batching up is wrong?
Yeah. The performance hit is because once we have the batched scan as one big linear array in column-major order, we want to reorder it to row-major order. To do that, we traverse the source array by column, which is fine and totally performant / cache-friendly, as the data are column-ordered. But writing into the destination array, we have no choice but to fill it in with a very irregular memory access pattern, which basically results in cache thrashing. Unfortunately, there is no way around it. As I mentioned above, I'll measure and quantify the impact, which I think will be minimal. I say that because there was already a copy being done from the PCL data structure to the ROS message prior to publication. So, we still make a copy, but rather than a big

If you are unfamiliar with these issues, I would highly suggest you watch Scott Meyers' CPU Caches talk. Specific to this discussion, look at 23:30 - 27:00 in the linked-to video. I think it may help elucidate some things. The whole video is well worth your time, though, if you haven't seen it.
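A minimal sketch of the reorder being described, under assumed dimensions and a placeholder point type (this is illustrative, not the actual driver code):

```cpp
#include <cstddef>
#include <vector>

struct Point { float x, y, z, intensity; };  // placeholder point layout

constexpr std::size_t W = 1024;  // azimuth columns
constexpr std::size_t H = 64;    // beams per column

// src is column-major (all H beams of column 0, then column 1, ...);
// dst becomes row-major (all W samples of beam 0, then beam 1, ...).
void reorder_col_to_row_major(const std::vector<Point>& src,
                              std::vector<Point>& dst) {
  dst.resize(src.size());
  for (std::size_t u = 0; u < W; ++u) {   // sequential reads through src
    for (std::size_t v = 0; v < H; ++v) {
      dst[v * W + u] = src[u * H + v];    // writes stride W points apart
    }
  }
}
```

The reads walk `src` sequentially, but each successive write to `dst` lands a full row's width away from the previous one, which is exactly the irregular pattern that thrashes the cache.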
It's tough to say "wrong". On the data input side, the LiDAR fires by column and batches up 16 columns at a time, which are then sent over UDP to the driver. The driver then writes those 16 columns' worth of data into the allocated memory buffer (by column). Then it gets the next 16 columns, and so on, until there is a whole scan. Once the scan accumulates, that buffer is simply

As of right now, I think the best thing we can do is what I have done in my tree and what I have explained here. I'll get the PR out soon. But it actually does make me think, and I've had several early conversations about this with some people... maybe we need a new data structure (and associated algorithms to operate on those data) that more closely models 3D ring-based LiDAR. One that recognizes a priori that these things are inherently columnar, sparse, etc. Anyway, that is not our problem to solve in this issue or this driver. I think the PR I'll submit, which reorders the data in the way you would expect, will be a major enhancement (bug fix?) to what we are talking about here.
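For illustration, a rough model of that input side, assuming a simplified packet shape (the real packets carry additional per-column metadata; all names here are invented):

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct Point { float x, y, z, intensity; };  // placeholder

constexpr std::size_t H = 64;                // beams per column
constexpr std::size_t COLS_PER_PACKET = 16;  // columns per UDP packet
constexpr std::size_t W = 1024;              // columns per full scan

using PacketColumns = std::array<std::array<Point, H>, COLS_PER_PACKET>;

// Append one packet's columns to the scan buffer, column by column.
// After W / COLS_PER_PACKET packets, `scan` holds one full scan, and it
// is column-major by construction: the arrival order of the data.
void append_packet(const PacketColumns& packet, std::vector<Point>& scan) {
  for (const auto& column : packet) {
    scan.insert(scan.end(), column.begin(), column.end());
  }
}
```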
I'm familiar with caching. We can chat in the PR, I suppose. I think there's a slightly more intuitive solution that would avoid an extra traversal.
There is already an implicit copy in the PCL-to-ROS data structure conversion. This is why I did it there -- same number of copies as before, just with a reordering as well in my implementation. I think changing it elsewhere requires more areas of the code to be touched. Without tests, I am not ready to take on the task of ensuring correctness. We consume the point cloud.
Now there are 2 copies, though, which is what I'd like to avoid if possible. There's a copy from the buffered pointcloud -> dense pointcloud -> sensor_msgs/PointCloud2. If we modified the buffering function, then it would only be buffered pointcloud -> sensor_msgs/PointCloud2. I'm pretty sure the PCLPointCloud ROS-PCL shim conversion still involves a copy. But better than nothing. I might, though, see if I can't find some time in the next couple of days to prototype a solution.
Right. So, I eliminated that one by doing the re-ordering and copying in one step, then just calling
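A sketch of that one-traversal idea, under the same illustrative assumptions as the sketches above (a trivially copyable point type, with `msg.fields` configured elsewhere to match it; this is not the PR's actual code):

```cpp
#include <cstddef>
#include <vector>
#include <sensor_msgs/msg/point_cloud2.hpp>

struct Point { float x, y, z, intensity; };  // placeholder; must match msg.fields

constexpr std::size_t W = 1024;  // azimuth columns
constexpr std::size_t H = 64;    // beams per column

// Reorder + copy in a single traversal: column-major scan in, row-major
// PointCloud2 out. This skips the intermediate dense cloud entirely.
void fill_msg_row_major(const std::vector<Point>& scan,
                        sensor_msgs::msg::PointCloud2& msg) {
  msg.height = H;
  msg.width = W;
  msg.point_step = sizeof(Point);
  msg.row_step = msg.point_step * W;
  msg.data.resize(msg.row_step * H);
  auto* out = reinterpret_cast<Point*>(msg.data.data());
  for (std::size_t u = 0; u < W; ++u) {
    for (std::size_t v = 0; v < H; ++v) {
      out[v * W + u] = scan[u * H + v];  // same reorder, written in place
    }
  }
}
```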
Ah ok. That works for me then 👍
In #46 I outlined some puzzling issues related to how the driver organizes its in-memory data buffers. I created this notebook to more clearly articulate the issue and its downstream effects.
I'd like to use this issue to discuss this topic and come to a consensus on a path forward. I am happy to take on the work of playing with some potential implementations and seeing how they affect, among other things, the performance of the system.
As is more completely articulated in the notebook linked to above, the current in-memory data layout has implications for performance, usability, and correctness. On the latter point, given how the data are laid out in memory, an argument could be made that the metadata on the point cloud is inconsistent with these data and could be considered a bug. I think as it stands today, users who simply use the PointCloud2 and/or PCL interfaces to these data may be insulated from those issues. However, users developing deeper perception algorithms that access the array data directly, and assume it is modeled consistently with the physical LiDAR, will certainly be affected.

Let's discuss openly and come to a consensus.