Missing lidar chunk os1-128-u #240
I have the same issue with an OS1-128 in ROS2 Humble/Iron (tested both). The higher the resolution, the more the point cloud in RViz2 is "flickering" and missing chunks. The fix from here - increasing the UDP receive buffer size - seemed to work at first, but after a day the flickering started again, so it seems to have been pure luck. I can add that even during the time the fix seemed to work, it only did so while the ouster driver was running in a docker container started with
Any update on this issue? I have noticed it on an OS0 32-channel as well, running the latest driver.
Any update? I have the same issue with an OS2-128. When I subscribe to it more often, it gets even worse. It also depends on the lidar_mode: below 1024x10 it works with RViz as a single subscriber. When I want to do a recording with the record.launch file, I get missing chunks again. I hope we get some advice soon.
We are encountering the same issue when configuring our OS1-128 with a resolution of 2048x10. It works properly in Ouster Studio, but when using the ROS2 driver, it experiences flickering.
Thanks
@lionator and @javierAraluce I have just pushed a new release towards ROS2 (rolling, humble, iron) which has several fixes and improvements. Could you try the updated version 0.12.0 and let me know whether this fixes the problem you are observing or not? Thanks
I have done it. It works now with 2048x10 with one subscriber, but if 2 or more subscribers are added it stops working. See here:
I also tried the new release and it doesn't change anything on my machine. I can also confirm that with more subscribers, a larger portion of the pointcloud is missing on average.
Thanks for the feedback; I don't usually test with multiple subscribers. I will look into the issue.
If you test it with the record.launch.xml file and RViz visualisation is active, you have 2 subscribers and it does not work.
Hello all,
Does this have to do with the socket buffer size? ros-drivers/ros2_ouster_drivers#89
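For context, a UDP receiver can ask the kernel for a larger per-socket buffer via SO_RCVBUF. Below is a generic sketch of requesting and verifying such a buffer (illustrative only, not the ouster-ros code); note that the kernel caps the request at net.core.rmem_max, which is why the sysctl change discussed further down in this thread matters.

```cpp
#include <sys/socket.h>
#include <cstdio>

// Illustrative helper (hypothetical name): ask for a larger receive buffer on
// an already-created UDP socket and print what the kernel actually granted.
void enlarge_recv_buffer(int sockfd, int requested_bytes) {
    if (setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF,
                   &requested_bytes, sizeof(requested_bytes)) != 0) {
        std::perror("setsockopt(SO_RCVBUF)");
        return;
    }
    // Read back the granted size: the kernel caps it at net.core.rmem_max
    // (and on Linux reports roughly double the usable value for bookkeeping).
    int granted = 0;
    socklen_t len = sizeof(granted);
    if (getsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &granted, &len) == 0) {
        std::printf("requested %d bytes, kernel granted %d bytes\n",
                    requested_bytes, granted);
    }
}
```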
@JonasTietz I tried to reproduce this issue under ROS1 on an NVIDIA AGX with a sensor connected to a switch, but I couldn't generate the problem with the latest ouster-ros release. Are you able to generate the problem when connected directly to the sensor, or maybe try a different network interface? I remember one of our users noted that there were issues with certain network hardware, which could be the cause of the problem. I am running a long-term test to check for any degradation in performance. I was able to reproduce the issue with ROS2 and am looking further into it, but since the issue is specific to ROS2 it might be what @tom-bu had mentioned.
@JonasTietz I ran a 12-hour-long test on my setup and I didn't observe any degradation in driver performance using the ROS1 driver. I used the 1024x10 lidar mode and the RNG19_RFL8_SIG16_NIR16 profile, and the frame rate stayed at the expected 10 Hz. I am going to argue that your network setup has to do with the issue for ROS1. Could you please check if you can reproduce the problem without using the described network switch? I am still investigating the issue with the ROS2 driver.
Hello @Samahu, we have also seen this issue when not using the interface box and plugging the sensor directly into the ethernet port. But I wanted to make sure that this is not the issue we were seeing.
@JonasTietz thanks for the feedback, I will keep the ticket open until we have the issue fully resolved.
Hi, we have been encountering the same issues using our sensor. Most of my following assumptions come from the fact that I have a hacked-up version of the old community driver (with ported functionality for the new firmware) in this fork branch that does not appear to have any of these issues (or they're at least a lot less prevalent). So, while there might be some hardware-related differences that influence this behavior, that would still not explain why there is one software implementation that works and one that has problems when using the same hardware. It would be good if someone else could also confirm this. Here, I am always using the single ouster driver node, so the raw packets must be getting processed directly within that node. First, I count up the accumulated packets in the packet callbacks and compare against the expected per-frame totals.
But those counts are never reached, hence we're missing chunks of the cloud. Going further and printing out the individual packet ids, I performed various experiments with different lidar output modes and subscriber counts to compare the behavior of the two driver implementations. To make sure I'm doing the counting as close as possible to the actual moment any new packet reaches the socket, I've added the counting directly into the lambda callback where the packets are read.
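A minimal sketch of that kind of counting (illustrative names only, not the exact code from the fork) could look like this:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Illustrative packet counting: bump an atomic counter inside the
// packet-receive lambda and compare against the expected per-frame total,
// e.g. 2048 columns / 16 columns-per-packet = 128 lidar packets per frame.
std::atomic<uint64_t> lidar_packets{0};

auto count_packet = [](const uint8_t* /*buf*/, size_t /*len*/) {
    // called right where the packet is pulled off the socket
    lidar_packets.fetch_add(1, std::memory_order_relaxed);
};

void report_frame(uint64_t expected_packets_per_frame) {
    std::printf("packets this frame: %llu (expected %llu)\n",
                static_cast<unsigned long long>(lidar_packets.exchange(0)),
                static_cast<unsigned long long>(expected_packets_per_frame));
}
```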
I've also added the same to my modified version of the community driver, so I can compare the outputs from both, like for like. All of the subscribers were run on the same machine that was running the driver node. I haven't performed any tests with subscribers on a different machine. Testing configuration 1:
Testing configuration 2:
I don't have any pretty graphs for this (that would be too much work), but there seems to be some correlation between the packet losses and CPU usage, as well as the data rate. Sometimes the missing section always starts from the exact same id - these are the consistent missing-chunk situations. But once packets start missing randomly, the whole visualization breaks apart. With the community version, there are practically no continuous missing events and the cloud visualization never breaks down. Also, I'm not sure if there's any correlation with the network connection path (at least on my side). The routers were not reporting any dropped packets or routing errors, and I also connected the lidar directly to the PC with a static IP and the behavior is exactly the same. These were my first naive assumptions about what could be happening:
What I see for now is:
@Imaniac230 thanks for your detailed report, this is very helpful. When porting the ROS1 driver to ROS2 and implementing the combined node, I hadn't tested its behavior with multiple subscribers, hence I hadn't noted this issue. I think one notable difference is that the community driver utilizes a lock-free ring buffer, which definitely has advantages over the current ThreadSafeRingBuffer implementation; it was my intention to switch to a lock-free implementation per my comment, but I considered this an optimization and deferred it to later. This may not be the sole problem, but a lock-free implementation - if done right - would perform better since it doesn't need to hold a lock on every push or pop of items. Another thing is that I never actually checked or performed a benchmark on the queue length and how often it gets filled. As you noted in your report, with a single subscriber there aren't any missing packets, so the allocated buffers seem to work just fine. There is also one factor I would be curious whether you looked at during the comparison: when comparing my implementation vs the community driver - using a single subscriber - it was noteworthy to me that the community driver had lower CPU utilization, but at the same time it published point clouds less frequently as the LidarMode increased (link to plots).
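For reference, a minimal single-producer/single-consumer lock-free ring buffer along these lines is fairly compact; the sketch below is illustrative only and is not the community driver's actual implementation.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Minimal SPSC lock-free ring buffer (sketch): one thread may call try_push,
// exactly one other thread may call try_pop. No locks are held on either path.
template <typename T>
class SpscRingBuffer {
public:
    explicit SpscRingBuffer(size_t capacity) : buf_(capacity + 1), cap_(capacity + 1) {}

    bool try_push(const T& item) {
        const size_t head = head_.load(std::memory_order_relaxed);
        const size_t next = (head + 1) % cap_;
        if (next == tail_.load(std::memory_order_acquire))
            return false;  // full: caller decides whether to drop or retry
        buf_[head] = item;
        head_.store(next, std::memory_order_release);
        return true;
    }

    bool try_pop(T& item) {
        const size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;  // empty
        item = buf_[tail];
        tail_.store((tail + 1) % cap_, std::memory_order_release);
        return true;
    }

private:
    std::vector<T> buf_;
    const size_t cap_;
    std::atomic<size_t> head_{0};  // written only by the producer
    std::atomic<size_t> tail_{0};  // written only by the consumer
};
```

The trade-off is that a lock-free SPSC design only works cleanly with exactly one reader and one writer; with multiple point cloud subscribers fed from one queue, the fan-out has to happen on the consumer side.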
Adding a data point here: I have two Lenovo ThinkPad laptops (different models) running Ubuntu 22 with ROS Humble. I can see missing packets on one but not the other. Funnily enough, the laptop where everything works is an older model with a weaker processor. I wonder if this is related to the network adapter in use? Maybe just a fluke?
Thanks all for the valuable feedback. I have identified the problem and will be working on a fix soon. The issue arises when the buffer read operation takes more time than the available window before we receive another packet; if this happens, we run into the problem. Since I perform packet processing and publishing in the same thread that reads packets, the problem is exacerbated when there is more than one subscriber to the point cloud topics. I will work on a fix to address the issue very soon.
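Conceptually the fix boils down to keeping the socket-reading thread free of any processing work. A rough sketch of such a split (hypothetical names, not the actual patch) could look like this:

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

// Sketch: the socket-reading thread only enqueues raw packets, while a
// separate worker thread does the processing and publishing, so a slow
// consumer can no longer stall the socket reads.
using Packet = std::vector<uint8_t>;

std::deque<Packet> packet_queue;
std::mutex queue_mutex;
std::condition_variable queue_cv;
std::atomic<bool> shutting_down{false};

void enqueue_packet(Packet pkt) {          // called from the socket-read thread
    {
        std::lock_guard<std::mutex> lk(queue_mutex);
        packet_queue.push_back(std::move(pkt));
    }
    queue_cv.notify_one();                 // reader returns to the socket immediately
}

void process_loop() {                      // runs in its own worker thread
    while (true) {
        std::unique_lock<std::mutex> lk(queue_mutex);
        queue_cv.wait(lk, [] { return !packet_queue.empty() || shutting_down; });
        if (packet_queue.empty()) break;   // woken only for shutdown
        Packet pkt = std::move(packet_queue.front());
        packet_queue.pop_front();
        lk.unlock();
        // decode_and_publish(pkt);        // heavy work happens off the read thread
        (void)pkt;
    }
}
```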
Regarding the topic rates, I've taken a quick look at the pointcloud topic from both implementations (just with a quick topic rate check). The rates do start to drop as more subscribers (4-5) are added, but it is the same for both implementations.
It's also interesting to me that the problem relates to the buffer
This issue also happened with an OS1-128 and OS2-128 Rev7 with Ubuntu 20.04 & ROS Noetic. I have changed the harness and power supply, but that didn't fix it. It seems like a thread timing issue, so I suspect CPU usage is the key, but are there any updates or ways to deal with this?
After another batch of experiments, I think there are two separate problems.
If we keep track of the read and write operations and print their difference, for example like this:
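A minimal sketch of that kind of bookkeeping (illustrative counters, not the exact snippet used in the fork):

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>

// Illustrative bookkeeping: count every ring-buffer write and read, and print
// how far the reader is lagging behind the writer.
std::atomic<uint64_t> write_count{0};
std::atomic<uint64_t> read_count{0};

void on_buffer_write() { write_count.fetch_add(1, std::memory_order_relaxed); }
void on_buffer_read()  { read_count.fetch_add(1, std::memory_order_relaxed); }

void print_lag() {
    const uint64_t w = write_count.load();
    const uint64_t r = read_count.load();
    // A steadily growing difference means the reader cannot keep up.
    std::printf("writes=%llu reads=%llu lag=%llu\n",
                static_cast<unsigned long long>(w),
                static_cast<unsigned long long>(r),
                static_cast<unsigned long long>(w - r));
}
```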
and, again, add the same thing to the community implementation, then we get the following output:
The community driver does slowly overflow the buffer once the third subscriber is started. In fact, the flickering that I mentioned in my previous tests is caused by the writer periodically overtaking the reader and overwriting data. In this driver, the buffer never overflows, because the packets are never written into it; we're just gradually dropping more and more of them. If the ring buffer is refactored so that the writing thread no longer holds back packets,
then the missing packets no longer occur and it starts behaving the same way as the community implementation. These are the modified callbacks that I am using:
The members are:
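The exact snippets are not reproduced here; the following is a hedged reconstruction of what such members and write/read callbacks could look like (hypothetical names, not the code from the fork):

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Sketch: a ring buffer where the writer always stores the packet, overwriting
// the oldest unread slot when full, and the reader waits on a condition
// variable for new data instead of polling.
class OverwritingRingBuffer {
public:
    OverwritingRingBuffer(size_t capacity, size_t item_size)
        : items_(capacity, std::vector<uint8_t>(item_size)), capacity_(capacity) {}

    // Write callback: never drops incoming packets; if the buffer is full the
    // writer overtakes the reader (this reproduces the "overflow" behavior
    // described above, but no packets go missing on the write side).
    void write(const uint8_t* data, size_t size) {
        {
            std::lock_guard<std::mutex> lk(mutex_);
            items_[write_idx_].assign(data, data + size);
            write_idx_ = (write_idx_ + 1) % capacity_;
            if (count_ == capacity_)
                read_idx_ = (read_idx_ + 1) % capacity_;  // oldest item overwritten
            else
                ++count_;
        }
        not_empty_.notify_one();
    }

    // Read callback: block until at least one item is available.
    std::vector<uint8_t> read() {
        std::unique_lock<std::mutex> lk(mutex_);
        not_empty_.wait(lk, [this] { return count_ > 0; });
        std::vector<uint8_t> out = items_[read_idx_];
        read_idx_ = (read_idx_ + 1) % capacity_;
        --count_;
        return out;
    }

private:
    // The members: packet storage plus the synchronization primitives.
    std::vector<std::vector<uint8_t>> items_;
    size_t capacity_;
    size_t write_idx_ = 0;
    size_t read_idx_ = 0;
    size_t count_ = 0;
    std::mutex mutex_;
    std::condition_variable not_empty_;
};
```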
So the first problem could be solvable by refactoring the ring buffer (now the behavior is the same as the community version: we're not missing any packets, but the buffer will overflow with too many subscribers). Since the writing thread is no longer holding back on the packets, they are actually written into the buffer, which will overflow if the reading thread cannot keep up. I think you mentioned separating out the reading and processing into different threads (#240 (comment)), which could be a solution for the second problem.
I would like to comment that I am also having this issue with my Ouster OS2-128, Firmware v2.5.2, UDP Profile LEGACY, with the ROS2 branch pulled today. I am running it on 22.04 and Iron. If I use 2048x10 mode, then I see lost data from about 270 deg to 360 deg (https://static.ouster.dev/sensor-docs/_images/lidar-frame-top-down.svg). Sometimes less, sometimes it flickers to almost 240 deg. This makes it practically unusable. I can confirm that if I switch to 1024x10 I do NOT see this issue. This is an okay workaround for now, but not ideal for sure. I can also confirm that this is NOT an issue when using Ouster Studio.
I did some more testing today, and though the flickering and lost data are a lot less, I can still see it at 1024x10. Is this an issue with me using an older model of the OS-128 or the UDP LEGACY profile? I am a bit baffled that not everyone is seeing this issue and that there is no clear understanding why. It is easily reproducible with my setup.
Are you building the ouster driver in Release mode?
Yes, I am building in Release mode.
Here is a video showing the issue at 2048x10: Screencast.from.04-01-2024.08.51.12.PM.webm. I ran Ouster Studio (https://ouster.com/products/software/ouster-studio) and I do not have any issues, thus I am assuming that this is a ROS2 issue. Maybe an RViz2 issue? I checked the CPU usage, and it is not high.
I switched the UDP profile from LEGACY to RNG19_RFL8_SIG16_NIR16, but still the same flickering.
Here it is at 1024x10. It is better, but there is still a small flicker occasionally. Screencast.from.04-01-2024.09.06.42.PM.webm
What's the platform on which you're testing this? Did you try with some more powerful CPUs to eliminate performance throttling?
Laptop with an i9-9980HK (2.4 GHz x16) and 32 GB RAM. Not the fastest, but it should be more than plenty. I don't have a faster computer with ROS2 available. I was processing a Velodyne HDL-64 using ROS1 over 8 years ago on a computer with a fraction of this capability without issue, so I hope that the hardware is not the problem. Again, Ouster Studio works perfectly and I did not see any obvious CPU limiting when looking at the system monitor. It could be something that I have done wrong with ROS2 or a driver configuration issue. ROS2 has a lot more nuance in setting it up than ROS1, especially with QoS. I am launching it with:
Here is the driver_params.yaml file that I am using. Maybe there is something wrong in here? Thanks for your help. Sorry if I seem a bit frustrated. I finally have a project to use this amazing lidar, and I have spent a lot of time this last week or two trying to get it to work. I don't really have the time/money to start from scratch with the Ouster SDK.
No worries. @Samahu, are there any updates on solving this issue on ROS2?
Here is what I see in Ouster Studio with 2048x10. It looks pretty good, with no large regions of data missing. There is some very minor flickering here and there, which only occurs during the recording of the video. Maybe that is some indication that the Ouster lidar code is very sensitive to system load or CPU performance. However, without recording it looks 100% correct without any flickering. Maybe ROS2, or something in how the ROS2 driver uses the Ouster SDK (assuming that it does), amplifies this sensitivity? Screencast.from.04-03-2024.11.36.49.AM.webm
Hi all, please consider taking a look at any of the three following PRs (#319, #320, #321) - whichever applies to your current ROS distribution - and check if it helps resolve this issue without a regression in performance. I did take a close look at the solution approach in #302 and I think it does help too, but I couldn't verify it completely because under ROS2 Foxy, which is what @Imaniac230 targeted for the solution, I don't seem to be able to reproduce the issue on my laptop no matter how many subscribers I add. In the fix that I am submitting I use less synchronization overall. I have provided my analysis of the issue and the motivation behind this approach. I do have one follow-up item before I consider the solution complete, but what I have pushed so far still resolves this problem. Please give it a try if you can and provide feedback on whether it helps with your situation or not (or any other relevant feedback). 🙏
@Samahu, I tested PR #321. I have good and bad news. I really appreciate your help with this. The good news is that it does indeed perform better, a lot better. I can now run 1024x10 while actually processing the point cloud with voxel filtering, normal estimation and filtering, and ICP. The flickering and lost data seem to be gone. However, the bad news is that at 2048x10 I still have issues. It is significantly better - I am only seeing minor drops of data and not huge chunks - but I still see the flickering. It is minimal when just running the lidar driver alone, but it does exist. If I run my same processing, then the flickering is consistently an issue. Also, I see the output rate of the driver drop from a solid 10 Hz at 1024x10 to more like 8.5 Hz at 2048x10. I would expect that a gaming laptop or desktop-grade processor may be able to successfully run at 2048x10. I am not sure what exactly is happening in the processing of scan packets to a point cloud that requires so much CPU. My more moderate computer and the laptop that I use for my USV clearly struggle. It is an understandable and workable issue for me, but clearly disappointing.
@mwhannan74 Thanks for sharing the feedback. While waiting for further feedback from others here, have you considered increasing your system's recv buffer size by running the following?

sudo sysctl -w net.core.rmem_default=26214400
sudo sysctl -w net.core.rmem_max=26214400

Also, if you only need the point clouds, consider setting the proc_mask launch parameter to PCL.

EDIT: also, as I noted in the PR(s), I still want to try to eliminate/avoid the extra copy of packets between the packets queue and the added LidarScan buffers.
@Samahu, I have great news to report. It now works at 2048x10! Increasing the system's recv buffer size fixed the blinking data issue. For good measure I also increased the MTU size to 9000, like I do with my GigE cameras, just in case (not sure if it is actually needed). However, I was still only seeing an output rate of 8.5 Hz. I then configured proc_mask:="PCL" and the rate went up to a full 10 Hz! Thank you for all of your help.
Awesome, thanks for sharing!
@Samahu I'm using Foxy on my default dev system, but if it helps with comparisons and evaluations in any way, I can also quickly create target PRs with the same changes for Humble (and possibly Noetic).
I can do a comparison using the foxy branch I created once I get back to it (very soon 🤞)
Describe the bug
We are using an NVIDIA Jetson AGX Xavier on an Auvidea X221-AI carrier board and we experience missing pointcloud chunks on many of the pointclouds we receive. The missing chunk is always in the same spot, but sometimes larger and sometimes smaller. This happens on all UDP profiles except for the Low Data Rate profile. With the dual return profile it is more extreme.
We are on commit 6a7693c and had nothing running in the background on a fresh Ubuntu 20.04 install. The lidar is connected to a "RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller". The lidar firmware version is v2.5.2.
Here is a video of the behavior.
lidar_missing_chunks_small.mp4
To Reproduce
Steps to reproduce the behavior (steps below are just an example):
Platform: