
With multiple subscribers, killing one with 'kill -9' affects the other subscribers, and even after the killed subscriber is restarted, data is no longer received #5469

Open
Eternity1987 opened this issue Dec 9, 2024 · 7 comments
Labels
triage Issue pending classification

Comments

@Eternity1987

Eternity1987 commented Dec 9, 2024

Is there an already existing issue for this?

  • I have searched the existing issues

Expected behavior

The publisher publishes RGB data of around 3 MB per sample at a rate of 30 Hz. The subscriber should receive the data at 30 Hz.

Current behavior

Receiving data works correctly over shared memory but not over UDP. With shared memory, however, there is a separate issue: after running for a while, the shared memory file sometimes gets deleted (or left behind) and data stops being received; this could have various causes, yet the Fast DDS write API still returns OK.

Steps to reproduce

Environment: Ubuntu 22.04
fastdds@test:~/home/fastdds$ stress-ng --cpu 0 --io 4 --vm 2 --vm-bytes 128M --fork 4 --timeout 604800s
stress-ng: info: [1409220] dispatching hogs: 12 cpu, 4 io, 2 vm, 4 fork

sudo sysctl -w net.core.rmem_max=2147483647
sudo sysctl -w net.core.rmem_default=2147483647
sudo sysctl -w net.core.wmem_max=2147483647
sudo sysctl -w net.core.wmem_default=2147483647

Then run the publisher and subscribers normally.
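For reference, the sysctl settings above only raise the kernel limits; the participant's own UDP socket buffers also have to be large enough for ~3 MB samples at 30 Hz. A minimal sketch, assuming the standard Fast DDS 2.14 C++ API (the 12 MB buffer values are illustrative, not taken from this issue):

```cpp
#include <memory>

#include <fastdds/dds/domain/DomainParticipant.hpp>
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/transport/UDPv4TransportDescriptor.h>

using namespace eprosima::fastdds::dds;

int main()
{
    DomainParticipantQos qos = PARTICIPANT_QOS_DEFAULT;

    // Custom UDPv4 transport with enlarged socket buffers (illustrative sizes).
    auto udp = std::make_shared<eprosima::fastdds::rtps::UDPv4TransportDescriptor>();
    udp->sendBufferSize = 12 * 1024 * 1024;
    udp->receiveBufferSize = 12 * 1024 * 1024;

    // Note: disabling the builtin transports also removes the builtin SHM transport;
    // add a SharedMemTransportDescriptor as well if SHM should stay enabled.
    qos.transport().use_builtin_transports = false;
    qos.transport().user_transports.push_back(udp);

    DomainParticipant* participant =
            DomainParticipantFactory::get_instance()->create_participant(0, qos);
    // ... create topics, writers and readers as usual ...
    return participant != nullptr ? 0 : 1;
}
```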

Fast DDS version/commit

2.14.3

Platform/Architecture

Other. Please specify in Additional context section.

Transport layer

UDPv4, SHM

Additional context

No response

XML configuration file

No response

Relevant log output

No response

Network traffic capture

No response

@Eternity1987 Eternity1987 added the triage Issue pending classification label Dec 9, 2024
@Eternity1987
Author

Eternity1987 commented Dec 10, 2024


When using the shared memory transport, the shared memory file was deleted at some point after running for a while. My project did not actively delete it. Is there any mechanism within Fast DDS that could have deleted it?

@Eternity1987
Author

Eternity1987 commented Dec 11, 2024


Within one Participant there is one publisher that creates three DataWriters. Two of them publish sensor_msgs::msg::Image messages at 30 Hz each, to the topics /sensor/camera/head_front_rgbd/color/raw and /sensor/camera/head_front_rgbd/depth/raw respectively. The third publishes std_msgs::msg::String messages at 100 Hz to the topic /simple_topic, with best-effort QoS and a history depth of 10. Five subscribers are then started: two subscribe to /sensor/camera/head_front_rgbd/color/raw, two subscribe to /sensor/camera/head_front_rgbd/depth/raw, and one subscribes to /simple_topic. At this point everything works as expected. However, if one or more subscribers are killed with kill -9 and then restarted, the remaining subscribers stop receiving data. The on_lost callback keeps printing, and in some cases the program freezes without making any progress.
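For clarity, the writer QoS described above (best effort, history depth 10) would look roughly like this with the Fast DDS 2.14 DDS-layer API; a sketch for the /simple_topic writer, assuming the publisher and topic already exist:

```cpp
#include <fastdds/dds/publisher/DataWriter.hpp>
#include <fastdds/dds/publisher/Publisher.hpp>
#include <fastdds/dds/publisher/qos/DataWriterQos.hpp>
#include <fastdds/dds/topic/Topic.hpp>

using namespace eprosima::fastdds::dds;

// Sketch: best-effort reliability with a keep-last history of depth 10,
// matching the QoS described for the /simple_topic writer.
DataWriter* create_simple_topic_writer(Publisher* publisher, Topic* topic)
{
    DataWriterQos wqos = DATAWRITER_QOS_DEFAULT;
    wqos.reliability().kind = BEST_EFFORT_RELIABILITY_QOS;
    wqos.history().kind = KEEP_LAST_HISTORY_QOS;
    wqos.history().depth = 10;
    return publisher->create_datawriter(topic, wqos);
}
```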

@Eternity1987
Author

Eternity1987 commented Dec 11, 2024

Within a Participant there are three publishers. Two of them publish sensor_msgs::msg::Image messages at 30 Hz each, to the topics /sensor/camera/head_front_rgbd/color/raw and /sensor/camera/head_front_rgbd/depth/raw respectively. The third publishes std_msgs::msg::String messages at 100 Hz to the topic /simple_topic, with best-effort QoS and a history depth of 10. Five subscribers are then started: two subscribe to /sensor/camera/head_front_rgbd/color/raw, two subscribe to /sensor/camera/head_front_rgbd/depth/raw, and one subscribes to /simple_topic. At this point everything works as expected. However, if one or more subscribers are killed with kill -9 and then restarted, the remaining subscribers stop receiving data. The on_lost callback keeps printing, and in some cases the program freezes without making any progress.

This issue occurs quite frequently; sometimes it takes a long time to recover, and other times it does not recover at all. Is it because they all share the same shared memory? When debugging, I noticed that the sequence numbers were lagging far behind, constantly triggering callbacks. Should I consider using multiple Participants instead? Are there any good solutions to this issue?

If one of the subscribers hits a breakpoint during gdb debugging, this can also trigger the same situation and affect the other subscribers.
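For reference, the "on_lost callback" mentioned above presumably corresponds to the reader listener's sample-lost notification; a minimal sketch of such a listener, assuming the standard Fast DDS 2.14 listener API:

```cpp
#include <iostream>

#include <fastdds/dds/subscriber/DataReader.hpp>
#include <fastdds/dds/subscriber/DataReaderListener.hpp>

using namespace eprosima::fastdds::dds;

// Sketch: a reader listener that reports lost samples, similar to the callback
// that keeps firing after one of the peer subscribers is killed.
class LossReportingListener : public DataReaderListener
{
public:
    void on_sample_lost(DataReader* /*reader*/, const SampleLostStatus& status) override
    {
        std::cout << "total samples lost: " << status.total_count
                  << " (change: " << status.total_count_change << ")" << std::endl;
    }
};
```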

@Eternity1987
Author

Eternity1987 commented Dec 12, 2024


fastdds_ws.zip
This is my project code. During compilation, you may need to replace type_(new Image::PubSubType()) with the PubSubType generated by fastddsgen.
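For anyone building the attached project, the replacement mentioned above would roughly look like the sketch below; the header and class names (ImagePubSubTypes.h, ImagePubSubType) are assumptions, since the exact names produced by fastddsgen depend on the IDL:

```cpp
#include <fastdds/dds/domain/DomainParticipant.hpp>
#include <fastdds/dds/topic/TypeSupport.hpp>

#include "ImagePubSubTypes.h"  // generated by fastddsgen; actual file name may differ

using namespace eprosima::fastdds::dds;

// Sketch: register the generated type with the participant before creating the topic.
// This replaces the type_(new Image::PubSubType()) construction mentioned above.
void register_image_type(DomainParticipant* participant)
{
    TypeSupport type(new ImagePubSubType());
    type.register_type(participant);
}
```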

@Eternity1987 Eternity1987 changed the title from "topic monitor sensor_msgs/image/image, hz is too low, even 0" to "With multiple subscribers, killing one with 'kill -9' affects the other subscribers, and even after the killed subscriber is restarted, data is no longer received" Dec 12, 2024
@Eternity1987
Author

Eternity1987 commented Dec 13, 2024

[screenshot: four subscribers]

[screenshot: one publisher]

@jwillemsen Could you help me with this issue? Is it because the setup is too simple? I think this scenario is very common. Why does this issue occur? Fast DDS should be able to support distributed, one-to-many scenarios well, shouldn't it? 🙌🙌🙌👏

@cferreiragonz
Contributor

Hi @Eternity1987,

Is there any specific reason for using kill -9 to terminate subscribers? This method prevents the application from properly closing shared files, which could lead to some of the issues you're encountering.

Could you also try using the Fast DDS CLI tool to clean the shared memory files before restarting the subscribers? This might help resolve the problem. You can find more details about the CLI tool and its usage here; it is very easy to test, just make sure you compile it and run: fastdds shm clean

Hope this works for you!

@Eternity1987
Author


Thank you for your reply! I am trying to simulate a scenario with multiple subscribers in which one program crashes abnormally and a daemon pulls it up again. Why does that subscriber affect the other subscribers? Also, when the program starts it already calls fastdds shm clean to remove the zombie files first, but the problem still occurs; in my view, one crashed program should not affect the subscriptions of the other programs.
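One way to check whether the shared memory segment shared by all these processes is involved would be to restrict the participants to UDP only and see if the problem disappears; a diagnostic sketch, assuming the Fast DDS 2.14 C++ API:

```cpp
#include <memory>

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/transport/UDPv4TransportDescriptor.h>

// Diagnostic sketch: drop the builtin transports (including SHM) and use UDPv4 only,
// to check whether the shared memory segment is what couples the subscribers.
void use_udp_only(eprosima::fastdds::dds::DomainParticipantQos& qos)
{
    qos.transport().use_builtin_transports = false;
    qos.transport().user_transports.push_back(
            std::make_shared<eprosima::fastdds::rtps::UDPv4TransportDescriptor>());
}
```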
