XMLRPC HTTP/1.1 causes performance degradation in rosmaster between Kinetic and Melodic #2118

Open
emersonknapp opened this issue Jan 14, 2021 · 7 comments

@emersonknapp
Contributor

emersonknapp commented Jan 14, 2021

Related to #371
Introduced in #1287

Environment

  • ROS Distro: Melodic
  • Platform: Ubuntu 18.04 Bionic
  • Linux Kernel Version: 4.15
    • Special note: this doesn't occur on Linux kernel 5.4, and I am not able to reproduce it on Bionic with kernel 5.4. Ubuntu 18.04.5 ships with 4.15 by default, so this is an important consideration for ROS Melodic. A Focal-based distro (e.g. Noetic) won't have to worry about it.

Description

I am able to reliably cause the rosmaster to stop responding: service calls fail with "unable to contact master", as raised by https://github.com/ros/ros_comm/blob/noetic-devel/clients/rospy/src/rospy/impl/tcpros_service.py#L467 (triggered specifically by the call to master.lookupService).

The reproduction workflow involves starting a single rospy.Service in one node (serving std_srvs/SetBool), and 11 rospy.ServiceProxy instances, each in a separate node. Each of these clients calls the service at 200Hz. After about 30 seconds, "ServiceException: unable to contact master" starts to occur. The master does not crash, but is unreachable for several seconds. If the stress is stopped without stopping the master, the failure can be reproduced again on the next run, suggesting there is no lasting damage done to the master process - just a temporary hang of some kind.
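The load pattern described above can be approximated without ROS. The sketch below is a standard-library analog, not the attached testpkg: a throwaway XML-RPC server stands in for the master, and each client thread repeatedly issues a lookupService-style request at a fixed rate while counting failures. The names start_fake_master, run_client and stress are hypothetical, as are the node and service names passed in the request. It is not expected to reproduce the kernel-dependent hang by itself; it only illustrates the shape of the traffic that reaches the master.

```python
import http.client
import threading
import time
import xmlrpc.client
from socketserver import ThreadingMixIn
from xmlrpc.server import SimpleXMLRPCServer


class ThreadedXMLRPCServer(ThreadingMixIn, SimpleXMLRPCServer):
    """XML-RPC server that handles each connection in its own thread."""
    daemon_threads = True


def start_fake_master():
    """Start a stand-in 'master' on a free localhost port (assumption: not rosmaster)."""
    server = ThreadedXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    # Mimics the (code, statusMessage, serviceUrl) reply shape of lookupService.
    server.register_function(
        lambda caller_id, service: [1, "ok", "rosrpc://127.0.0.1:0"],
        "lookupService",
    )
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def run_client(url, rate_hz, duration_s, results, index):
    """Call lookupService at roughly rate_hz and record (ok, failed) counts."""
    proxy = xmlrpc.client.ServerProxy(url)  # one proxy per thread
    ok = failed = 0
    period = 1.0 / rate_hz
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        started = time.monotonic()
        try:
            proxy.lookupService("/client_%d" % index, "/set_bool")
            ok += 1
        except (OSError, http.client.HTTPException, xmlrpc.client.Error):
            failed += 1
        # Sleep for whatever is left of this period, if anything.
        time.sleep(max(0.0, period - (time.monotonic() - started)))
    results[index] = (ok, failed)


def stress(num_clients=11, rate_hz=200, duration_s=30.0):
    """Run num_clients concurrent callers and return their (ok, failed) counts."""
    server = start_fake_master()
    url = "http://%s:%d/" % server.server_address
    results = [None] * num_clients
    threads = [
        threading.Thread(target=run_client, args=(url, rate_hz, duration_s, results, i))
        for i in range(num_clients)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    server.shutdown()
    server.server_close()
    return results
```

Calling stress() with its defaults mirrors the numbers in the description (11 clients at 200Hz); shorter durations are enough to check that the harness itself works.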

Repro Instructions

I am running in a container; the image was built using the following Dockerfile:

FROM osrf/ros:melodic-desktop

RUN apt-get update
RUN apt-get install -y python3-pip
RUN pip3 install -U pip setuptools
RUN pip3 install -U colcon-common-extensions

The test application sources are testpkg.tar.gz

I run the following workflow

$ docker build . -t melodic-desktop-dev
$ mkdir src/
# move testpkg into src/
$ docker run -it -v $(pwd):/ros_ws -w /ros_ws melodic-desktop-dev

# in the container now
$ source /opt/ros/melodic/setup.bash
$ colcon build

## open up a separate shell into the container via docker exec
$ source install/setup.bash
$ roscore

## open up a third shell into the container via docker exec
$ source install/setup.bash
$ roslaunch testpkg testapp.launch --screen --wait
  • Every time I launch testapp.launch - it fails out in under a minute, meaning the rosmaster was unreachable even after several tries.
  • This same app works indefinitely on Kinetic using the same setup (only the Docker base image changes, to osrf/ros:kinetic-desktop)
  • If I check out ros_comm at melodic-devel into the workspace, revert "Use HTTP/1.1 in XMLRPC Server" (#1287), build and run, then the app runs indefinitely
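For readers unfamiliar with what the HTTP version switch means at the Python level: in the standard library's XML-RPC server, the HTTP version used in responses is controlled by the request handler's protocol_version attribute, which defaults to HTTP/1.0. The sketch below toggles that attribute on a throwaway server and reports which version the server answers with. Whether #1287 changed exactly this attribute is an assumption here, and make_handler, start_server and response_http_version are hypothetical helper names, as is the lookupService stand-in.

```python
import http.client
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCRequestHandler, SimpleXMLRPCServer


def make_handler(protocol_version):
    """Build a request handler class that answers with the given HTTP version."""
    class Handler(SimpleXMLRPCRequestHandler):
        pass
    # "HTTP/1.1" enables persistent (keep-alive) connections; "HTTP/1.0" does not.
    Handler.protocol_version = protocol_version
    return Handler


def start_server(protocol_version):
    """Start a throwaway XML-RPC server on a free localhost port."""
    server = SimpleXMLRPCServer(
        ("127.0.0.1", 0),
        requestHandler=make_handler(protocol_version),
        logRequests=False,
    )
    # Hypothetical stand-in for a master API call; not the real rosmaster code.
    server.register_function(lambda name: "rosrpc://localhost:1234", "lookupService")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def response_http_version(server):
    """Send one XML-RPC request and return the reply's HTTP version (10 or 11)."""
    host, port = server.server_address
    body = xmlrpc.client.dumps(("/set_bool",), methodname="lookupService")
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("POST", "/RPC2", body=body, headers={"Content-Type": "text/xml"})
        response = conn.getresponse()
        response.read()  # drain the body before closing
        return response.version
    finally:
        conn.close()
```

With HTTP/1.1 the server keeps the connection open after each reply unless told otherwise, which changes how sockets are reused under load; that is the behavior the revert undoes.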

This is of course a toy stress example, but it reproduces an error we have seen in more complex applications being run in a production environment.

Next Steps

I see the following options:

  • Revert the change in Melodic, keeping it in Noetic where users will have the newer kernel by default
  • Add a check to only conditionally enable HTTP/1.1 on kernels >=5.x - putting this change on latest development branch and backporting it to Melodic
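The second option could look roughly like the following. This is a minimal sketch of a kernel version gate, assuming the decision is made from the release string reported by platform.release(); parse_kernel_release and should_use_http11 are hypothetical names, and the threshold is a parameter because the exact cutoff was not known when the options were written (the fix referenced later in this thread is titled "Fix HTTP for kernel < 4.16").

```python
import platform
import re


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '4.15.0-112-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def should_use_http11(release=None, minimum=(5, 0)):
    """Return True if the kernel is at least `minimum` (major, minor).

    `minimum` defaults to (5, 0) to match the proposal above; this value
    is an assumption, not a verified cutoff.
    """
    if release is None:
        release = platform.release()
    return parse_kernel_release(release) >= minimum


def choose_protocol_version(release=None, minimum=(5, 0)):
    """Pick the HTTP version string a server handler would be configured with."""
    return "HTTP/1.1" if should_use_http11(release, minimum) else "HTTP/1.0"
```

For example, choose_protocol_version("4.15.0-112-generic") selects HTTP/1.0, while a 5.4 release string selects HTTP/1.1. Comparing (major, minor) tuples avoids the pitfalls of comparing version strings lexically.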

ros_comm maintainers, what do you think would be best? We will probably be able to solve this problem for our specific case by providing an environment running an upgraded kernel, but it likely affects other users, perhaps who have spent less time trying to debug its root cause. Given the 2023 EOL for Melodic, I would think we should take action rather than just wait it out.

@fujitatomoya
Contributor

@emersonknapp

thanks for the information, this really helps 👍

> Given the 2023 EOL for Melodic, I would think we should take action rather than just wait it out.

totally agree.

> just a temporary hang of some kind.

but this affects the entire ROS system, which I think is not acceptable.

btw, do you happen to know which kernel patch fixes this problem?

@emersonknapp
Contributor Author

> btw, do you happen to know which kernel patch fixes this problem?

Sorry, I haven't had time to narrow it down; all I know is that it works on 5.4.

jacobperron pushed a commit that referenced this issue Mar 11, 2021
Addressing performance issues described in #2118

Signed-off-by: Jesse Ikawa <jikawa@amazon.com>

Co-authored-by: Emerson Knapp <537409+emersonknapp@users.noreply.github.com>

@fujitatomoya
Contributor

@emersonknapp

can we close this? resolved in #2132

@emersonknapp
Contributor Author

emersonknapp commented Mar 11, 2021

Yes, this is resolved! It would be good to get a ros_comm release to Melodic with the patch

@jacobperron
Contributor

I'll do a Noetic release first, let that soak, then I can do a backport to Melodic.

@jikawa-az
Contributor

@jacobperron Any updates for the backport? Thanks!

jacobperron pushed a commit that referenced this issue Apr 6, 2021
Addressing performance issues described in #2118

Signed-off-by: Jesse Ikawa <jikawa@amazon.com>

Co-authored-by: Emerson Knapp <537409+emersonknapp@users.noreply.github.com>

@Crcodlus

I tested "Fix HTTP for kernel < 4.16" (#2132) with kernel version 4.2.0.
During parameter updates (rosparam load sample.yaml), the memory usage of rosmaster increases continuously and the memory is never released.

Further info:

  • rosmaster version: 1.15.11
  • param file has about 2000 parameters
  • setting single parameters with rosparam set has no impact on memory usage
