
Repeated instances of gz cause gazebo to throw an instance of 'gazebo::common::Exception' #3341

Closed
Maxx84 opened this issue Aug 30, 2023 · 1 comment · Fixed by #3374
Labels
bug Something isn't working

Comments


Maxx84 commented Aug 30, 2023

Environment

  • OS Version: Ubuntu 20.04
  • Source or binary build?
    Source, gazebo11 branch, version 11.12.0 (I don't have access to the specific commit)

Description

  • Expected behavior: I needed to automate stepping through a simulation, so I repeatedly called gz world -s. I expected to be able to run this indefinitely.
  • Actual behavior: After a while, gazebo crashes, printing the following message:
getifaddres: Too many open files
terminate called after throwing an instance of 'gazebo::common::Exception'

Over time, the number of sockets in the CLOSE_WAIT state keeps growing, which is likely what causes gazebo to crash.

Steps to reproduce

  1. In a terminal, launch gazebo in paused mode: gazebo -u
  2. In another terminal, call gz world -s over and over. See the example bash script below.
  3. (optional) In another terminal, look at the output of lsof | grep gz | grep CLOSE_WAIT over time (it will grow in size).
  4. After a while, gazebo will crash, printing the message reported above. On my machine, this happens after approximately 1600 iterations.

SAMPLE SCRIPT:

#!/bin/bash

for ((i=1; i<=10000; i++)); do
    echo "Running iteration $i"
    gz world -s
done

Output

Terminal output:

$ gazebo -u
getifaddres: Too many open files
terminate called after throwing an instance of 'gazebo::common::Exception'

Sample of lsof | grep gz | grep CLOSE_WAIT output:

gzserver  2322564 2322647 gzserver             mvespign 1003u     IPv4           39459013       0t0        TCP 192.168.1.181:41048->192.168.1.181:43069 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1004u     IPv4           39453459       0t0        TCP 192.168.1.181:34440->192.168.1.181:40511 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1005u     IPv4           39451487       0t0        TCP 192.168.1.181:33706->192.168.1.181:38207 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1006u     IPv4           39457185       0t0        TCP 192.168.1.181:48184->192.168.1.181:46273 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1007u     IPv4           39454290       0t0        TCP 192.168.1.181:39042->192.168.1.181:42529 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1008u     IPv4           39457194       0t0        TCP 192.168.1.181:40466->192.168.1.181:42267 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1009u     IPv4           39450340       0t0        TCP 192.168.1.181:56174->192.168.1.181:44805 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1010u     IPv4           39461939       0t0        TCP 192.168.1.181:35780->192.168.1.181:39701 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1011u     IPv4           39460992       0t0        TCP 192.168.1.181:36128->192.168.1.181:40037 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1012u     IPv4           39450359       0t0        TCP 192.168.1.181:52228->192.168.1.181:45687 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1013u     IPv4           39460122       0t0        TCP 192.168.1.181:56590->192.168.1.181:35705 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1014u     IPv4           39452320       0t0        TCP 192.168.1.181:49086->192.168.1.181:34555 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1015u     IPv4           39455547       0t0        TCP 192.168.1.181:42624->192.168.1.181:44539 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1016u     IPv4           39461992       0t0        TCP 192.168.1.181:52438->192.168.1.181:44573 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1017u     IPv4           39461102       0t0        TCP 192.168.1.181:35066->192.168.1.181:34093 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1018u     IPv4           39461105       0t0        TCP 192.168.1.181:39780->192.168.1.181:43159 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1019u     IPv4           39452428       0t0        TCP 192.168.1.181:54792->192.168.1.181:46255 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1020u     IPv4           39456615       0t0        TCP 192.168.1.181:37186->192.168.1.181:36697 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1021u     IPv4           39456662       0t0        TCP 192.168.1.181:55848->192.168.1.181:38621 (CLOSE_WAIT)
gzserver  2322564 2322647 gzserver             mvespign 1022u     IPv4           39460225       0t0        TCP 192.168.1.181:41270->192.168.1.181:43359 (CLOSE_WAIT)
Maxx84 added the bug label Aug 30, 2023
scpeters added a commit that referenced this issue Feb 23, 2024
Publish to the new /world_control gz-transport topic
to avoid the issue with too many open files after
running `gz world` repeatedly.
Fixes #3341.

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
@scpeters (Member)

I briefly experimented with improving the shutdown behavior of the classic gazebo_transport objects (which use Boost Asio) but did not get quick results, so I instead modified the `gz world` command to publish using gz-transport (which uses ZeroMQ) and added corresponding subscribers to the World and any other classes that currently subscribe to ~/world_control (see #3374). It appears to fix the issue with running `gz world` repeatedly.
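
For illustration, a minimal sketch of the publishing side under this approach might look like the following. This is not the code from #3374; it assumes the ignition-transport and ignition-msgs libraries that gazebo11 builds against, and takes the /world_control topic name and the single-step request from the commit messages in this thread.

// Minimal sketch, not the actual implementation from #3374; assumes the
// ignition-transport and ignition-msgs versions that gazebo11 builds against.
#include <chrono>
#include <thread>

#include <ignition/msgs/world_control.pb.h>
#include <ignition/transport/Node.hh>

int main()
{
  ignition::transport::Node node;
  auto pub = node.Advertise<ignition::msgs::WorldControl>("/world_control");

  // Give discovery a moment so the first message is not dropped.
  std::this_thread::sleep_for(std::chrono::milliseconds(100));

  // Request a single simulation step, as `gz world -s` does.
  ignition::msgs::WorldControl msg;
  msg.set_multi_step(1);
  pub.Publish(msg);

  // The node and its ZeroMQ sockets are torn down cleanly when it goes
  // out of scope, so repeated runs do not accumulate CLOSE_WAIT sockets.
  return 0;
}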

scpeters added a commit that referenced this issue Feb 23, 2024
This fixes the issue with unclosed sockets leading to
"too many open files" errors after repeated runs of
`gz world` by adding a gz-transport /world_control
topic alongside the ~/world_control classic topic.
This works because gz-transport does a better job
closing sockets.

Subscribers to /world_control are added alongside
any existing subscribers to ~/world_control and the
`gz world` tool is changed to publish only on
/world_control.

Fixes #3341.

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
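
On the server side, the commit describes adding gz-transport subscribers alongside the existing classic ones. A rough sketch of such a subscriber is below; the callback name and body are hypothetical stand-ins for the existing ~/world_control handling, and the same library-version assumptions as in the sketch above apply.

// Illustrative sketch only; OnWorldControl is a hypothetical stand-in
// for the logic already run by the classic ~/world_control subscriber.
#include <ignition/msgs/world_control.pb.h>
#include <ignition/transport/Node.hh>

void OnWorldControl(const ignition::msgs::WorldControl &_msg)
{
  if (_msg.step() || _msg.multi_step() > 0)
  {
    // step the paused simulation, as the classic handler does
  }
}

int main()
{
  // This node would live alongside the classic transport node, so both
  // ~/world_control and /world_control keep working, while `gz world`
  // publishes only on the latter.
  ignition::transport::Node node;
  if (!node.Subscribe("/world_control", OnWorldControl))
    return 1;

  ignition::transport::waitForShutdown();
  return 0;
}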