Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INTEGRATION_world_clone failures can leave rogue gzserver processes #1299

Closed
osrf-migration opened this issue Oct 10, 2014 · 20 comments
Closed
Labels
all bug Something isn't working major testing

Comments

@osrf-migration
Copy link

Original report (archived issue) by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


The INTEGRATION_world_clone test spawns external gzserver processes and then tears them down (see world_clone.cc:199). If the test has an internal failure, it might not reach the point where it terminates the external gzserver process. This has happened two times recently, and it causes all the other tests to fail:

Should we update the tool/check_test_ran.py to kill any external gzserver instances that it sees?

@osrf-migration
Copy link
Author

Original comment by Jose Luis Rivero (Bitbucket: Jose Luis Rivero, GitHub: j-rivero).


It could potentially kill the future options of running all the tests in parallel using different gzservers or run manually gazebo while the test suite is being run. I would prefer to figure out a way of clean up only after the clone test and don't close the door to potential parallelization.

@osrf-migration
Copy link
Author

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


An alternative that we use when spawning other gzserver instances for tests is to give a finite --iters argument so that the process will terminate after a finite number of steps. I'm not sure the clone API allows for an iterations parameter to be set. @caguero what do you think?

@osrf-migration
Copy link
Author

Original comment by Carlos Agüero (Bitbucket: caguero, GitHub: caguero).


I like Steve's proposal. I can extend the ServerControl message that we've been using for cloning simulations with an extra optional message for setting the number of steps.

@osrf-migration
Copy link
Author

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


We haven't released the code in the default branch yet, so we could also just change the message if that was easier.

@osrf-migration
Copy link
Author

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


I just had another thought about the first idea of trying to kill rogue gzserver processes. When I ps aux | grep gzserver I see the following for the rogue test:

... gzserver /tmp/clone.11347.world

With some careful filtering, we should be able to terminate gzserver's spawned by a broken world_clone test and leave others intact. How do you feel about that @_jrivero_ ?

With regard to running multiple test suites in parallel, I'm pretty sure two world_clone tests could not be run in parallel since the test has a hard-coded port number. We would need to give each instance a unique number.

@osrf-migration
Copy link
Author

Original comment by Jose Luis Rivero (Bitbucket: Jose Luis Rivero, GitHub: j-rivero).


If we are able to determine which gzserver was spawned by the world_clone test, I think that this would be ideal. Parallel testing is still not on the table, but a nice idea to work on.

@osrf-migration
Copy link
Author

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


The following would be a workaround until the ServerControl message is modified:

diff -r 5996b049262c05490f6a474833f4f7df92ab01a5 tools/check_test_ran.py
--- a/tools/check_test_ran.py	Fri Oct 10 11:41:12 2014 -0700
+++ b/tools/check_test_ran.py	Tue Oct 14 15:23:37 2014 -0700
@@ -43,6 +43,8 @@
 NAME="check_test_ran.py"
 
 import os
+import re
+import string
 import sys
 import subprocess
 
@@ -107,5 +109,18 @@
         stylesheet = os.path.dirname(os.path.abspath(__file__)) + "/qtest_to_junit.xslt"
         run_xsltproc(stylesheet, test_file)
 
+    # kill rogue gzservers from INTEGRATION_world_clone test
+    # https://osrf-migration.github.io/gazebo-gh-pages/#!/osrf/gazebo/issues/1299 (#1299)
+    if 'INTEGRATION_world_clone' in test_file:
+        process = subprocess.Popen(['ps', 'ax'], stdout=subprocess.PIPE)
+        stdout, stderr = process.communicate()
+        for line in string.split(stdout, '\n'):
+            if line.find('gzserver /tmp/clone.11347.world') >= 0:
+                m = re.search('^([0-9]+) .*', line)
+                print(line)
+                pid = m.group(1)
+                print("killing gzserver with pid %s" % (pid))
+                subprocess.call(["kill", "%s" % (pid)])
+
 if __name__ == '__main__':
     check_main()

To test, apply the following:

diff -r 5996b049262c05490f6a474833f4f7df92ab01a5 test/integration/CMakeLists.txt
--- a/test/integration/CMakeLists.txt	Fri Oct 10 11:41:12 2014 -0700
+++ b/test/integration/CMakeLists.txt	Tue Oct 14 15:23:37 2014 -0700
@@ -92,6 +92,8 @@
   set_tests_properties(${TEST_TYPE}_factory PROPERTIES TIMEOUT 500)
   set_tests_properties(${TEST_TYPE}_physics PROPERTIES TIMEOUT 500)
 
+  set_tests_properties(${TEST_TYPE}_world_clone PROPERTIES TIMEOUT 10)
+
   # Add plugin dependency
   add_dependencies(${TEST_TYPE}_joint_test SpringTestPlugin)
 endif()
diff -r 5996b049262c05490f6a474833f4f7df92ab01a5 test/integration/world_clone.cc
--- a/test/integration/world_clone.cc	Fri Oct 10 11:41:12 2014 -0700
+++ b/test/integration/world_clone.cc	Tue Oct 14 15:23:37 2014 -0700
@@ -195,7 +195,7 @@
   // Check that the cloned world contains the camera topics.
   output = custom_exec_str("gz topic -l");
   EXPECT_NE(output.find("/gazebo/default/camera/"), std::string::npos);
-
+FAIL();
   // Kill the cloned server. In the case of no presence of gzserver ps will
   // return the own ps process so it probably will do nothing. No effect.
   // This should work on Linux and Mac

and run this command:

$ make test ARGS="-VV -R INTEGRATION_world_clone"
...
250: Checking for test results in /tmp/gazebo_build/source/build/test_results/INTEGRATION_world_clone.xml
250: 12427 pts/13   Sl+    0:01 gzserver /tmp/clone.11347.world
250: killing gzserver with 12427
2/2 Test #250: check_INTEGRATION_world_clone ....   Passed    0.04 sec
...
$ ps ax | grep 'gzserver /tmp/clone.11347.world'

I'll make a pull request for this when I have a chance.

@osrf-migration
Copy link
Author

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


pull request #1232

@osrf-migration
Copy link
Author

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


  • changed state from "new" to "resolved"

pull request #1232

@osrf-migration
Copy link
Author

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


It looks like there was another failure even after the fix from pull request #1232:

http://build.osrfoundation.org/job/gazebo-default-devel-trusty-amd64-gpu-nvidia/127/consoleFull

@osrf-migration
Copy link
Author

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


  • changed state from "resolved" to "open"

This wasn't fixed entirely, since there is a bug in pull request #1232

@osrf-migration
Copy link
Author

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


I found the bug from a test log on my machine, viewable at https://gist.github.com/scpeters/a355665271490e81c381

There seems to be an issue with my regex for extracting the process id (pid), which was assuming no whitespace at the beginning of the line (which can happen actually if the pid has a small number of digits compared to other pid's). Fix is forthcoming in branch issue_1299.

@osrf-migration
Copy link
Author

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


pull request #1334

@osrf-migration
Copy link
Author

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


  • changed state from "open" to "resolved"

for the second time: pull request #1334

@osrf-migration
Copy link
Author

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


I just saw another failure. I had to kill -9 gzserver to resolve it.

@osrf-migration
Copy link
Author

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


  • changed state from "resolved" to "open"

we need to do kill -9 gzserver

@osrf-migration
Copy link
Author

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


see pull request #1443

@osrf-migration
Copy link
Author

Original comment by Steve Peters (Bitbucket: Steven Peters, GitHub: scpeters).


  • changed state from "open" to "resolved"

pull request #1443

@osrf-migration
Copy link
Author

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


  • set version to "all"

@osrf-migration
Copy link
Author

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


  • changed state from "resolved" to "closed"

@osrf-migration osrf-migration added major testing bug Something isn't working all labels Apr 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
all bug Something isn't working major testing
Projects
None yet
Development

No branches or pull requests

1 participant