
tests (simple_server.py) still unreliable #1111

Closed
jku opened this issue Aug 17, 2020 · 4 comments · Fixed by #1198

jku commented Aug 17, 2020

I was hoping the unreliability would mostly disappear with #1096, but that does not seem to be the case:

  • sometimes one test out of 213 fails: the test server startup takes longer than 10 seconds and times out (example: https://travis-ci.org/github/theupdateframework/tuf/jobs/718616338)
  • the test set run times are very predictable, 213 tests in 54-57 seconds: it seems unlikely that server startups would commonly take multiple seconds, let alone 10 seconds
  • a failing test run takes 64-65 seconds: very close to a normal test run plus the 10-second timeout
  • on a failing test run, only one test fails

Based on the timing info above, my hypothesis is that the server startup is not actually that slow: it's just that sometimes it's not going to happen at all. Could it be that there is/was something running on the chosen port already, and for some reason we don't get EADDRINUSE but end up waiting for a long time?
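
One quick way to probe that hypothesis from the test runner would be a small connectivity check like the hypothetical helper below (not part of utils.py; the host and timeout values are assumptions):

import socket

def port_in_use(port, host='localhost', timeout=1):
  # Hypothetical diagnostic: returns True if something already accepts
  # connections on the given port, which would explain a child server
  # stalling or dying instead of serving our test data.
  with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(timeout)
    return sock.connect_ex((host, port)) == 0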


jku commented Aug 17, 2020

Possible workaround: add a def start_server(server='simple_server.py', port=0, timeout=10, retries=3) to utils.py that

  • takes care of randomizing the port (and returns the port number along with the new subprocess)
  • handles waiting for server to start
  • retries the server start on timeout

This does not fix the underlying issue (whatever it is), but I think it would effectively work around it: I've never seen a case where two server startups failed in a row, so retrying should alleviate the issue. A rough sketch of the helper follows.
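
A minimal sketch of the proposed helper, assuming (hypothetically) that the server script accepts the port number as its first command-line argument and that port=0 means "pick a random port"; none of this is taken from the current simple_server.py:

import random
import socket
import subprocess
import sys
import time

def start_server(server='simple_server.py', port=0, timeout=10, retries=3):
  # Start a test server subprocess, wait for it to accept connections and
  # return (process, port); retry on a fresh random port if startup times out.
  for _ in range(retries):
    chosen_port = port or random.randint(30000, 45000)
    proc = subprocess.Popen([sys.executable, server, str(chosen_port)])

    deadline = time.time() + timeout
    while time.time() < deadline:
      with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1)
        if sock.connect_ex(('localhost', chosen_port)) == 0:
          return proc, chosen_port
      time.sleep(0.1)

    # Startup timed out: kill this attempt and try again on a new port.
    proc.kill()
    proc.wait()

  raise TimeoutError('server did not start after %d attempts' % retries)

The connect loop here plays the same role as the existing wait_for_server(); in utils.py it could simply call that helper instead.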

jku changed the title from "test (simple_server.py) still unreliable" to "tests (simple_server.py) still unreliable" on Aug 17, 2020

jku commented Aug 18, 2020

Chatted with Martin, current plan is:


jku commented Aug 18, 2020

More details from https://travis-ci.org/github/theupdateframework/tuf/jobs/718988503 (this one happens to print server output):

Traceback (most recent call last):
  File "simple_server.py", line 85, in <module>
    httpd = six.moves.socketserver.TCPServer(('', PORT), handler)
  File "/opt/python/3.6.7/lib/python3.6/socketserver.py", line 453, in __init__
    self.server_bind()
  File "/opt/python/3.6.7/lib/python3.6/socketserver.py", line 467, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use
E..
======================================================================
ERROR: setUpClass (test_updater_root_rotation_integration.TestUpdater)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/theupdateframework/tuf/tests/test_updater_root_rotation_integration.py", line 98, in setUpClass
    utils.wait_for_server('localhost', cls.SERVER_PORT)
  File "/home/travis/build/theupdateframework/tuf/tests/utils.py", line 74, in wait_for_server
    raise TimeoutError
TimeoutError

So it seems the child process is getting EADDRINUSE! (We can't be 100% sure before #1104 is fixed, because we don't know for certain which test the simple_server.py output is from, and we don't know whether this happens in all error cases since the output is not collected everywhere.) I will have to think about whether there's a better way to solve this, but the workaround I proposed should definitely work.


jku commented Sep 23, 2020

There is another case to take into account:

  • server fails to start with EADDRINUSE
  • but wait_for_server() succeeds because another server already responds on that port

The only way to solve this case reliably is for the code starting the server process to get a message from the server process saying "bind succeeded". The simple way to do this (sketched after the list below) would be to

  • make sure all server processes log a specific message
  • in the code that starts the server, wait until the message is seen
  • if the message does not appear and the process exits instead, try to restart the server on a new port
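
A rough sketch of that handshake, assuming (hypothetically) that the server prints a line containing "bind succeeded" to stdout once bind() has returned; the marker string, port handling and command line are assumptions, not the current simple_server.py behaviour:

import random
import subprocess
import sys

READY_MARKER = 'bind succeeded'  # hypothetical line logged by the server

def start_server_checked(server='simple_server.py', retries=3):
  # Start the server and wait for its own "bind succeeded" message, so we
  # never mistake some unrelated process on the same port for our server.
  for _ in range(retries):
    port = random.randint(30000, 45000)
    proc = subprocess.Popen([sys.executable, server, str(port)],
                            stdout=subprocess.PIPE, universal_newlines=True)

    for line in proc.stdout:
      if READY_MARKER in line:
        return proc, port

    # stdout closed without the marker: the child exited (e.g. with
    # EADDRINUSE), so reap it and retry on a different port.
    proc.wait()

  raise RuntimeError('server never reported a successful bind')

In practice the stdout read would still need a timeout (a server that hangs before printing the marker would block the loop forever), and the pipe has to keep being drained if the server logs more after startup, but this shows the idea.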
