
tests (simple_server.py) still unreliable #1111

Closed
jku opened this issue Aug 17, 2020 · 4 comments · Fixed by #1198

jku commented Aug 17, 2020

I was hoping the unreliability would mostly disappear with #1096, but that does not seem to be the case:

  • sometimes one test out of 213 fails: the test server startup takes longer than 10 seconds and times out (example: https://travis-ci.org/github/theupdateframework/tuf/jobs/718616338)
  • the test set run times are very predictable, 213 tests in 54-57 seconds: it seems unlikely that server startups would commonly take multiple seconds, let alone 10 seconds
  • a failing test run takes 64-65 seconds: very close to a normal test run plus the 10-second timeout
  • on a failing test run, only one test fails

Based on the timing info above, my hypothesis is that the server startup is not actually that slow: it's just that sometimes it's not going to happen at all. Could it be that there is/was something running on the chosen port already, and for some reason we don't get EADDRINUSE but end up waiting for a long time?
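
One quick way to probe that hypothesis from the test runner would be a small connectivity check like the hypothetical helper below (not part of utils.py; the host and timeout values are assumptions):

import socket

def port_in_use(port, host='localhost', timeout=1):
  # Hypothetical diagnostic: returns True if something already accepts
  # connections on the given port, which would explain a child server
  # stalling or dying instead of serving our test data.
  with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(timeout)
    return sock.connect_ex((host, port)) == 0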


jku commented Aug 17, 2020

Possible workaround: add a def start_server(server='simple_server.py', port=0, timeout=10, retries=3) to utils.py that

  • takes care of randomizing the port (and returns the port number along with the new subprocess)
  • handles waiting for server to start
  • retries the server start on timeout

This does not fix the underlying issue (whatever it is), but I think it would effectively work around it: I've never seen a case where two server startups failed in a row, so retrying should alleviate the issue. A rough sketch of the helper follows.
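
A minimal sketch of the proposed helper, assuming (hypothetically) that the server script accepts the port number as its first command-line argument and that port=0 means "pick a random port"; none of this is taken from the current simple_server.py:

import random
import socket
import subprocess
import sys
import time

def start_server(server='simple_server.py', port=0, timeout=10, retries=3):
  # Start a test server subprocess, wait for it to accept connections and
  # return (process, port); retry on a fresh random port if startup times out.
  for _ in range(retries):
    chosen_port = port or random.randint(30000, 45000)
    proc = subprocess.Popen([sys.executable, server, str(chosen_port)])

    deadline = time.time() + timeout
    while time.time() < deadline:
      with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1)
        if sock.connect_ex(('localhost', chosen_port)) == 0:
          return proc, chosen_port
      time.sleep(0.1)

    # Startup timed out: kill this attempt and try again on a new port.
    proc.kill()
    proc.wait()

  raise TimeoutError('server did not start after %d attempts' % retries)

The connect loop here plays the same role as the existing wait_for_server(); in utils.py it could simply call that helper instead.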

jku changed the title from "test (simple_server.py) still unreliable" to "tests (simple_server.py) still unreliable" on Aug 17, 2020

jku commented Aug 18, 2020

Chatted with Martin, current plan is:


jku commented Aug 18, 2020

More details from https://travis-ci.org/github/theupdateframework/tuf/jobs/718988503 (this one happens to print server output):

Traceback (most recent call last):
  File "simple_server.py", line 85, in <module>
    httpd = six.moves.socketserver.TCPServer(('', PORT), handler)
  File "/opt/python/3.6.7/lib/python3.6/socketserver.py", line 453, in __init__
    self.server_bind()
  File "/opt/python/3.6.7/lib/python3.6/socketserver.py", line 467, in server_bind
    self.socket.bind(self.server_address)
OSError: [Errno 98] Address already in use
E..
======================================================================
ERROR: setUpClass (test_updater_root_rotation_integration.TestUpdater)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/theupdateframework/tuf/tests/test_updater_root_rotation_integration.py", line 98, in setUpClass
    utils.wait_for_server('localhost', cls.SERVER_PORT)
  File "/home/travis/build/theupdateframework/tuf/tests/utils.py", line 74, in wait_for_server
    raise TimeoutError
TimeoutError

So it seems the child process is getting EADDRINUSE! (We can't be 100% sure before #1104 is fixed, because we don't know for certain which test the simple_server.py output is from, and we don't know whether this happens in all error cases since the output is not collected everywhere.) I will have to think about whether there's a better way to solve this, but the workaround I proposed should definitely work.


jku commented Sep 23, 2020

There is another case to take into account:

  • server fails to start with EADDRINUSE
  • but wait_for_server() succeeds because another server already responds on that port

The only way to solve this case reliably is for the code starting the server process to get a message from the server process saying "bind succeeded". The simple way to do this (sketched after the list below) would be to

  • make sure all server processes log a specific message
  • in the code that starts the server, wait until the message is seen
  • if the message does not appear and the process exits instead, try to restart the server on a new port
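
A rough sketch of that handshake, assuming (hypothetically) that the server prints a line containing "bind succeeded" to stdout once bind() has returned; the marker string, port handling and command line are assumptions, not the current simple_server.py behaviour:

import random
import subprocess
import sys

READY_MARKER = 'bind succeeded'  # hypothetical line logged by the server

def start_server_checked(server='simple_server.py', retries=3):
  # Start the server and wait for its own "bind succeeded" message, so we
  # never mistake some unrelated process on the same port for our server.
  for _ in range(retries):
    port = random.randint(30000, 45000)
    proc = subprocess.Popen([sys.executable, server, str(port)],
                            stdout=subprocess.PIPE, universal_newlines=True)

    for line in proc.stdout:
      if READY_MARKER in line:
        return proc, port

    # stdout closed without the marker: the child exited (e.g. with
    # EADDRINUSE), so reap it and retry on a different port.
    proc.wait()

  raise RuntimeError('server never reported a successful bind')

In practice the stdout read would still need a timeout (a server that hangs before printing the marker would block the loop forever), and the pipe has to keep being drained if the server logs more after startup, but this shows the idea.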
