Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a retry mechanism on server startup failure and use TestServerProcess for port generation #1169

Closed

Conversation

MVrachev
Copy link
Collaborator

@MVrachev MVrachev commented Oct 8, 2020

Fixes #1124, Fixes #1111

Description of the changes being introduced by the pull request:

Adds a retry mechanism on server startup failure. This mechanism is useful for the tests
which are using server subprocesses and if any of the ports they are trying to connect on is already taken.
The other changes in this pr make sure we are using only TestServerProcess for port generation and finally
I removed the port argument as a whole from the TestServerProcess constructor.

Please verify and check that the pull request fulfills the following
requirements
:

  • The code follows the Code Style Guidelines
  • Tests have been added for the bug fix or new feature
  • Docs have been added for the bug fix or new feature

@MVrachev
Copy link
Collaborator Author

MVrachev commented Oct 8, 2020

I was wondering should we remove wait_for_server() function calls and the function itself if we are to merge this pr?
We have a mechanism to check if a server has successfully connected, so what advantages do we get from calling wait_for_server() after those checks?
What do you think @jku @joshuagl?

PS: Can somebody restart Travis CI? It failed with a strange error...

Copy link
Member

@joshuagl joshuagl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work here, thanks. I have some comments and suggestions inline.

tests/proxy_server.py Outdated Show resolved Hide resolved
tests/simple_https_server.py Outdated Show resolved Hide resolved
tests/simple_server.py Outdated Show resolved Hide resolved
tests/utils.py Outdated Show resolved Hide resolved
tests/utils.py Outdated Show resolved Hide resolved
tests/utils.py Outdated Show resolved Hide resolved
tests/utils.py Outdated Show resolved Hide resolved
tests/test_updater.py Outdated Show resolved Hide resolved
tests/test_slow_retrieval_attack.py Outdated Show resolved Hide resolved
tests/test_slow_retrieval_attack.py Outdated Show resolved Hide resolved
@MVrachev MVrachev force-pushed the generate-free-ports branch 2 times, most recently from f6aef32 to 392d4cd Compare October 14, 2020 13:10
@MVrachev
Copy link
Collaborator Author

Updated the commits with all of your suggestions @joshuagl except on one of your comments.
There I hope Jussi can give his opinion as you have suggested.

@jku
Copy link
Member

jku commented Oct 15, 2020

I was wondering should we remove wait_for_server() function calls and the function itself if we are to merge this pr?
We have a mechanism to check if a server has successfully connected, so what advantages do we get from calling wait_for_server() after those checks?

I'm not 100% sure -- maybe successful binding is enough to guarantee that there is a response when the port is tried? It seems reasonable but I would not bet retirement money on that being true on every platform.

Maybe don't remove it now, but add a calendar notification a few weeks after merging. Then we can check CI logs to see whether the combination helped and can remove wait_for_server() as a test to see if your work here is enough alone -- that would be ideal.

Copy link
Member

@jku jku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwise looks good, left comments about slow retrieval mostly...

tests/utils.py Outdated Show resolved Hide resolved
tests/utils.py Outdated Show resolved Hide resolved
tests/test_slow_retrieval_attack.py Outdated Show resolved Hide resolved
tests/test_slow_retrieval_attack.py Outdated Show resolved Hide resolved
tests/test_slow_retrieval_attack.py Outdated Show resolved Hide resolved
@jku
Copy link
Member

jku commented Oct 15, 2020

Commenting to write down results from chat: I think there is probably a slow_retrieval_server.py refactoring that would mean the class separation and the new function are not required at all in test_slow_retrieval_attack.py -- but I'm definitely not saying it has to be done: spending too much time on these tests is probably not a good investment.

tests/utils.py Outdated Show resolved Hide resolved
@joshuagl
Copy link
Member

Commenting to write down results from chat: I think there is probably a slow_retrieval_server.py refactoring that would mean the class separation and the new function are not required at all in test_slow_retrieval_attack.py -- but I'm definitely not saying it has to be done: spending too much time on these tests is probably not a good investment.

Agreed. Particularly as slow retrieval is currently not part of the spec and could be removed from the reference implementation (see #1156).

@MVrachev MVrachev force-pushed the generate-free-ports branch 2 times, most recently from 4f8a15b to d30bb04 Compare October 19, 2020 15:00
@MVrachev
Copy link
Collaborator Author

MVrachev commented Oct 19, 2020

I addressed all of @jku feedbacks.

I rebased and changed the following thinks:

  1. In tests/utils.py I set up a hardcoded amount of possible port generation attempts.

  2. Changed the place of the If the server process has exited check in has_server_started in tests/utils.py.

  3. Removed the expected to fail test from tests/test_slow_retrieval_attack.py. As I said in my modified commit message:

Slow retrievals have been removed from the specification and soon 
it will be removed from the tuf reference implementation as a whole.
This means that the chances of making this test useful are close to 0 if not none.
  1. Because of this comment Add a retry mechanism on server startup failure and use TestServerProcess for port generation #1169 (comment) made by @jku I made a research and found that we creating one additional temporary folder per test class in many of our test files.
    Fixed that in the last commit.

Copy link
Member

@joshuagl joshuagl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice set of changes, thanks Martin!

tests/utils.py Outdated Show resolved Hide resolved
tests/slow_retrieval_server.py Show resolved Hide resolved
tests/utils.py Outdated Show resolved Hide resolved
Copy link
Member

@jku jku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not look at the new commit yet -- it sounds good but also looks like a lot of unrelated changes that are not the easiest to review... If you want to keep it in this PR we can do that but it could be a separate one as well

tests/utils.py Outdated Show resolved Hide resolved
tests/utils.py Outdated Show resolved Hide resolved
tests/utils.py Outdated Show resolved Hide resolved
tests/utils.py Outdated Show resolved Hide resolved
tests/utils.py Outdated Show resolved Hide resolved
tests/utils.py Outdated Show resolved Hide resolved
tests/utils.py Outdated Show resolved Hide resolved
@MVrachev
Copy link
Collaborator Author

I did not look at the new commit yet -- it sounds good but also looks like a lot of unrelated changes that are not the easiest to review... If you want to keep it in this PR we can do that but it could be a separate one as well

Yea, maybe you are right.
This is a big cleanup and it deserves its own pr and it's maybe hard to review.
Additionally, there could be a discussion it.

This mechanism is useful for the tests which are using server
subprocesses and if any of the ports they are trying to connect on
is already taken.

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
Remove the test with mode 2 ('mode_2': During the download process,
the server blocks the download by sending just several characters
every few seconds.) from test_slow_retrieval.

This test is marked as "expected failure" with the purpose of
rewriting it one day, but slow retrievals have been removed from
the specification and soon it will be removed from the tuf
reference implementation as a whole.
That means that the chances of making this test useful are close
to 0 if not none.

The other test (with mode 1) in test_slow_retrieval is not removed.

For reference:
- theupdateframework/specification#111
- theupdateframework#1156

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
Until now, slow_retrieval.py couldn't use the TestServerProcess class
port generation and retry mechanism from utils.py because we were
using httpd.handle_request() which handles only ONE request.
Then, what happened is that when we use wait_for_server() to make
a test connection and verify that the server is up,
the slow_retrieval server handles that connection
(which it accepts as a request) and exits.

We avoided that use-case by passing timeout = 0 and avoiding
calling wait_for_server() on this special value.
Now, when we use  httpd.serve_forever() this problem is resolved
and no longer we need to make those checks.

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
The port argument for __init__  function in TestServerProcess
is no longer needed because none of the tests is using it.

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
We want to make sure we generate the server ports only in
TestServerProcess class from utils.py because it has a retry
mechanism on server startup failure which makes it more reliable.

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
We added a retry mechanism on server startup failure which
generates a new port and tries to start the server again
until the server is started or the timeout has expired.

We want to make sure that this mechanism is well tested as for
usual cases same for corner cases e.g. when the server hangs on
forever or just exits unexpectedly.

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
Now, after we can use wait_for_server and the retry mechanism
of TestServerProcess in utils.py we no longer need to use
sleep in this test file.

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
@MVrachev
Copy link
Collaborator Author

MVrachev commented Oct 22, 2020

I didn't want to rebase this time, but because I had to change the commit messages I had to.

I found a lot of different small or bigger problems while addressing your comments @joshuagl and @jku.
Thanks for the nice reviews!

There are a lot of changes (for which I am sorry that I didn't submit them in the begging) which I will try to summarize:

  1. Removed the last commit which removed the extra temporary folder. Will submit a new pr for it.

  2. Made utils.py to use one variable as a condition for the two loops.

  3. Moved the startup of the server in utils.py in a separate function.

  4. Added tests for utils.py to make sure that we catch the usual use-cases for our abstraction, but also the corner cases (e.g. when the server hangs on forever or exits unexpectedly).

  5. Added a helper function in utils.py called is_process_running which I then used through utils.py and its tests.

  6. Removed the need to use timeout = 0 and make slow_retrieval.py to use the standard workflow in utils.py.

  7. Made sure that we should use the port generation workflow from utils.py when we run the four server files (e.g. simple_server.py, slow_retrieval.py ...)

  8. Added a custom exception when the port is not provided.

  9. Changed the commit messages for Remove test_slow_retrieval expected failure test and Remove port arg from __init__ in TestServerProcess

  10. Removed sleep from test_slow_retrieval_attack.py . We no longer get TimeoutError without it.

  11. Updated some of the comments am introducing them now and not in the begging), but I will summarize them:

Copy link
Member

@jku jku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few specific comments, but I have some general ones as well:

  • I'm not 100% sold on the tests: the idea is good (TestServerProcess is complex enough that it should ideally be tested) but testing this properly is not trivial and the implementation should not mean actual code (e.g. simple_server.py) gets more complex because of it
  • The server startup functionality still smells complex: I think this is the code that will make or break the CI reliability, and it should be both correct and simple enough for everyone to see that it's correct. Not going to details here: let's talk it through later

Comment on lines +52 to +54
class ImproperNumberOfArguments(Exception):
"""Raised when the number of argumnets is wrong."""
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not add exceptions if we don't intend to ever handle them (also, server scripts importing the utils.py that starts the server scripts... this does not look right)

# collecting the logs per test.
self.__server_process = subprocess.Popen(command,
stdout=self.__temp_log_file, stderr=subprocess.STDOUT, cwd=popen_cwd)
timeout = timeout - (time.time() - start)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not do what it should :)


# Used for tests in tests/test_utils.py
if len(sys.argv) > 2:
if sys.argv[2] == "stop":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first, this script already uses argv[2] for something else. Second: I'm not 100% convinced these new features should be here: If you need this kind of "breaking server", add a new script instead

def is_process_running(self):
"""Returns a boolean value if the server process is currently running."""

return True if self.__server_process.returncode is None else False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after testing and checking docs I found out that returncode does not actually work if poll() or wait() has not been called: so we should just use poll() here instead.

I may have been the one who said "use returncode", sorry about that

@MVrachev
Copy link
Collaborator Author

While addressing the comments made by @jku I read somewhere that we can use a simpler way to generate unused ports as showed here: https://docs.python.org/3/library/socketserver.html#asynchronous-mixins

    # Port 0 means to select an arbitrary unused port
    HOST, PORT = "localhost", 0
    server = ThreadedTCPServer((HOST, PORT), ThreadedTCPRequestHandler)

The same method applies for BaseHTTPServer.HTTPServer, TCPServer and ThreadingHTTPServer which we are using in our server scripts used for testing.

This means that instead of us generating a port, verifying that it's not busy, and using a retry mechanism if the port is busy we can simply start the server like that

six.moves.socketserver.TCPServer(('localhost', 0), handler)

and arbitrary unused port will be chosen for us.

I have implemented that new mechnaism in a different branch.

I will close this pr because there is no longer sense for us to generate a new port and validate it as I did and we discussed in this pr and I will create a new pr where I will push this new branch.

@MVrachev MVrachev closed this Oct 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants