
Delegate port generation for the tests to the OS #1192

Closed


@MVrachev (Collaborator) commented Oct 28, 2020

Fixes #1124, Fixes #1111

Description of the changes being introduced by the pull request:

This PR aims to make the creation of new server subprocesses more stable.
It fixes issues caused by trying to use ports that are already taken, such as
[Errno 98] Address already in use or ConnectionRefusedError: [Errno 111].

By passing 0 as the port argument when we start the servers, we ask the OS to give us
an arbitrary unused port.
This is much better than generating a port ourselves, because
instead of generating a port, verifying that it is free, and retrying if it is busy,
we can simply rely on the OS to do that job for us.

This means that the server scripts receive their ports when they are created and then send them back to the
parent process that created them.
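The mechanism is the standard one for ephemeral ports; a minimal sketch using Python's stdlib http.server (not the PR's actual server scripts):

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Binding to port 0 asks the OS to pick any free ephemeral port;
# the port actually chosen can then be read back from server_address.
httpd = HTTPServer(('localhost', 0), SimpleHTTPRequestHandler)
port = httpd.server_address[1]
print('serving on port', port)  # the server script reports this back to its parent
httpd.server_close()
```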

Other changes/cleanups in this PR include:

  • remove the expected-failure test from test_slow_retrieval
  • make sure server ports are never generated outside the server scripts
  • remove the port argument from TestServerProcess in tests/utils.py
  • remove unused imports
  • remove try/except blocks in the server test files

This PR is related to the closed PR #1169, but because
I decided to use a completely different approach, I thought it made more sense to close that one and create this one.

PS: Many thanks to @jku, with whom I had regular discussions about this PR!

Please verify and check that the pull request fulfills the following requirements:

  • The code follows the Code Style Guidelines
  • Tests have been added for the bug fix or new feature
  • Docs have been added for the bug fix or new feature

Remove the test with mode 2 ('mode_2': during the download process,
the server blocks the download by sending just a few characters
every few seconds) from test_slow_retrieval.

This test is marked as an expected failure with the intention of
rewriting it one day, but slow retrieval attacks have been removed from
the specification and will soon be removed from the tuf
reference implementation as a whole.
That means the chances of this test ever becoming useful are close to zero.

The other test (with mode 1) in test_slow_retrieval is not removed.

For reference:
- theupdateframework/specification#111
- theupdateframework#1156

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
By passing 0 as the port argument we ask the OS to give us
an arbitrary unused port.
This is much better than generating a port ourselves, because
even a randomly generated port may already be in use.

In this commit, I also make sure that there is no place in the tests
where we manually generate and pass ports.

Also, because none of the tests pass a port to the TestServerProcess
class, I have removed the "port" argument from its __init__ function.

Finally, until now slow_retrieval.py could not use the port generation
of the TestServerProcess class in utils.py, because it used
httpd.handle_request(), which handles only ONE request.
When wait_for_server() made a test connection to verify that the
server was up, the slow_retrieval server handled that connection
(which it accepted as a request) and exited.

We worked around that by passing timeout = 0 and skipping the
wait_for_server() call for this special value.
Now that we use httpd.serve_forever(), this problem is resolved
and we no longer need those checks.
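The difference can be sketched with stdlib pieces (a hypothetical stand-in, not the actual slow_retrieval.py): with serve_forever() a probe connection no longer consumes the server's only request.

```python
import socket
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler

httpd = HTTPServer(('localhost', 0), SimpleHTTPRequestHandler)
port = httpd.server_address[1]

# With serve_forever() the server keeps accepting connections, so a
# wait_for_server()-style probe connection does not make the server exit
# the way a single httpd.handle_request() call would.
thread = threading.Thread(target=httpd.serve_forever, daemon=True)
thread.start()

# Two successive probe connections both succeed:
probes = 0
for _ in range(2):
    with socket.create_connection(('localhost', port), timeout=5):
        probes += 1

httpd.shutdown()
httpd.server_close()
```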

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
Now that we can use wait_for_server and the retry mechanism
of TestServerProcess in utils.py, we no longer need to use
sleep in this test file.

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
We want to make sure that the servers are successfully started in
the common use cases and that the new port generation works.

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>

MVrachev commented Oct 28, 2020

I had some discussions with @jku about whether we should add tests for the broken cases when we create a server process:
for example, when we don't receive the expected message in the temp log file from the server, or when the server process has died, etc.
The problem is that such tests would have to wait for the whole timeout to pass and for the Timeout exception to be raised.
We don't want to use the default timeout of 10 seconds for each of those tests, given that on a successful run on my VM the server process starts in about 0.10–0.20 seconds.
On the other hand, we don't want to make the timeout too low, because Travis CI or AppVeyor could be too slow to start the process at all, in which case a Timeout exception would mean we haven't tested anything.

One solution I can think of is using the default timeout and moving these tests to a different directory that is executed only by
Travis CI and AppVeyor. Those tests are not exactly unit tests but more like integration tests, so to me that sounds logical.


@jku left a comment


I like this! The only hack here is parsing server logs to get the port but I believe that's a decent price to pay (and arguably less of a hack than retrying on bind() failure). Tests should be more reliable and faster because of this -- and not just because one very slow test was removed.

Only real question is why _start_server() still has two loops: I thought the point was that the outer loop was replaced by letting bind() pick a port...

@MVrachev replied:

I like this! The only hack here is parsing server logs to get the port but I believe that's a decent price to pay (and arguably less of a hack than retrying on bind() failure). Tests should be more reliable and faster because of this -- and not just because one very slow test was removed.

Only real question is why _start_server() still has two loops: I thought the point was that the outer loop was replaced by letting bind() pick a port...

Yes, I was thinking that if we have a problem starting the server we can try again if we haven't finished the timeout?
Or probably it's best just to fail because it won't be a problem related to already taken port...


jku commented Oct 28, 2020

Or probably it's best just to fail because it won't be a problem related to already taken port...

Yes: we have no idea why it might have failed so no reason to believe it would be fixed by retrying.


jku commented Oct 28, 2020

Oh one more thing about the new tests: Apart from the clean-test I think everything else is already tested by the current tests. The new tests are obviously more "unit-testy" since they only test a single thing ... but still: is there value in adding tests that are basically subsets of existing tests?


MVrachev commented Oct 29, 2020

Oh one more thing about the new tests: Apart from the clean-test I think everything else is already tested by the current tests. The new tests are obviously more "unit-testy" since they only test a single thing ... but still: is there value in adding tests that are basically subsets of existing tests?

They give us a way to clearly see when we have a bug while starting any of the subprocesses running a server script, which is not always easy to spot given that most of those server subprocesses are initialized in setUpClass, where we can't capture stdout or stderr.

On my virtual machine the tests run in about 0.5–1.0 seconds, and with Python 2 (where none of the tests are skipped) the timing is the same.

@trishankatdatadog

Just wanted to say: keep up the great work, y'all! cc @sechkova @joshuagl


@jku left a comment


Getting close... LGTM except for the loop removal that seems incorrect (indent change needed for the elapsed time calculation)

@MVrachev MVrachev force-pushed the change-port-generation branch 2 times, most recently from d260a70 to de8b896 Compare October 29, 2020 14:14
Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
@MVrachev

I realized that now that we have removed the outer loop, I can add tests that check the case when a server exits before the timeout expires.

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
As discussed with Jussi, using try/except blocks when instantiating
the servers doesn't bring much value: the "bind failed" message
could be useful, but there are other messages explaining almost
the same thing that will be logged from the parent process as well.

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>

@jku jku left a comment


Man from Del Monte says yes 👍

Signed-off-by: Martin Vrachev <mvrachev@vmware.com>
@@ -37,6 +37,7 @@
try:
# is defined in Python 3
TimeoutError
ChildProcessError

@jku Oct 30, 2020


This will not work:

  • if TimeoutError fails, you define class ChildProcessError?
  • the code that uses this is in another file and uses utils.ChildProcessError -- that won't be defined in python 3


@jku Oct 30, 2020


I think it makes sense to just always define an error of your own since ChildProcessError is not in python2 (it's not 100% correct even in python3): TestServerError or something.

@MVrachev replied:


Both TimeoutError and ChildProcessError are available in Python 3.x and not available in Python 2.x, so if TimeoutError is not found, it makes sense to define both of them.

the code that uses this is in another file and uses utils.ChildProcessError -- that won't be defined in python 3

You are right about that; I was mistaken in my assumption.

I think that ChildProcessError is a suitable name for that error.
In test_utils.py I will fix the error with:

    child_process_error = None
    if sys.version_info.major == 2:
      child_process_error = utils.ChildProcessError
    else:
      child_process_error = ChildProcessError

@jku replied:


Both TimeoutError and ChildProcessError are available on python 3.X and not available on python 2.X, so if TimeoutError is not found then it makes sense to define both of them.

No: you can either decide things based on the Python version or based on feature testing. Using a feature test as if it were a version test is wrong even if it happens to work right now.
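The pattern described here, testing each name on its own instead of inferring one from the other, would look something like this (a sketch of the principle, not the code merged in the PR):

```python
# Feature-test each builtin separately rather than assuming that the
# absence of TimeoutError implies the absence of ChildProcessError.
try:
    TimeoutError
except NameError:  # Python 2: define a stand-in
    class TimeoutError(Exception):
        pass

try:
    ChildProcessError
except NameError:  # Python 2: define a stand-in
    class ChildProcessError(Exception):
        pass
```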


jku commented Oct 30, 2020

Since the failures are now hidden in the UI: earlier test runs (e.g. https://ci.appveyor.com/project/theupdateframework/tuf/builds/36033643/job/0pm05aeiihcn9nad for commit 9e53c4a) had errors like this:

======================================================================
ERROR: test_get_valid_targetinfo (test_updater.TestMultiRepoUpdater)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\projects\tuf\tests\test_updater.py", line 1875, in setUp
    server=self.SIMPLE_SERVER_PATH, popen_cwd=self.repository_directory2)
  File "C:\projects\tuf\tests\utils.py", line 165, in __init__
    raise e
TimeoutError: u'Failure during C:\\projects\\tuf\\tests\\simple_server.py startup!'

That's worrying and needs to be investigated.


MVrachev commented Nov 4, 2020

After the findings made by @jku and described in #1196,
I made changes to the "Delegate port generation for the tests to the OS" commit, following his idea of using a thread-safe Queue for process communication, and also had to squash a few other commits.

That's why it makes sense to close this PR and open another one that uses the new method.
Of course, I could reuse this PR, but given that I have rebased all of the commits and would have to change the PR message together with some of the commit messages, I think that would be more confusing.
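A rough sketch of that queue-based handoff (hypothetical stand-in code, not the follow-up PR's implementation): a reader thread forwards the port line printed by the child process onto a queue.Queue, and the parent waits on it with a timeout instead of polling a log file.

```python
import queue
import subprocess
import sys
import threading

# A stand-in server script that binds to port 0 and announces its port.
SERVER = """\
from http.server import HTTPServer, SimpleHTTPRequestHandler
httpd = HTTPServer(('localhost', 0), SimpleHTTPRequestHandler)
print(httpd.server_address[1], flush=True)
httpd.serve_forever()
"""

proc = subprocess.Popen([sys.executable, '-c', SERVER],
                        stdout=subprocess.PIPE, text=True)
port_queue = queue.Queue()
threading.Thread(target=lambda: port_queue.put(proc.stdout.readline()),
                 daemon=True).start()
try:
    # Blocks until the child reports its port; raises queue.Empty on
    # timeout instead of hanging forever on a dead or silent child.
    port = int(port_queue.get(timeout=10))
finally:
    proc.kill()
```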
