Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test performance and reliability fixes #1096

Merged

Conversation

jku
Copy link
Member

@jku jku commented Aug 4, 2020

Improve reliability and speed of the tests: altogether a full test run is > 40% faster on my laptop and should be more reliable as well (there's still a possibility that e.g. the new wait_for_server() is not handling some exceptions it should but the potential issues should be resolvable over time -- unlike sleep() that is inherently unreliable).

Also add a missing import in download.py and close some unclosed handles that make it difficult to read test output

Fixes issue #1085


There's quite a bit of repetition in e.g. the simple-server using tests... I didn't try to refactor that at the same time but it might make sense to have a SimpleServerTestMixin that does the required initialization and cleanup.

Jussi Kukkonen added 2 commits August 3, 2020 19:06
If a test happens to use the same port as a previous test run
(either by bad luck or hardcoding like TestMultiRepoUpdater) that
happened within a minute, the second run will fail because TCP by
default keeps sockets open for a while.

Avoid this by explicitly saying re-use is fine in this case.

Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>
This seems to have only worked because someone else usually imports
formats.

This fixes e.g.
  python3 test_download.py \
     TestDownload.test_download_url_to_tempfileobj

Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>
@jku
Copy link
Member Author

jku commented Aug 4, 2020

Interested in seeing the CI results... Even if they're green we might want to have more CI runs to make sure there are no flaky tests: based on git archeology this bit of code has been very prone to those.

EDIT: obviously a bit of work left :|

@jku
Copy link
Member Author

jku commented Aug 4, 2020

MSG_DONTWAIT does not exist on windows apparently: will have to set the socket to non-blocking mode explicitly

@jku
Copy link
Member Author

jku commented Aug 4, 2020

The other issue is python2.7:

  • can't use with socket.socket(..) as sock:
  • BlockingIOError and ConnectionRefusedError are just OSError: would have to check errno to handle them properly

This avoids the need to sleep() before removing the temporary
directories used, and makes sure we don't get
    ResourceWarning: subprocess N is still running
messages. Use subprocess.communicate() instead of wait() if the process
has a pipe (currently the return values are just dropped though).

Practical results should be more reliability and a slightly reduced
test runtime.

Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>
@jku jku changed the title Test performance and reliability fixes WIP: Test performance and reliability fixes Aug 4, 2020
@jku jku force-pushed the test-perf-and-reliability-fixes branch from fa0e30c to d33b8c4 Compare August 4, 2020 13:24
@jku
Copy link
Member Author

jku commented Aug 4, 2020

Well the windows results imply that Windows is not fine with my hack of calling connect() multiple times... I guess the by-the-book method would be to select() after getting the first EINPROGRESS (I'm assuming that's received first on Windows, just guessing though)

@trishankatdatadog
Copy link
Member

@jku would you please kindly mark this as a draft PR until ready? Thanks!

@jku jku marked this pull request as draft August 4, 2020 20:36
@jku jku force-pushed the test-perf-and-reliability-fixes branch 3 times, most recently from b26c4a4 to f6bfaf3 Compare August 5, 2020 12:58
@jku
Copy link
Member Author

jku commented Aug 5, 2020

Okay, should be ready for review now.

My original idea of non-blocking connect() did not work on Windows at all: Latest version is using a blocking socket but luckily that's fast on Linux and not slow on Windows ("Connection Refused" can unfortunately take even 2 seconds on Windows because it does retries internally).

I've also bumped the timeout to 10 seconds since a) it does not increase the normal runtime at all and b) Travis actually managed to hit the 5 second mark once. I'm a little worried about Travis not managing to start simple_server.py in 5 seconds though... Doing some more CI runs might not be a bad idea: the full test runtime gives a rough indication of whether there were any really long waits.

@jku jku marked this pull request as ready for review August 5, 2020 13:26
@jku jku changed the title WIP: Test performance and reliability fixes Test performance and reliability fixes Aug 5, 2020
@joshuagl
Copy link
Member

joshuagl commented Aug 5, 2020

py35 on Travis CI failed the first time, which only indicates that the tests are still a little flakey.

@jku
Copy link
Member Author

jku commented Aug 5, 2020

py35 on Travis CI failed the first time, which only indicates that the tests are still a little flakey.

That was a fossa upload failure I think (but I don't know how to view that build anymore).

@jku
Copy link
Member Author

jku commented Aug 5, 2020

WRT test speed:

  • travis run goes from roughly 16min 30s to 12min 20s
  • appveyor (Windows) from roughly 29 mins to 27 mins

Copy link
Member

@joshuagl joshuagl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic PR, thanks @jku. Very happy to see sleep calls being removed!

I was expecting to see some import time lines deleted too, are we still using time in all of the tests that are no longer calling time.sleep?

tests/test_key_revocation_integration.py Outdated Show resolved Hide resolved
tests/test_proxy_use.py Outdated Show resolved Hide resolved
@jku
Copy link
Member Author

jku commented Aug 7, 2020

Thanks @joshuagl all relevant comments. I'll push these changes one by one to get a few more CI runs, will mark the comments resolved after everything is pushed

@jku jku force-pushed the test-perf-and-reliability-fixes branch 2 times, most recently from f3f03ee to c210883 Compare August 7, 2020 14:24
@jku
Copy link
Member Author

jku commented Aug 7, 2020

Last commit is a fixup for an early commit: I'll squash it once the 2nd review is there (or I can do it now, just let me know).

There are now 7 successive green CI runs for this branch: I feel confident in saying the reliability is on par or better than current develop branch.

@joshuagl
Copy link
Member

Excellent set of changes, thanks Jussi. The PR looks good to me, please rebase and squash/fixup and we'll be ready to approve.

Jussi Kukkonen added 4 commits August 10, 2020 16:11
* Add utility function to wait on a socket until it responds
* Use the function instead of sleeping in tests that need to wait for
  the server to start
* Increase the max timeout to 5 seconds by default (as appveyor builds
  still seem to hit the 3 second mark sometimes)

wait_for_server() functions quite differently depending on OS: Windows
can take 2 seconds to respond with ECONNREFUSED whereas Linux is almost
instant. There might be tricks to be faster on Windows (like setting
a shorter socket timeout) but this was not done here.

This makes a full Linux test run almost 40% faster and should be more
reproducible on every platform.

Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>
Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>
This silences "ResourceWarning: unclosed file"

Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>
Silences "ResourceWarning: unclosed file"

Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>
Jussi Kukkonen added 4 commits August 10, 2020 16:11
Travis managed to still timeout with 5 seconds: increasing the timeout
(now that it's not a sleep) doesn't really hurt normal non-VM use cases
so let's bump it to 10 seconds.

Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>
Our tests already expect localhost lookup to work to find test servers:
use it consistently instead of sometimes using 127.0.0.1

Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>
proxy_server.py is python2 only, don't try to setup the class as it
leads to a confusing TimeoutError.

This may prevent me from debugging this apparent test failure a third
time.

Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>
Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>
@jku jku force-pushed the test-perf-and-reliability-fixes branch from c210883 to b6661e0 Compare August 10, 2020 13:14
@jku
Copy link
Member Author

jku commented Aug 12, 2020

This PR is ready to go from my perspective, in case that was unclear.

Copy link
Member

@joshuagl joshuagl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these great improvements to the test reliability.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants