Test performance and reliability fixes #1096

jku · 2020-08-04T06:54:31Z

Improve reliability and speed of the tests: altogether a full test run is > 40% faster on my laptop and should be more reliable as well (there's still a possibility that e.g. the new wait_for_server() is not handling some exceptions it should but the potential issues should be resolvable over time -- unlike sleep() that is inherently unreliable).

Also add a missing import in download.py and close some unclosed handles that make it difficult to read test output

Fixes issue #1085

There's quite a bit of repetition in e.g. the simple-server using tests... I didn't try to refactor that at the same time but it might make sense to have a SimpleServerTestMixin that does the required initialization and cleanup.

If a test happens to use the same port as a previous test run (either by bad luck or hardcoding like TestMultiRepoUpdater) that happened within a minute, the second run will fail because TCP by default keeps sockets open for a while. Avoid this by explicitly saying re-use is fine in this case. Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

This seems to have only worked because someone else usually imports formats. This fixes e.g. python3 test_download.py \ TestDownload.test_download_url_to_tempfileobj Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

jku · 2020-08-04T06:58:42Z

Interested in seeing the CI results... Even if they're green we might want to have more CI runs to make sure there are no flaky tests: based on git archeology this bit of code has been very prone to those.

EDIT: obviously a bit of work left :|

jku · 2020-08-04T07:14:56Z

MSG_DONTWAIT does not exist on windows apparently: will have to set the socket to non-blocking mode explicitly

jku · 2020-08-04T08:14:29Z

The other issue is python2.7:

can't use with socket.socket(..) as sock:
BlockingIOError and ConnectionRefusedError are just OSError: would have to check errno to handle them properly

This avoids the need to sleep() before removing the temporary directories used, and makes sure we don't get ResourceWarning: subprocess N is still running messages. Use subprocess.communicate() instead of wait() if the process has a pipe (currently the return values are just dropped though). Practical results should be more reliability and a slightly reduced test runtime. Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

jku · 2020-08-04T19:45:10Z

Well the windows results imply that Windows is not fine with my hack of calling connect() multiple times... I guess the by-the-book method would be to select() after getting the first EINPROGRESS (I'm assuming that's received first on Windows, just guessing though)

trishankatdatadog · 2020-08-04T20:25:18Z

@jku would you please kindly mark this as a draft PR until ready? Thanks!

jku · 2020-08-05T13:25:05Z

Okay, should be ready for review now.

My original idea of non-blocking connect() did not work on Windows at all: Latest version is using a blocking socket but luckily that's fast on Linux and not slow on Windows ("Connection Refused" can unfortunately take even 2 seconds on Windows because it does retries internally).

I've also bumped the timeout to 10 seconds since a) it does not increase the normal runtime at all and b) Travis actually managed to hit the 5 second mark once. I'm a little worried about Travis not managing to start simple_server.py in 5 seconds though... Doing some more CI runs might not be a bad idea: the full test runtime gives a rough indication of whether there were any really long waits.

joshuagl · 2020-08-05T13:44:03Z

py35 on Travis CI failed the first time, which only indicates that the tests are still a little flakey.

jku · 2020-08-05T13:54:33Z

py35 on Travis CI failed the first time, which only indicates that the tests are still a little flakey.

That was a fossa upload failure I think (but I don't know how to view that build anymore).

jku · 2020-08-05T13:59:57Z

WRT test speed:

travis run goes from roughly 16min 30s to 12min 20s
appveyor (Windows) from roughly 29 mins to 27 mins

joshuagl

Fantastic PR, thanks @jku. Very happy to see sleep calls being removed!

I was expecting to see some import time lines deleted too, are we still using time in all of the tests that are no longer calling time.sleep?

tests/test_key_revocation_integration.py

tests/test_proxy_use.py

jku · 2020-08-07T12:08:27Z

Thanks @joshuagl all relevant comments. I'll push these changes one by one to get a few more CI runs, will mark the comments resolved after everything is pushed

jku · 2020-08-07T15:39:50Z

Last commit is a fixup for an early commit: I'll squash it once the 2nd review is there (or I can do it now, just let me know).

There are now 7 successive green CI runs for this branch: I feel confident in saying the reliability is on par or better than current develop branch.

joshuagl · 2020-08-10T10:44:58Z

Excellent set of changes, thanks Jussi. The PR looks good to me, please rebase and squash/fixup and we'll be ready to approve.

* Add utility function to wait on a socket until it responds * Use the function instead of sleeping in tests that need to wait for the server to start * Increase the max timeout to 5 seconds by default (as appveyor builds still seem to hit the 3 second mark sometimes) wait_for_server() functions quite differently depending on OS: Windows can take 2 seconds to respond with ECONNREFUSED whereas Linux is almost instant. There might be tricks to be faster on Windows (like setting a shorter socket timeout) but this was not done here. This makes a full Linux test run almost 40% faster and should be more reproducible on every platform. Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

This silences "ResourceWarning: unclosed file" Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

Silences "ResourceWarning: unclosed file" Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

Travis managed to still timeout with 5 seconds: increasing the timeout (now that it's not a sleep) doesn't really hurt normal non-VM use cases so let's bump it to 10 seconds. Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

Our tests already expect localhost lookup to work to find test servers: use it consistently instead of sometimes using 127.0.0.1 Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

proxy_server.py is python2 only, don't try to setup the class as it leads to a confusing TimeoutError. This may prevent me from debugging this apparent test failure a third time. Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

jku · 2020-08-12T08:01:35Z

This PR is ready to go from my perspective, in case that was unclear.

joshuagl

Thanks for these great improvements to the test reliability.

Jussi Kukkonen added 2 commits August 3, 2020 19:06

download: import tuf.formats before using it

dc9c3b2

This seems to have only worked because someone else usually imports formats. This fixes e.g. python3 test_download.py \ TestDownload.test_download_url_to_tempfileobj Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

jku changed the title ~~Test performance and reliability fixes~~ WIP: Test performance and reliability fixes Aug 4, 2020

jku force-pushed the test-perf-and-reliability-fixes branch from fa0e30c to d33b8c4 Compare August 4, 2020 13:24

jku marked this pull request as draft August 4, 2020 20:36

jku force-pushed the test-perf-and-reliability-fixes branch 3 times, most recently from b26c4a4 to f6bfaf3 Compare August 5, 2020 12:58

jku marked this pull request as ready for review August 5, 2020 13:26

jku changed the title ~~WIP: Test performance and reliability fixes~~ Test performance and reliability fixes Aug 5, 2020

joshuagl reviewed Aug 7, 2020

View reviewed changes

tests/test_key_revocation_integration.py Outdated Show resolved Hide resolved

tests/test_proxy_use.py Outdated Show resolved Hide resolved

jku force-pushed the test-perf-and-reliability-fixes branch 2 times, most recently from f3f03ee to c210883 Compare August 7, 2020 14:24

Jussi Kukkonen added 4 commits August 10, 2020 16:11

test_download: Use logger for output

f468007

Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

tests: Remember to close file objects

0dfa170

This silences "ResourceWarning: unclosed file" Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

log: Close the file handler when it's removed

6b1011d

Silences "ResourceWarning: unclosed file" Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

Jussi Kukkonen added 4 commits August 10, 2020 16:11

tests: increase wait_for_server() default timeout

a22f618

Travis managed to still timeout with 5 seconds: increasing the timeout (now that it's not a sleep) doesn't really hurt normal non-VM use cases so let's bump it to 10 seconds. Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

tests: Use localhost consistently

fc44652

Our tests already expect localhost lookup to work to find test servers: use it consistently instead of sometimes using 127.0.0.1 Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

test_proxy_use: Error out early if using Python3

c7425be

proxy_server.py is python2 only, don't try to setup the class as it leads to a confusing TimeoutError. This may prevent me from debugging this apparent test failure a third time. Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

tests: Remove unused imports

b6661e0

Signed-off-by: Jussi Kukkonen <jkukkonen@vmware.com>

jku force-pushed the test-perf-and-reliability-fixes branch from c210883 to b6661e0 Compare August 10, 2020 13:14

joshuagl approved these changes Aug 17, 2020

View reviewed changes

joshuagl merged commit 8577039 into theupdateframework:develop Aug 17, 2020

MVrachev mentioned this pull request Sep 25, 2020

libeatmydata for running tests and installing dependencies under Travis CI #596

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test performance and reliability fixes #1096

Test performance and reliability fixes #1096

jku commented Aug 4, 2020 •

edited

Loading

jku commented Aug 4, 2020 •

edited

Loading

jku commented Aug 4, 2020

jku commented Aug 4, 2020 •

edited

Loading

jku commented Aug 4, 2020

trishankatdatadog commented Aug 4, 2020

jku commented Aug 5, 2020 •

edited

Loading

joshuagl commented Aug 5, 2020

jku commented Aug 5, 2020

jku commented Aug 5, 2020

joshuagl left a comment

jku commented Aug 7, 2020

jku commented Aug 7, 2020 •

edited

Loading

joshuagl commented Aug 10, 2020

jku commented Aug 12, 2020

joshuagl left a comment

Test performance and reliability fixes #1096

Test performance and reliability fixes #1096

Conversation

jku commented Aug 4, 2020 • edited Loading

jku commented Aug 4, 2020 • edited Loading

jku commented Aug 4, 2020

jku commented Aug 4, 2020 • edited Loading

jku commented Aug 4, 2020

trishankatdatadog commented Aug 4, 2020

jku commented Aug 5, 2020 • edited Loading

joshuagl commented Aug 5, 2020

jku commented Aug 5, 2020

jku commented Aug 5, 2020

joshuagl left a comment

Choose a reason for hiding this comment

jku commented Aug 7, 2020

jku commented Aug 7, 2020 • edited Loading

joshuagl commented Aug 10, 2020

jku commented Aug 12, 2020

joshuagl left a comment

Choose a reason for hiding this comment

jku commented Aug 4, 2020 •

edited

Loading

jku commented Aug 4, 2020 •

edited

Loading

jku commented Aug 4, 2020 •

edited

Loading

jku commented Aug 5, 2020 •

edited

Loading

jku commented Aug 7, 2020 •

edited

Loading