flaky unit test: test_build_linkcheck.test_connect_to_selfsigned_fails #12038

jayaddison · 2024-03-02T16:32:29Z

Describe the bug

The test_connect_to_selfsigned_fails test case in the test_build_linkcheck module appears to fail occasionally on Windows:

================================== FAILURES ===================================

app = <SphinxTestApp buildername='linkcheck'>

    @pytest.mark.sphinx('linkcheck', testroot='linkcheck-localserver-https', freshenv=True)
    def test_connect_to_selfsigned_fails(app):
        with https_server(OKHandler):
            app.build()
    
        with open(app.outdir / 'output.json', encoding='utf-8') as fp:
            content = json.load(fp)
        assert content["status"] == "broken"
        assert content["filename"] == "index.rst"
        assert content["lineno"] == 1
        assert content["uri"] == "https://localhost:7777/"
>       assert "[SSL: CERTIFICATE_VERIFY_FAILED]" in content["info"]
E       assert '[SSL: CERTIFICATE_VERIFY_FAILED]' in "HTTPSConnectionPool(host='localhost', port=7777): Read timed out. (read timeout=0.05)"

tests\test_builders\test_build_linkcheck.py:639: AssertionError

How to Reproduce

See https://github.com/sphinx-doc/sphinx/actions/runs/8080885896/job/22078226984 for an example failure during a pull request -- seemingly without any failure-related changes -- from common base commit 9a30ca7 of the master branch.

Environment Information

N/A

Sphinx extensions

N/A

Additional context

A timeout of 0.05s seems plenty to make a connection to a local HTTPS server that uses a self-signed certificate. However from prior experience of timeouts like this in similar linkcheck tests, there can be some amount of interleaving and conflict between the individual test cases.

The text was updated successfully, but these errors were encountered:

picnixz · 2024-03-02T18:28:07Z

Can you check whether this happened with previous commits? (might be hard to find it in the logs but I hope my fix for the intl flaky tests is not the culprit actually)

jayaddison · 2024-03-02T22:05:57Z

Hm, now that I check, I do see another occurrence of it two weeks ago here: https://github.com/sphinx-doc/sphinx/actions/runs/7902157054/job/21567184034

I'm 99.9% confident that it's nothing to do with your changes; I think that the flakiness bug you fixed was a much more common cause of test failures and is properly resolved.

jayaddison · 2024-03-07T00:33:23Z

The timeouts that are occuring here are reported by the requests library, and are a timeout while waiting to receive response bytes from the server.

There doesn't appear to be any systematic problem during creation of test in-process HTTP(S) servers -- generally that seems to work. For whatever reason, though, this particular test appears flaky. My guess is that it's because it introduces marginally more complexity than some other tests that would otherwise, in theory, be equally affected.

Given the timeout-detection logic, it may be possible to wait for socket-response connectivity before yielding the server thread for testing -- basically, to run a form of test-server healthcheck on the thread to ensure that it's up-and-running before allowing the test suite to communicate with it.

In practice however, that would reduce to introducing another level of timeout; we can't wait forever for that healthcheck, because then the test suite itself could become susceptible to indefinite delays.

So I think what we should do is simply increase the timeout for the affected test. If we see it occur elsewhere perhaps we can increase the timeouts there too -- but I would prefer to localize the changes so that we can more easily determine whether there is indeed any common root cause or whether this is happenstance.

edit: add test server setup source code hyperlink

jayaddison · 2024-03-07T01:02:54Z

In practice however, that would reduce to introducing another level of timeout; we can't wait forever for that healthcheck, because then the test suite itself could become susceptible to indefinite delays.

It seems that simply increasing the timeout for the affected test does not work reliably; other tests are indeed affected too (see test failures in #12050 for commit 396b8df for example).

So perhaps it would be worthwhile to add a healthcheck step, despite the duplication of timeouts. A larger timeout could be configured for the test HTTP(S) server healthcheck/setup phase, with a smaller timeout configured for the Sphinx linkcheck build test cases.

picnixz · 2024-03-07T01:05:29Z

Would it work if we use different ports and not always the 7777? We coule rotate the ports being used from tests to tests.

jayaddison · 2024-03-07T01:08:48Z

Would it work if we use different ports and not always the 7777? We coule rotate the ports being used from tests to tests.

I think that would only make a difference if the operating system waits/retries when binding to that specific port -- my understanding is that most operating systems do not do that, and simply return an error code if the port is already bound (in use). Since we don't see that error occur, the timeout appears to be unrelated.

jayaddison · 2024-03-07T01:25:17Z

(I do think we might want to consider port rotation when parallelizing the tests.. but that's a different story :))

jayaddison · 2024-03-12T15:48:17Z

This test flakiness has reappeared; reopening this issue.

jayaddison · 2024-03-13T10:18:40Z

What would be useful here is a way to reliably make the test failure occur. So far I've found it difficult to determine whether an attempted fix like the one in #12050 completely solved the problem -- it hasn't -- due to the infrequent occurence of the flakiness.

I'm currently out of ideas about how to proceed on this.

jayaddison · 2024-03-13T20:09:43Z

I wonder whether python/cpython#85307 could be related to this.

picnixz · 2024-03-14T08:05:58Z

To make the test fail, create a Windows test and use a parametrize marker to generate a lot of test cases. Then, only run this specific test and check the logs (you can even put prints everywhere and play with the CI/CD).

jayaddison · 2024-03-18T14:25:11Z

I think there's a chance that #12126 may resolve this, despite not being expressly intended to.

jayaddison · 2024-03-23T17:17:16Z

I think there's a chance that #12126 may resolve this, despite not being expressly intended to.

We have not yet merged that pull request (twenty-six). Meanwhile, I have not observed flaky test results due to linkcheck timeouts since #12166 (sixty-six - a workaround: increasing the timeouts) was merged.

jayaddison · 2024-04-13T14:50:04Z

This test flakiness has not re-occurred within the past two weeks, and I believe that the test is now reliable.

In theory I think that we could now safely revert #12166 and reduce the timeout values again - because my sense is that #12126 (isolating the test HTTP server ports), merged afterwards, would also have independently resolved the problem. However, I don't want to risk re-introducing test flakiness around release-time, so I think any further changes to the timeouts should wait.

jayaddison added type:bug type:tests labels Mar 2, 2024

This was referenced Mar 3, 2024

linkcheck builder: add 'linkcheck_report_timeouts_as_broken' config option #12023

Merged

[autodoc] add support for python_display_short_literal #12003

Open

jayaddison added the builder:linkcheck label Mar 7, 2024

jayaddison mentioned this issue Mar 7, 2024

tests: implement readiness check before yielding local-http(s) test servers #12050

Merged

picnixz closed this as completed in #12050 Mar 8, 2024

jayaddison reopened this Mar 12, 2024

jayaddison added the help wanted label Mar 12, 2024

This was referenced Mar 13, 2024

Draft: tests: prevent HTTP server socket re-use on Windows during test suite. #12077

Closed

Revert "tests: implement readiness check before yielding local-http(s) test servers" #12078

Merged

jayaddison mentioned this issue Mar 23, 2024

[tracker] [tests] Python test suite reliability: striving for non-flaky, parallelizable, random-orderable tests. #12191

Open

11 tasks

jayaddison removed the help wanted label Mar 24, 2024

jayaddison closed this as completed Apr 13, 2024

github-actions bot locked as resolved and limited conversation to collaborators May 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

flaky unit test: test_build_linkcheck.test_connect_to_selfsigned_fails #12038

flaky unit test: test_build_linkcheck.test_connect_to_selfsigned_fails #12038

jayaddison commented Mar 2, 2024 •

edited

Loading

picnixz commented Mar 2, 2024

jayaddison commented Mar 2, 2024 •

edited

Loading

jayaddison commented Mar 7, 2024 •

edited

Loading

jayaddison commented Mar 7, 2024

picnixz commented Mar 7, 2024 •

edited

Loading

jayaddison commented Mar 7, 2024

jayaddison commented Mar 7, 2024

jayaddison commented Mar 12, 2024

jayaddison commented Mar 13, 2024

jayaddison commented Mar 13, 2024

picnixz commented Mar 14, 2024

jayaddison commented Mar 18, 2024

jayaddison commented Mar 23, 2024

jayaddison commented Apr 13, 2024

flaky unit test: test_build_linkcheck.test_connect_to_selfsigned_fails #12038

flaky unit test: test_build_linkcheck.test_connect_to_selfsigned_fails #12038

Comments

jayaddison commented Mar 2, 2024 • edited Loading

Describe the bug

How to Reproduce

Environment Information

Sphinx extensions

Additional context

picnixz commented Mar 2, 2024

jayaddison commented Mar 2, 2024 • edited Loading

jayaddison commented Mar 7, 2024 • edited Loading

jayaddison commented Mar 7, 2024

picnixz commented Mar 7, 2024 • edited Loading

jayaddison commented Mar 7, 2024

jayaddison commented Mar 7, 2024

jayaddison commented Mar 12, 2024

jayaddison commented Mar 13, 2024

jayaddison commented Mar 13, 2024

picnixz commented Mar 14, 2024

jayaddison commented Mar 18, 2024

jayaddison commented Mar 23, 2024

jayaddison commented Apr 13, 2024

jayaddison commented Mar 2, 2024 •

edited

Loading

jayaddison commented Mar 2, 2024 •

edited

Loading

jayaddison commented Mar 7, 2024 •

edited

Loading

picnixz commented Mar 7, 2024 •

edited

Loading