Flaky unit tests: linkcheck builder (tests/test_build_linkcheck.py) #11299
Comments
First, it seems that it only affects the Python 3.11 and Python 3.12 builds. One possibility that comes to mind is that we have a race condition or a deadlock somewhere. This may be due to the new implementation of the HTTP server with filelocks, but I am not entirely sure about it. In order to reproduce it, I think we need to run the Python 3.10, 3.11 and 3.12 builds sequentially, so that when the Python 3.11 build starts, the Python 3.10 build may not necessarily be "properly" finished (perhaps the timeout is too short for acquiring/releasing all locks properly?). Alternatively, I would suggest checking whether such failures occurred prior to a80e3fd.
Do you know where the test parallelism is configured/declared, @picnixz? (I was looking around for it, but haven't figured it out yet)
If you want to run it using CI, I think this should be in the CI workflow configuration. I am not even sure that the issue is due to deadlocks/race conditions. Maybe it's an issue that we can only see on GitHub, for instance. Maybe @marxin has some idea?
Ah, ok, understood (I think). So the current codebase doesn't specify parallelism for the tests, but it can be enabled locally, and doing that is what causes the localhost port-binding issue that #11294 addressed. Maybe an odd idea: could/should we use a separate port for each test that runs a local HTTP server?
Concerning #11294, you are right. The issue is that this might have induced some issues with the GitHub CI workflow, and it is extremely hard to reproduce (we don't know whether this is because of the underlying CI framework, because of pytest itself, or because of the (physical) machine that is running the CI checks). Using a separate port for each test would be a possibility, but this means that we need to know which ports are in use and which are not. There is no guarantee that when a test finishes, the port it was using is immediately free again. In particular, we would need as many ports as there are tests, but this 1) is prone to errors if we choose the ports incorrectly, 2) requires knowing which ports we can use, and 3) may conflict with local ports when we are testing. One possibility is to wait between the execution of two HTTP-related tests, but this would increase the duration of the tests (which already take quite a long time to run on GitHub). Another possibility is to first run all non-HTTP tests concurrently, as much as possible, and then run all HTTP-related tests sequentially; this may help isolate future network-related issues as well.
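For concreteness, here is a minimal sketch of the one-port-per-test idea, assuming the test server were built on the standard library's http.server: binding to port 0 lets the OS choose a free port, which sidesteps the bookkeeping problem at the cost of having to thread the chosen port into every URL under test. The handler and helper names below are illustrative, not the existing test harness.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class OkHandler(BaseHTTPRequestHandler):
    def do_HEAD(self):
        self.send_response(200)
        self.end_headers()

    do_GET = do_HEAD  # reply 200 with no body for GET as well

def start_test_server():
    # Port 0 asks the OS for any currently free port, so concurrently
    # running tests can never collide on a hard-coded port number.
    server = HTTPServer(("localhost", 0), OkHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    # The real port is only known after binding; callers read it here.
    return server, server.server_address[1]

server, port = start_test_server()
print(f"test server listening on http://localhost:{port}/")
server.shutdown()
```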
In GitHub Actions the runners are all independent; there's no shared state between them. I think the likeliest issue is simply that the runners have lower resources (2 cores, etc.) and hit the low timeout. I lowered the timeouts in 97f07ca to reduce the time taken by the linkcheck tests, and had to increase them once in 7d928b3 to fix instability; perhaps we need to increase them again.
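(For reference, the user-facing timeout knobs live in a project's conf.py; whether these are the same values the test suite tunes is an assumption, and the numbers below are purely illustrative.)

```python
# conf.py -- illustrative values only
linkcheck_timeout = 5    # seconds to wait for each request before giving up
linkcheck_retries = 2    # retry transient failures before reporting "broken"
```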
+1 on the observation about no shared state between the runners. It's a good goal to allow running tests with parallelism.
After investigating further, it doesn't look like […].
The file locking approach seems fine to me, and upping the timeout seems OK, but it would be nice to find the cause. Looking at the HTTP server logs in the case of the failure: [server log output not captured in this transcript]
So it seems like the HEAD request to […] is handled successfully, and the code under discussion is the retrieval logic in:
sphinx/sphinx/builders/linkcheck.py, lines 327 to 344 in 93272b8
(I'm pretty slow to figure this stuff out, and perhaps this is verbose. I'm hoping that writing out the investigation process may help one of us discover more about the cause)
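For readers following along, the region referenced above (linkcheck.py, lines 327 to 344) contains the HEAD-then-GET fallback. A simplified approximation of that pattern, with the Sphinx-specific configuration and anchor handling stripped out (so this is an illustration, not the actual implementation), looks like:

```python
import requests

def check_uri(uri: str, timeout: float = 30.0) -> str:
    try:
        # Try a cheap HEAD request first ...
        response = requests.head(uri, allow_redirects=True, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.HTTPError:
        # ... and fall back to GET for servers that reject or mishandle HEAD.
        response = requests.get(uri, stream=True, allow_redirects=True,
                                timeout=timeout)
        response.raise_for_status()
    return "working"
```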
Hehe, as usual, I haven't read carefully enough. The […]
So I guess that lends more support to the increase-the-timeout-threshold approach.
(determining the root cause, the py310-and-beyond nature of the flakiness, and the recent appearance of the flakiness are all on my mind too, as is the fact that I haven't been able to replicate this without introducing an artificial […])
I wonder whether adding a […] would help. Theory: perhaps some setup costs for the webserver itself are being delayed until after we begin the […].
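(One way to rule that theory in or out would be to poll the server's port before issuing the first request. A hypothetical readiness helper, not part of the test suite, might look like the following.)

```python
import socket
import time

def wait_until_listening(host: str, port: int, deadline: float = 5.0) -> None:
    """Block until a TCP connection to (host, port) succeeds, or raise."""
    end = time.monotonic() + deadline
    while True:
        try:
            with socket.create_connection((host, port), timeout=0.1):
                return
        except OSError:
            if time.monotonic() >= end:
                raise TimeoutError(f"{host}:{port} never became ready")
            time.sleep(0.05)
```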
Hrm. No, setup time seems less applicable given that both HEAD requests succeed, and that it's only a subsequent GET request that fails (after the webserver is up-and-running).
Moving that […]
An idea: navigation by difference: could there be clues in the differences between the […]? Another note: the redirect […]
Ok, yep - that was […]. So in fact this issue is not specific to […].
tl;dr - my vote would be to increase the timeout thresholds.
I ran a series of experimental evaluations in jayaddison/sphinx#1 to try to replicate the intermittent timeout flakiness (and to verify what I considered at one point to be a potential fix, but have since decided should not be relevant). The reason for testing in that pull request is partly that I couldn't easily find a way to replicate the behaviour locally / outside of GitHub Actions. I used a reduced (but otherwise unaltered) GitHub Actions workflow configuration in that branch to test the results.
In short: the issue only appeared twice out of 100 experimental evaluations (and both times, kinda annoyingly, were without the suggested fix in place; I was hoping to find a timeout failure despite the fix being merged, to provide confirmation that the suggested fix was invalid).
Recap: […]
I think I may have found a more robust fix: https://github.com/jayaddison/sphinx/pull/2/commits/d6f29be325d657afa7f5a041f8fac85f460fa01e and am testing this currently in jayaddison/sphinx#2. It seems slightly unpythonic, and I'm not completely sure whether the reasoning behind it is valid, so if anyone can see problems with it, please let me know.
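For anyone skimming the thread, the serialization pattern discussed above (a file lock held for the lifetime of each test's HTTP server, so only one test binds the shared port at a time) looks roughly like the sketch below. The lock-file path, fixture shape, and port number are assumptions for illustration; this is not the code in the linked commits.

```python
import contextlib
import threading
from http.server import HTTPServer

from filelock import FileLock  # third-party dependency mentioned earlier in the thread

LOCK_PATH = "sphinx-linkcheck-localhost.lock"  # assumed name, for illustration

@contextlib.contextmanager
def serialized_http_server(handler_class, port=7777):
    # Only one test at a time may hold the lock, so the fixed port is never
    # bound by two servers concurrently, even when tests run in parallel.
    with FileLock(LOCK_PATH):
        server = HTTPServer(("localhost", port), handler_class)
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        try:
            yield server
        finally:
            server.shutdown()
            thread.join()
```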
Describe the bug

The `test_linkcheck_allowed_redirects` test appears to be failing rarely and intermittently: a `broken` result is received instead of `working`.

Other tests within the same `linkcheck` builder test module also appear to be affected:

- `test_too_many_requests_retry_after_HTTP_date`: https://github.com/sphinx-doc/sphinx/actions/runs/4659078989/jobs/8245535037?pr=11312 (thanks, @picnixz)
- `test_raw_node`: https://github.com/jayaddison/sphinx/actions/runs/4697985339/jobs/8329708670?pr=1#step:10:2447

It's possible that they're slightly separate issues, but they are likely to be related (and can probably be fixed at the same time).
How to Reproduce

Run the relevant `main.yml` continuous integration `pytest` tests on a repeated basis. This is demonstrated by jayaddison/sphinx#1.
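Since the failures are rare, reproducing this locally generally means running the same tests many times in a row. A hypothetical helper script for that (distinct from the CI-based approach in jayaddison/sphinx#1) could be:

```python
# repeat_linkcheck_tests.py -- hypothetical helper for hammering the flaky tests
import subprocess
import sys

failures = 0
for attempt in range(1, 101):
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", "tests/test_build_linkcheck.py"]
    )
    if result.returncode != 0:
        failures += 1
        print(f"attempt {attempt}: FAILED")
print(f"{failures} failing run(s) out of 100")
```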
Environment Information

Sphinx extensions
Additional context
No response