Flaky unit tests: linkcheck builder (tests/test_build_linkcheck.py) #11299
Comments
First, it seems that it only affects the Python 3.11 and Python 3.12 builds. One possibility that comes to mind is that we have a race condition or a deadlock somewhere. This may be due to the new implementation of the HTTP server with filelocks, but I am not entirely sure about it. In order to reproduce it, I think we need to run the Python 3.10, 3.11 and 3.12 builds sequentially, so that when the Python 3.11 build starts, the Python 3.10 build may not necessarily be "properly" finished (perhaps the timeout is too short for acquiring/releasing all locks properly?). Alternatively, I would suggest checking whether such failures occurred prior to a80e3fd.
Do you know where the test parallelism is configured/declared, @picnixz? (I was looking around for it, but haven't figured it out yet)
If you want to run it using CI, I think this should be in the CI workflow configuration. I am not even sure that the issue is due to deadlocks/race conditions. Maybe it's an issue that we can only see on GitHub, for instance. Maybe @marxin has some idea?
Ah, ok, understood (I think). So the current codebase doesn't specify parallelism for the tests, but it can be enabled locally, and doing that is what causes the localhost port-binding issue that #11294 addressed. Maybe an odd idea: could/should we use a separate port for each test that runs a local HTTP server?
Concerning #11294, you are right. The issue is that this might have induced some issues with the GitHub CI workflow, and it is extremely hard to reproduce (we don't know whether this is because of the underlying CI framework, because of pytest itself, or because of the (physical) machine that is running the CI checks). Using a separate port for each test would be a possibility, but this means that we need to know which ports are in use and which are not. There is no guarantee that when a test finishes, the port it was using is immediately free again. In particular, we would need as many ports as there are tests, but this 1) is prone to errors if we choose the ports incorrectly, 2) requires knowing which ports we can use, and 3) may conflict with local ports when we are testing. One possibility is to wait between the execution of two HTTP-related tests, but this would increase the duration of the tests (which already take quite a long time to run on GitHub). Another possibility is to first run all non-HTTP tests concurrently, as much as possible, and then run all HTTP-related tests sequentially; this may help isolate future network-related issues as well.
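For concreteness, here is a minimal sketch of the one-port-per-test idea, assuming the test server were built on the standard library's http.server: binding to port 0 lets the OS choose a free port, which sidesteps the bookkeeping problem at the cost of having to thread the chosen port into every URL under test. The handler and helper names below are illustrative, not the existing test harness.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class OkHandler(BaseHTTPRequestHandler):
    def do_HEAD(self):
        self.send_response(200)
        self.end_headers()

    do_GET = do_HEAD  # reply 200 with no body for GET as well

def start_test_server():
    # Port 0 asks the OS for any currently free port, so concurrently
    # running tests can never collide on a hard-coded port number.
    server = HTTPServer(("localhost", 0), OkHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    # The real port is only known after binding; callers read it here.
    return server, server.server_address[1]

server, port = start_test_server()
print(f"test server listening on http://localhost:{port}/")
server.shutdown()
```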
In GitHub Actions the runners are all independent; there's no shared state between them. I think the likeliest issue is simply that the runners have lower resources (2 cores, etc.) and hit the low timeout. I lowered the timeouts in 97f07ca to reduce the time taken by the linkcheck tests, and had to increase them once in 7d928b3 to fix instability; perhaps we need to increase them again.
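(For reference, the user-facing timeout knobs live in a project's conf.py; whether these are the same values the test suite tunes is an assumption, and the numbers below are purely illustrative.)

```python
# conf.py -- illustrative values only
linkcheck_timeout = 5    # seconds to wait for each request before giving up
linkcheck_retries = 2    # retry transient failures before reporting "broken"
```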
+1 on the observation about no shared state between the runners. It's a good goal to allow running tests with parallelism.
After investigating further, it doesn't look like […].
The file locking approach seems fine to me, and upping the timeout seems OK, but it would be nice to find the cause. Looking at the HTTP server logs in the case of the failure: [server log output not captured in this transcript]
So it seems like the HEAD request to […] is handled successfully, and the code under discussion is the retrieval logic in:
sphinx/sphinx/builders/linkcheck.py, lines 327 to 344 in 93272b8
(I'm pretty slow to figure this stuff out, and perhaps this is verbose. I'm hoping that writing out the investigation process may help one of us discover more about the cause)
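For readers following along, the region referenced above (linkcheck.py, lines 327 to 344) contains the HEAD-then-GET fallback. A simplified approximation of that pattern, with the Sphinx-specific configuration and anchor handling stripped out (so this is an illustration, not the actual implementation), looks like:

```python
import requests

def check_uri(uri: str, timeout: float = 30.0) -> str:
    try:
        # Try a cheap HEAD request first ...
        response = requests.head(uri, allow_redirects=True, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.HTTPError:
        # ... and fall back to GET for servers that reject or mishandle HEAD.
        response = requests.get(uri, stream=True, allow_redirects=True,
                                timeout=timeout)
        response.raise_for_status()
    return "working"
```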
Hehe, as usual, I haven't read carefully enough. The […]
So I guess that lends more support to the increase-the-timeout-threshold approach.
(determining the root cause, the py310-and-beyond nature of the flakiness, and the recent appearance of the flakiness are all on my mind too, as is the fact that I haven't been able to replicate this without introducing an artificial […])
I wonder whether adding a […] would help. Theory: perhaps some setup costs for the webserver itself are being delayed until after we begin the […].
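(One way to rule that theory in or out would be to poll the server's port before issuing the first request. A hypothetical readiness helper, not part of the test suite, might look like the following.)

```python
import socket
import time

def wait_until_listening(host: str, port: int, deadline: float = 5.0) -> None:
    """Block until a TCP connection to (host, port) succeeds, or raise."""
    end = time.monotonic() + deadline
    while True:
        try:
            with socket.create_connection((host, port), timeout=0.1):
                return
        except OSError:
            if time.monotonic() >= end:
                raise TimeoutError(f"{host}:{port} never became ready")
            time.sleep(0.05)
```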
Hrm. No, setup time seems less applicable given that both HEAD requests succeed, and that it's only a subsequent GET request that fails (after the webserver is up-and-running).
Moving that […]
An idea: navigation by difference: could there be clues in the differences between the […]? Another note: the redirect […]
Ok, yep - that was […]. So in fact this issue is not specific to […].
tl;dr - my vote would be to increase the timeout thresholds.
I ran a series of experimental evaluations in jayaddison/sphinx#1 to try to replicate the intermittent timeout flakiness (and to verify what I considered at one point to be a potential fix, but have since decided should not be relevant). The reason for testing in that pull request is partly that I couldn't easily find a way to replicate the behaviour locally / outside of GitHub Actions. I used a reduced (but otherwise unaltered) GitHub Actions workflow configuration in that branch to test the results.
In short: the issue only appeared twice out of 100 experimental evaluations (and both times, kinda annoyingly, were without the suggested fix in place; I was hoping to find a timeout failure despite the fix being merged, to provide confirmation that the suggested fix was invalid).
Recap: […]
I think I may have found a more robust fix: https://github.com/jayaddison/sphinx/pull/2/commits/d6f29be325d657afa7f5a041f8fac85f460fa01e and am testing this currently in jayaddison/sphinx#2. It seems slightly unpythonic, and I'm not completely sure whether the reasoning behind it is valid, so if anyone can see problems with it, please let me know.
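For anyone skimming the thread, the serialization pattern discussed above (a file lock held for the lifetime of each test's HTTP server, so only one test binds the shared port at a time) looks roughly like the sketch below. The lock-file path, fixture shape, and port number are assumptions for illustration; this is not the code in the linked commits.

```python
import contextlib
import threading
from http.server import HTTPServer

from filelock import FileLock  # third-party dependency mentioned earlier in the thread

LOCK_PATH = "sphinx-linkcheck-localhost.lock"  # assumed name, for illustration

@contextlib.contextmanager
def serialized_http_server(handler_class, port=7777):
    # Only one test at a time may hold the lock, so the fixed port is never
    # bound by two servers concurrently, even when tests run in parallel.
    with FileLock(LOCK_PATH):
        server = HTTPServer(("localhost", port), handler_class)
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        try:
            yield server
        finally:
            server.shutdown()
            thread.join()
```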
Describe the bug

The `test_linkcheck_allowed_redirects` test appears to be failing rarely and intermittently: a `broken` result is received instead of `working`.

Other tests within the same `linkcheck` builder test module also appear to be affected:

- `test_too_many_requests_retry_after_HTTP_date`: https://github.com/sphinx-doc/sphinx/actions/runs/4659078989/jobs/8245535037?pr=11312 (thanks, @picnixz)
- `test_raw_node`: https://github.com/jayaddison/sphinx/actions/runs/4697985339/jobs/8329708670?pr=1#step:10:2447

It's possible that they're slightly separate issues, but they are likely to be related (and can probably be fixed at the same time).
How to Reproduce

Run the relevant `main.yml` continuous integration `pytest` tests on a repeated basis. This is demonstrated by jayaddison/sphinx#1.
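Since the failures are rare, reproducing this locally generally means running the same tests many times in a row. A hypothetical helper script for that (distinct from the CI-based approach in jayaddison/sphinx#1) could be:

```python
# repeat_linkcheck_tests.py -- hypothetical helper for hammering the flaky tests
import subprocess
import sys

failures = 0
for attempt in range(1, 101):
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", "tests/test_build_linkcheck.py"]
    )
    if result.returncode != 0:
        failures += 1
        print(f"attempt {attempt}: FAILED")
print(f"{failures} failing run(s) out of 100")
```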
Environment Information

Sphinx extensions
Additional context
No response