ListWatch stops reconnecting on connection errors #2385
Comments
I just noted that this issue has some similarities to #1496, although that issue has been marked as stale. I hope my example code helps to reproduce the issue.
Something I found while debugging: the `doneCallOnce` helper (lines 46 to 53 in 88184dc) looks related. Although the function is intended to propagate errors only once, it calls `controller.abort()` before the `doneCalled` flag is set, so the abort can surface a second error that re-enters the handler and calls `done()` again. Moving the flag assignment ahead of the abort avoids the double call:

```typescript
let doneCalled: boolean = false;
const doneCallOnce = (err: any) => {
    if (!doneCalled) {
        doneCalled = true;
        controller.abort();
        done(err);
    }
};
```

Not sure if I should be creating a separate issue for this?
Probably having a separate issue is worthwhile. It does seem like that early return is the issue. If you want to send two PRs to fix those two issues (and tests too) that'd be great.
Okay, will do!
Thanks! I'll definitely look into that!

I'd love to, but I have to admit that I have no experience with "nock" (which seems to be the main method of mocking HTTP requests in this project's tests), and I've been struggling to replicate the exact scenarios, even though I can reproduce them just fine with my minimal test code. The issue with `done` being called twice is particularly hard to reproduce in tests: the fix seems trivial, but writing a test that reproduces the exact failure is frustratingly hard; it seems to take a very specific combination of events that I can't replicate with "nock". I've tried destroying the streams, response objects, and all that, but the behavior is different enough that it doesn't trigger the error.

I'll try some different approaches, but I was wondering if you'd be opposed to including a minimal HTTP server in the test? The code below seems to be enough to reliably replicate the issue (with `done` being called twice) for me:

```typescript
import * as http from 'node:http';

const minimalServer = http.createServer((req, res) => {
    res.writeHead(200, {
        'Content-Type': 'application/json',
        'Transfer-Encoding': 'chunked',
    });
    res.flushHeaders();
    res.destroy(); // Prematurely close the connection
});
```

I'm open to ideas if you have some other suggestions!
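For illustration, here is a rough sketch of how that minimal server could be wired to the client's `Watch` in a test; the kubeconfig options, port handling, and logging are assumptions on my part, not existing test code:

```typescript
import * as http from 'node:http';
import type { AddressInfo } from 'node:net';
import { KubeConfig, Watch } from '@kubernetes/client-node';

// Same minimal server as above: send chunked headers, then drop the connection.
const server = http.createServer((req, res) => {
    res.writeHead(200, {
        'Content-Type': 'application/json',
        'Transfer-Encoding': 'chunked',
    });
    res.flushHeaders();
    res.destroy();
});

server.listen(0, async () => {
    const { port } = server.address() as AddressInfo;

    // Kubeconfig pointing at the local mock server (illustrative values).
    const kc = new KubeConfig();
    kc.loadFromOptions({
        clusters: [{ name: 'mock', server: `http://127.0.0.1:${port}`, skipTLSVerify: true }],
        users: [{ name: 'mock-user' }],
        contexts: [{ name: 'mock', cluster: 'mock', user: 'mock-user' }],
        currentContext: 'mock',
    });

    let doneCalls = 0;
    const watch = new Watch(kc);
    try {
        await watch.watch(
            '/api/v1/namespaces/default/pods',
            {},
            () => {}, // no events are expected from the mock server
            (err) => {
                doneCalls++;
                console.log(`done called ${doneCalls} time(s):`, err);
            },
        );
    } catch (err) {
        // Depending on timing, the premature close may surface here instead.
        console.log('watch() rejected:', err);
    }

    setTimeout(() => server.close(), 1000);
});
```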
Now that #2389 merged I think that we can remove the |
Describe the bug
The `ListWatch` class stops reconnecting when a non-410 error occurs during a watch session, due to an early return in the `doneHandler()` function.

Client Version
1.1.2

Server Version
v1.32.2-gke.1182003
To Reproduce
1. Use `ListWatch` to watch a namespace that doesn't have any pods (to get a stream that doesn't really send anything)
2. Don't set `timeoutSeconds` in the request (as that seems to cause a clean disconnect)
3. Wait for the connection to be dropped (which surfaces as `ERR_STREAM_PREMATURE_CLOSE`)

Note: I can easily reproduce this issue in our GKE cluster, but I can understand that the timeout here may be a bit tricky to reproduce in other environments.
See my script below for a mock server that simulates the issue on a local machine.
Expected behavior
I would expect the watch to attempt to reconnect on a connection error, not just 410 errors.
Example Code
I created a mock server to reproduce the issue; a sketch of the setup follows below. The server will … (although we'll not reach that point in this example).

Observe in the output that the watch stops and does not reconnect, with errors such as:
- `AbortError: The user aborted a request.`
- `Error: Premature close`
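A minimal sketch of such a mock-server setup wired to a `ListWatch`; the kubeconfig options, stubbed list function, and namespace path are assumptions, not the original script:

```typescript
import * as http from 'node:http';
import type { AddressInfo } from 'node:net';
import { KubeConfig, ListWatch, V1Pod, Watch } from '@kubernetes/client-node';

// Mock API server: accept any request, send chunked headers, then drop the connection.
const server = http.createServer((req, res) => {
    res.writeHead(200, {
        'Content-Type': 'application/json',
        'Transfer-Encoding': 'chunked',
    });
    res.flushHeaders();
    res.destroy();
});

server.listen(0, () => {
    const { port } = server.address() as AddressInfo;

    const kc = new KubeConfig();
    kc.loadFromOptions({
        clusters: [{ name: 'mock', server: `http://127.0.0.1:${port}`, skipTLSVerify: true }],
        users: [{ name: 'mock-user' }],
        contexts: [{ name: 'mock', cluster: 'mock', user: 'mock-user' }],
        currentContext: 'mock',
    });

    // Stub the list call so the informer goes straight to watching the mock server.
    const listFn = () => Promise.resolve({ items: [] as V1Pod[], metadata: {} });

    const watch = new Watch(kc);
    const informer = new ListWatch<V1Pod>('/api/v1/namespaces/default/pods', watch, listFn, true);

    informer.on('connect', () => console.log('watch (re)connected'));
    informer.on('error', (err) => console.error('watch error:', err));
});
```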
Alternatively, the original code I used to reproduce the issue runs against a real GKE cluster, watching a namespace that has no pods.
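For reference, a rough sketch of an equivalent real-cluster setup using the 1.x informer API (the namespace name and logging are assumptions, not the original script):

```typescript
import { CoreV1Api, KubeConfig, V1Pod, makeInformer } from '@kubernetes/client-node';

const kc = new KubeConfig();
kc.loadFromDefault();

// Assumption: any namespace without pods, so the watch stream stays quiet.
const namespace = 'empty-namespace';
const coreApi = kc.makeApiClient(CoreV1Api);
const listFn = () => coreApi.listNamespacedPod({ namespace });

const informer = makeInformer<V1Pod>(kc, `/api/v1/namespaces/${namespace}/pods`, listFn);

informer.on('connect', () => console.log(new Date().toISOString(), 'watch (re)connected'));
informer.on('error', (err) => console.error(new Date().toISOString(), 'watch error:', err));

// Leave this running until the connection is eventually dropped; with the early
// return in doneHandler() the informer never reconnects afterwards.
informer.start().catch((err) => console.error('failed to start informer:', err));
```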
Environment (please complete the following information):
Additional context
The likely root cause is the `doneHandler()` function in the `ListWatch` class (`src/cache.ts`, lines 143 to 147 at commit 073175f).
On line 147, the function currently returns early if the error is not a 410 error, which prevents the watch from reconnecting.
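To make the effect concrete, here is a simplified, standalone sketch of that control flow (hypothetical class and names, not the actual cache.ts source): only a 410 falls through to a restart, while any other error returns before the watch is re-established.

```typescript
// Simplified sketch of the control flow described above (hypothetical names,
// not the actual cache.ts implementation).
type DoneCallback = (err?: { statusCode?: number } | Error) => void;

class NaiveListWatch {
    private resourceVersion = '';

    constructor(private startWatch: (done: DoneCallback) => void) {}

    public start(): void {
        console.log('starting watch at resourceVersion', this.resourceVersion);
        this.startWatch((err) => this.doneHandler(err));
    }

    private doneHandler(err?: { statusCode?: number } | Error): void {
        if (err && (err as { statusCode?: number }).statusCode === 410) {
            // 410 Gone: reset resourceVersion and fall through to restart the watch.
            this.resourceVersion = '';
        } else if (err) {
            // Any other error (e.g. a dropped connection): report it and return early,
            // so the watch is never restarted.
            console.error('watch error:', err);
            return;
        }
        // Clean completion or 410: re-list and start a new watch.
        this.start();
    }
}
```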
Simply patching out this return statement kind of works, but it seems to cause two subsequent connection attempts (probably one for each error?). Since I’m not familiar enough with the internals of this project, I don't feel confident enough to open a PR at this time.