ListWatch stops reconnecting on connection errors #2385

Open

bverhoeven opened this issue Apr 22, 2025 · 5 comments

Comments

@bverhoeven
Contributor

bverhoeven commented Apr 22, 2025

Describe the bug

The ListWatch class stops reconnecting when a non-410 error occurs during a watch session, due to an early return in the doneHandler() function.

Client Version
1.1.2

Server Version
v1.32.2-gke.1182003

To Reproduce

  1. Use ListWatch to watch a namespace that doesn't have any pods (so the watch stream doesn't actually send anything)
  2. Do not specify a timeoutSeconds in the request (as that seems to cause a clean disconnect)
  3. Observe that the "connect" event is emitted.
  4. After 5 minutes or upon a connection error, the watch ends with a set of errors:
    • AbortError: The user aborted a request.
    • Error: Premature close (code ERR_STREAM_PREMATURE_CLOSE)
  5. No further attempts to reconnect are made (no "connect" events are emitted)

Note: I can easily reproduce this issue in our GKE cluster, but I can understand that the timeout here may be a bit tricky to reproduce in other environments.

See my script below for a mock server that simulates the issue on a local machine.

Expected behavior

I would expect the watch to attempt to reconnect on a connection error, not just 410 errors.

Example Code

I created a mock server to reproduce the issue. The server will:

  • respond with an initial list of pods to non-watch requests
  • cleanly close the connection for the first watch request
  • prematurely close the connection for the second watch request
  • cleanly close the connection for the third and subsequent watch requests
    (although we won't reach that point in this example)

Observe that:

  • When the request is cleanly closed, the watch reconnects and the "connect" event is emitted.
  • When the request is closed prematurely, the watch does not reconnect and the "connect" event is not emitted; instead, two errors are reported:
    • AbortError: The user aborted a request.
    • Error: Premature close

import http from 'http'
import { CoreV1Api, KubeConfig, ListWatch, Watch } from '@kubernetes/client-node'

const MOCK_PORT = 8333
const MOCK_URL = `http://localhost:${MOCK_PORT}`

let requestCount = 0

const server = http.createServer((req, res) => {
    res.writeHead(200, {
        'Content-Type': 'application/json',
        'Transfer-Encoding': 'chunked',
    })
    res.flushHeaders()

    if (req.url?.includes('watch=true')) {
        if (requestCount++ == 1) {
            console.log('[mock-server] watch: Closing connection prematurely')
            res.destroy()
        } else {
            console.log('[mock-server] watch: Closing connection cleanly.')
            res.end()
        }
    } else {
        console.log('[mock-server] Responding with initial list')
        res.end(
            JSON.stringify({
                kind: 'PodList',
                apiVersion: 'v1',
                metadata: {},
                items: [],
            })
        )
    }
})

async function setupWatcher() {
    const kubeConfig = new KubeConfig()

    kubeConfig.loadFromOptions({
        clusters: [{ name: 'mock-cluster', server: MOCK_URL, skipTLSVerify: true }],
        users: [{ name: 'mock-user' }],
        contexts: [{ name: 'mock-context', user: 'mock-user', cluster: 'mock-cluster' }],
        currentContext: 'mock-context',
    })

    const api = kubeConfig.makeApiClient(CoreV1Api)

    const watch = new Watch(kubeConfig)
    const watcher = new ListWatch(
        `/api/v1/namespaces/default/pods`,
        watch,
        () => api.listNamespacedPod({ namespace: 'default' }),
        false
    )

    watcher.on('connect', () => console.log('[watcher]     Connecting...'))
    watcher.on('error', (err) => console.error('[watcher]     Error', err))
    watcher.start()
}

server.listen(MOCK_PORT, () => {
    console.log(`[mock-server] Listening on port ${MOCK_PORT}`)
    setupWatcher().catch(console.error)
})

Output:

[mock-server] Listening on port 8333
[watcher]     Connecting...
[mock-server] Responding with initial list
[mock-server] watch: Closing connection cleanly.
[watcher]     Connecting...
[mock-server] Responding with initial list
[mock-server] watch: Closing connection prematurely
[watcher]     Error AbortError: The user aborted a request.
    at abort (/home/bas/projects/watch-repro/node_modules/node-fetch/lib/index.js:1458:16)
    at AbortSignal.abortAndFinalize (/home/bas/projects/watch-repro/node_modules/node-fetch/lib/index.js:1473:4)
    at [nodejs.internal.kHybridDispatch] (node:internal/event_target:827:20)
    at AbortSignal.dispatchEvent (node:internal/event_target:762:26)
    at runAbort (node:internal/abort_controller:449:10)
    at abortSignal (node:internal/abort_controller:435:3)
    at AbortController.abort (node:internal/abort_controller:468:5)
    at PassThrough.doneCallOnce (file:///home/bas/projects/watch-repro/node_modules/@kubernetes/client-node/dist/watch.js:33:28)
    at PassThrough.emit (node:events:519:35)
    at emitErrorNT (node:internal/streams/destroy:170:8) {
  type: 'aborted'
}
[watcher]     Error Error: Premature close
    at IncomingMessage.<anonymous> (/home/bas/projects/watch-repro/node_modules/node-fetch/lib/index.js:1748:18)
    at Object.onceWrapper (node:events:621:28)
    at IncomingMessage.emit (node:events:507:28)
    at emitCloseNT (node:internal/streams/destroy:148:10)
    at process.processTicksAndRejections (node:internal/process/task_queues:89:21) {
  code: 'ERR_STREAM_PREMATURE_CLOSE'
}

Alternatively, here's the original code I used to reproduce the issue, which uses a real GKE cluster:

import { CoreV1Api, KubeConfig, ListWatch, Watch } from "@kubernetes/client-node"

async function main() {
    const kubeConfig = new KubeConfig()
    kubeConfig.loadFromDefault()

    const api = kubeConfig.makeApiClient(CoreV1Api)
    const path = `/api/v1/namespaces/default/pods`
    const watch = new Watch(kubeConfig)

    const watcher = new ListWatch(
        path,
        watch,
        () => api.listNamespacedPod({ namespace: 'default' }),
        false,
        undefined
    )

    watcher.on('connect', () => console.log('connect'))
    watcher.on('change', () => console.log('change'))
    watcher.on('error', (err) => console.log('error:', err))
    watcher.start()

    setInterval(() => console.log('tick (30s)'), 30_000)
}

main().catch(console.error)

Environment (please complete the following information):

  • OS: Linux
  • Node.js version 23.11.0
  • Cloud runtime GCP

Additional context

The likely root cause is the doneHandler() function in the ListWatch class:

javascript/src/cache.ts

Lines 143 to 147 in 073175f

if (err && err.statusCode === 410) {
    this.resourceVersion = '';
} else if (err) {
    this.callbackCache[ERROR].forEach((elt: ErrorCallback) => elt(err));
    return;

On line 147, the function currently returns early if the error is not a 410 error, which prevents the watch from reconnecting.

Simply patching out this return statement kind of works, but it seems to cause two subsequent connection attempts (probably one for each error?). Since I’m not familiar enough with the internals of this project, I don't feel confident enough to open a PR at this time.
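
For what it's worth, here's a rough sketch of the direction I mean. This is only an illustration, not the actual cache.ts code: the class, the callback list, and restartWatch() are placeholders for whatever ListWatch really uses internally.

// Rough sketch only, not the real doneHandler() from cache.ts.
type ErrorCallback = (err: any) => void

class ListWatchSketch {
    resourceVersion = ''
    errorCallbacks: ErrorCallback[] = []

    async doneHandler(err: any): Promise<void> {
        if (err && err.statusCode === 410) {
            // 410 Gone: the resourceVersion is too old, so relist from scratch.
            this.resourceVersion = ''
        } else if (err) {
            // Report the error, but don't return here: fall through so the
            // watch is re-established after connection errors as well.
            this.errorCallbacks.forEach((cb) => cb(err))
        }
        await this.restartWatch()
    }

    // Placeholder for the existing relist/re-watch logic in ListWatch.
    async restartWatch(): Promise<void> {}
}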

@bverhoeven
Contributor Author

I just noticed that this issue has some similarities to #1496, although that issue has been marked as stale. I hope my example code helps to reproduce the issue.

@bverhoeven
Contributor Author

Something I found while debugging: the first AbortError is likely caused by the fact that the doneCallOnce function in watch.ts reports errors twice.

javascript/src/watch.ts

Lines 46 to 53 in 88184dc

let doneCalled: boolean = false;
const doneCallOnce = (err: any) => {
    if (!doneCalled) {
        controller.abort();
        doneCalled = true;
        done(err);
    }
};

Although the function is intended to propagate errors only once, it calls controller.abort() before setting doneCalled. This appears to trigger the AbortError, which in turn causes doneCallOnce to be invoked again. Since doneCalled hasn’t been set to true yet, the second error gets reported as well.

Moving the doneCalled = true assignment to the top of the function seems to resolve the issue: in my test case, only one error is reported after the change.

        let doneCalled: boolean = false;
        const doneCallOnce = (err: any) => {
            if (!doneCalled) {
                doneCalled = true;
                controller.abort();
                done(err);
            }
        };

Not sure if I should be creating a separate issue for this?

@brendandburns
Contributor

Probably having a separate issue is worthwhile.

It does seem like that early return is the issue.

If you want to send two PRs to fix those two issues (and tests too) that'd be great.

@bverhoeven
Contributor Author

> Probably having a separate issue is worthwhile.

Okay, will do!

> If you want to send two PRs to fix those two issues (and tests too) that'd be great.

Thanks! I'll definitely look into that!

I'd love to, but I have to admit that I have no experience with "nock" (which seems to be the main way this project mocks HTTP requests in its tests), and I've been struggling to replicate the exact scenarios there. I can reproduce the issue just fine with my minimal test code.

The issue with "done" being called twice in particular is hard to reproduce in tests: The fix there seems trivial, but creating a test to reproduce the exact issue is frustratingly hard, it seems to take a very special combination of events which I can't seem to replicate with "nock". I've tried destroying the streams, response objects, and all that, but behavior is different enough that it doesn't trigger the error.

I'll try some different ways, but I was wondering if you'd be opposed to including a minimal HTTP server in the test? The code below seems to be enough to reliably replicate the issue (with "done" being called twice) for me:

const minimalServer = http.createServer((req, res) => {
    res.writeHead(200, {
        'Content-Type': 'application/json',
        'Transfer-Encoding': 'chunked',
    })
    res.flushHeaders()
    res.destroy() // Prematurely close the connection
})
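
To make that concrete, something along these lines is roughly what I had in mind for the test itself. This is only a sketch assuming a mocha-style test like the existing ones; the watch call, the timing, and the assertion style would likely need adjusting to match the project's conventions.

import http from 'node:http'
import { AddressInfo } from 'node:net'
import { strict as assert } from 'node:assert'
import { KubeConfig, Watch } from '@kubernetes/client-node'

// Sketch only: embeds the minimal server above in a test and checks that
// "done" is reported exactly once when the stream is closed prematurely.
it('calls done only once when the connection closes prematurely', async () => {
    const server = http.createServer((req, res) => {
        res.writeHead(200, {
            'Content-Type': 'application/json',
            'Transfer-Encoding': 'chunked',
        })
        res.flushHeaders()
        res.destroy() // Prematurely close the connection
    })
    await new Promise<void>((resolve) => server.listen(0, () => resolve()))
    const port = (server.address() as AddressInfo).port

    const kubeConfig = new KubeConfig()
    kubeConfig.loadFromOptions({
        clusters: [{ name: 'mock-cluster', server: `http://localhost:${port}`, skipTLSVerify: true }],
        users: [{ name: 'mock-user' }],
        contexts: [{ name: 'mock-context', user: 'mock-user', cluster: 'mock-cluster' }],
        currentContext: 'mock-context',
    })

    let doneCalls = 0
    const watch = new Watch(kubeConfig)
    await watch.watch(
        '/api/v1/namespaces/default/pods',
        {},
        () => {},
        () => {
            doneCalls++
        }
    )

    // Crude, but gives the stream error events a moment to propagate.
    await new Promise((resolve) => setTimeout(resolve, 100))
    assert.equal(doneCalls, 1)

    server.close()
})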

I'm open to ideas if you have some other suggestions!

@brendandburns
Contributor

Now that #2389 has merged, I think we can remove the return call that is the problem preventing reconnection.
