
Multiple parallel long fetches with overridden, non-zero timeouts fail all the time #1887

Closed
xconverge opened this issue Jan 26, 2023 · 22 comments · Fixed by #1904
Labels
bug Something isn't working

Comments

@xconverge
Contributor

Bug Description

When running multiple LONG running fetches in parallel, some of them seem to just die and hit a timeout that doesn't really seem relevant:

name: HeadersTimeoutError
    message: Headers Timeout Error
    stack: >-
      HeadersTimeoutError: Headers Timeout Error
          at onParserTimeout (/src/archive/undici/lib/client.js:908:28)
          at callTimer (/src/archive/undici/node_modules/@sinonjs/fake-timers/src/fake-timers-src.js:743:24)
          at doTickInner (/src/archive/undici/node_modules/@sinonjs/fake-timers/src/fake-timers-src.js:1311:29)
          at doTick (/src/archive/undici/node_modules/@sinonjs/fake-timers/src/fake-timers-src.js:1392:20)
          at Object.tick (/src/archive/undici/node_modules/@sinonjs/fake-timers/src/fake-timers-src.js:1400:20)
          at Server.<anonymous> (/src/archive/undici/test/fetch/fetch-timeouts.js:32:11)
          at Server.emit (node:events:513:28)
          at parserOnIncoming (node:_http_server:1065:12)
          at HTTPParser.parserOnHeadersComplete (node:_http_common:117:17)
    code: UND_ERR_HEADERS_TIMEOUT

Reproducible By

Tweaking minutesPerRequest, numParallelRequests, and timeout here makes for some pretty confusing results. When timeout is 0, everything seems to work correctly. I suspect there are a few timers that are overlapping each other or getting shared somehow.

Simply set timeout to 0 and see that the test passes; set timeout to 30 minutes and see that it fails.

"use strict";

const { test } = require("tap");

const { fetch, Agent } = require("../..");
const { createServer } = require("http");
const FakeTimers = require("@sinonjs/fake-timers");

// Tweaking request length can have an impact too potentially (Setting to 1 seems to pass, and setting it to 10 min fails on my machine)
const minutesPerRequest = 10;

// On my machine setting this to 0 passes, and setting it to something very high fails with a UND_ERR_HEADERS_TIMEOUT
const timeout = 1000 * 60 * (minutesPerRequest + 5);
// const timeout = 0;

// Tweaking num requests to either 1 or multiple has different impacts
const numParallelRequests = 5;

test("Fetch very long parallel requests, timeout overridden", (t) => {
  const msToDelay = 1000 * 60 * minutesPerRequest;

  t.setTimeout(undefined);
  t.plan(numParallelRequests);

  const clock = FakeTimers.install();
  t.teardown(clock.uninstall.bind(clock));

  const server = createServer((req, res) => {
    setTimeout(() => {
      res.end("hello");
    }, msToDelay);
    clock.tick(msToDelay + 1);
  });
  t.teardown(server.close.bind(server));

  server.listen(0, () => {
    for (let i = 0; i < numParallelRequests; i++) {
      fetch(`http://localhost:${server.address().port}`, {
        path: "/",
        method: "GET",
        dispatcher: new Agent({
          headersTimeout: timeout,
          connectTimeout: timeout,
          bodyTimeout: timeout,
        }),
      })
        .then((response) => response.text())
        .then((response) => {
          t.equal("hello", response);
        })
        .catch((err) => {
          // This should not happen, a timeout error should not occur
          t.error(err);
        });
    }

    clock.tick(msToDelay - 1);
  });
});

Expected Behavior

The fetches to succeed

Logs & Screenshots

Environment

Node v18.13.0

Additional context

@mcollina
Member

Does this problem happen only with fake timers or "live" too?

@xconverge
Contributor Author

xconverge commented Jan 26, 2023

Does this problem happen only with fake timers or "live" too?

I am battling something brutal with doing 5x longish 20-minute requests not working on a real, deployed machine. 1 or 2 of the shorter requests work, but then the other few lingering ones just fully "die" and eventually trigger the headers timeout error if that is set. Running the requests individually has no issues. I don't necessarily experience the same issue on my local machine yet, so am still working on figuring out repro cases.

So there is definitely SOMETHING happening live. I will run this test right now and edit this comment with the results of running it without fake timers.

I tried changing the timeouts to 0 and still experience the issue, so I suspect there is more to this than just this specific example, but I am only able to do one step at a time.

I have had to switch back to axios for these requests and everything is fine with axios while I try to resolve this on the side.

@xconverge
Contributor Author

I am unable to get this same test to fail on my machine without fake timers... will keep trying to find ways to repro what I am seeing on my server, but any guidance on why the test is failing that might be a clue would be great.

@xconverge
Contributor Author

xconverge commented Jan 30, 2023

I have to give up and go back to axios for this. I have tried so many different things to reproduce this locally and debug it, but I can't, and I understand this is not that actionable/helpful unfortunately. I can't change/control the server I am accessing, and I can't reproduce anything like this with any API servers (using the benchmarks/server as a base) I spin up (on a remote machine on my LAN) with various levels of timeouts, data sizes returned, etc.

I am doing 4 parallel 5-to-25-minute requests from a single Node app on a Linux (Ubuntu) server on EC2 to 1 server that is running on Google Compute Engine. I can get a response for 1 or 2 of the 4 requests based on what I have changed, but never all 4 at the same time. The sockets all show as established (ss command). If I do all 4 requests in series (waiting for the previous to complete before starting the next), they all work fine.

On my local dev machine (Ubuntu), all 4 long fetches happen fine, the code all works fine, no issues at all.

I suspect there is something that is specific to the server I am accessing or some weird edge of an edge case here but have just run out of ideas and things to try unfortunately.

Things I have tried on the EC2 server (after first checking that all 4 requests worked with the change locally) without luck (a rough sketch of a couple of these variations follows the list):

  1. 1 agent shared for all 4 requests
  2. 1 agent per request
  3. pipelining: 0
  4. pipelining: 10
  5. header and body timeouts of 0
  6. header and body timeouts of very high numbers ~60 minutes
  7. Client() API instead of fetch
  8. blocking: true flag
  9. reset: true flag
  10. keepAliveMaxTimeout increased to 60 minutes
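
A minimal sketch, assuming placeholder URLs and made-up values, of what a couple of those variations look like when expressed through undici's Agent (the shared-Agent vs. per-request-Agent cases, with the pipelining and timeout tweaks):

const { fetch, Agent } = require("undici");

// Variant: one shared Agent, pipelining disabled, header/body timeouts disabled,
// keep-alive cap raised to ~60 minutes.
const sharedAgent = new Agent({
  pipelining: 0,
  headersTimeout: 0,
  bodyTimeout: 0,
  keepAliveMaxTimeout: 60 * 60 * 1000
});

// Placeholder URLs; the real server cannot be shared here.
const urls = ["https://example.invalid/a", "https://example.invalid/b"];

async function run () {
  await Promise.all(
    urls.map((url) =>
      fetch(url, {
        method: "GET",
        // Variant: swap `sharedAgent` for `new Agent({ ... })` to get one agent per request.
        dispatcher: sharedAgent
      }).then((res) => res.text())
    )
  );
}

run().catch(console.error);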

The query looks like this, pretty straightforward other than the SSL pinning:

    fetch(`https://serverxyz`, {
      method: "GET",
      headers: {
        authorization: xyz,
      },
      dispatcher: new Agent({
        headersTimeout: 0,
        bodyTimeout: 0,
        connect: {
          ca,
          checkServerIdentity: (
            hostname: string,
            cert: PeerCertificate
          ): Error | undefined => {
            const fingerprint = cert.fingerprint
            if (fingerprints.includes(fingerprint)) {
              return undefined
            } else {
              return Error(`SSL Pin Failed`)
            }
          },
        },
      }),
    }).then((response) => {
      // xyz; not getting to here on the remote for some of the requests
    })

@mcollina
Member

My experience with these kinds of issues tells me the fakeTimers problem is not really related.

What this looks like: a firewall is truncating the connection in a way that we do not recognize, generating an error that we ignore.

@xconverge
Contributor Author

I just ran a bash script from the server fetching these requests in parallel with curl: a total of 8 requests, ranging from 2 minutes to 20 minutes, all at the same time, the same as I call from my Node app. All 8 finished fine with no issues.

I do think there is an error that is being ignored somehow, but am unsure how to even debug that.

The only other thing I can think to try right now is to replace undici with node-fetch and simply say "yep it works or no it doesn't". If there is anywhere I can put some more debugging I am open to it!

@xconverge
Contributor Author

xconverge commented Jan 30, 2023

Aha! node-fetch acts nearly identically to undici and some of the requests fail!

This looks very interesting to me so I will take a look a bit to see if it makes a difference for me

https://github.com/node-fetch/node-fetch#custom-highwatermark

#923

Initial testing of changing from highWaterMark: 0 to highWaterMark: 5 made no difference; it doesn't even seem to be getting that far.

It kind of looks like the last request that is asked for completely overrides the earlier request. I added quite a bit of logging and it really seems like the requests just disappear. I don't know where to go from here with my debugging :/

@mcollina
Member

mcollina commented Feb 1, 2023

Can you try undici.request instead of fetch()?

@xconverge
Contributor Author

Can you try undici.request instead of fetch()?

undici.request behaves the same as fetch for this case after some testing

@mcollina
Member

mcollina commented Feb 1, 2023

I wonder what axios is doing to make the request succeed all the time.

@xconverge
Contributor Author

xconverge commented Feb 1, 2023

I wonder what axios is doing to make the request succeed all the time.

Yeah, I am really out of my depth here. How would I go about instrumenting the .js to really make sure keep-alives are working as intended for these long queries? Tracing around kKeepAliveTimeoutValue and trying to understand the keep-alive handling within the codebase has me a bit lost. I did already try to remove this as a variable by using "pipelining: 0", but that might not have been the right way to go about it.

https://github.com/axios/axios/blob/54d3facb3b032665e6ae84e157073702b5c2e4d9/lib/adapters/http.js#L575

It just really seems like the shortest request is causing the others (longer ones) to be destroyed without any notice/events being thrown/shown to the user

I am also looking at the util.destroy(socket) calls; maybe I will put a log statement in destroy() and see if that's getting called.
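
Another thing I might try, instead of patching the library, is undici's diagnostics channels. A rough sketch, assuming the channel names and message shapes from undici's DiagnosticsChannel docs:

const diagnosticsChannel = require("diagnostics_channel");

// Log when a client socket connects and when a request errors out,
// to try to see which of the parallel requests go quiet.
diagnosticsChannel.channel("undici:client:connected").subscribe(({ socket }) => {
  console.log("connected", socket.remoteAddress, socket.remotePort);
});

diagnosticsChannel.channel("undici:request:error").subscribe(({ request, error }) => {
  console.log("request error", request.path, error);
});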

@xconverge
Contributor Author

xconverge commented Feb 2, 2023

I am surprised how readable axios is, I never looked into it. Doing a tiny bit of digging, but it is going to be something small. Really a shame I still can't find a local/developer-friendly repro. I can repro it EVERY time on my prod environment though, so at least it is not ephemeral or intermittent.

This is nuanced and seems to differ a bit with how undici handles it axios/axios#1752
This is also nuanced and the attached issue is interesting axios/axios#3916

@mcollina
Member

mcollina commented Feb 2, 2023

I think those long connections are timing out at the TCP level due to inactivity. We could think of providing an option to enable TCP keepalive.

https://nodejs.org/api/net.html#socketsetkeepaliveenable-initialdelay
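
For reference, a minimal sketch of what enabling TCP keepalive looks like on a plain Node socket (illustrative host and values, not undici's API):

const net = require("net");

const socket = net.connect({ host: "example.com", port: 80 }, () => {
  // Ask the OS to start sending TCP keepalive probes after 60 s of inactivity.
  // This is unrelated to the HTTP Keep-Alive header / connection reuse.
  socket.setKeepAlive(true, 60000);
});
socket.on("error", console.error);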

@xconverge
Contributor Author

I think those longs connections are timing out at the TCP level due to inactivity. We could think of providing an option to enable TCP keepalive.

https://nodejs.org/api/net.html#socketsetkeepaliveenable-initialdelay

I will try hard coding that somewhere when I get a chance

@xconverge
Contributor Author

Great news, thanks for sticking with me here.

I added

  const keepAliveVal = 1000 * 60
  socket.setKeepAlive(true, keepAliveVal);
  console.log(`set socket keepalive to ${keepAliveVal}`)

right here:

const socket = client[kSocket]

I tested it, and success: my requests all go through and the performance is on par with axios!

I verified the keep-alive behavior before/after with ss -te

Excellllllent! Now the question is: would you be open to me opening a PR, and how should it relate to the existing ClientOptions parameters?

https://undici.nodejs.org/#/docs/api/Client?id=parameter-clientoptions

keepAliveMaxTimeout number | null (optional) - Default: 600e3 - The maximum allowed keepAliveTimeout when overridden by keep-alive hints from the server. Defaults to 10 minutes.
keepAliveTimeout number | null (optional) - Default: 4e3 - The timeout after which a socket without active requests will time out. Monitors time between activity on a connected socket. This value may be overridden by keep-alive hints from the server. See [MDN: HTTP - Headers - Keep-Alive directives](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Keep-Alive#directives) for more details. Defaults to 4 seconds.
keepAliveTimeoutThreshold number | null (optional) - Default: 1e3 - A number subtracted from server keep-alive hints when overriding keepAliveTimeout to account for timing inaccuracies caused by e.g. transport latency. Defaults to 1 second.
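
For contrast, a small sketch of how those existing options sit on an Agent; as far as I can tell they only govern how long an idle pooled socket is kept open for reuse (HTTP keep-alive), which is a different mechanism from the TCP-level keepalive probes discussed above:

const { Agent } = require("undici");

// These control reuse of *idle* sockets, not probes on a socket that is
// busy waiting for a very slow response.
const agent = new Agent({
  keepAliveTimeout: 4e3,           // default: 4 s of idle time before closing
  keepAliveMaxTimeout: 600e3,      // default: cap applied to server keep-alive hints
  keepAliveTimeoutThreshold: 1e3   // default: subtracted from server hints
});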

@mcollina
Member

mcollina commented Feb 2, 2023

It should be a new parameter: tcpKeepAliveInterval: number | null - Default: null.
@xconverge you can go ahead and start a PR and we can discuss the default value for this later.

WDYT @ronag? Should this be enabled by default? It might actually help with some ECONNRESETs.
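
If I read the proposal right, usage would presumably look something like this (hypothetical sketch; the option does not exist in undici at the time of writing, the name follows the suggestion above):

const { Agent } = require("undici");

// Hypothetical: send TCP keepalive probes after 60 s of socket inactivity.
const agent = new Agent({
  tcpKeepAliveInterval: 60e3
});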

@ronag
Member

ronag commented Feb 3, 2023

It should be a new parameter: tcpKeepAliveInterval: number | null - Default: null. @xconverge you can go ahead and start a PR and we can discuss the default value for this later.

WDYT @ronag? Should this be enabled by default? It might actually help with some ECONNRESETs.

Is there any downside to it? I don't mind having it as default.

@ronag
Member

ronag commented Feb 3, 2023

@xconverge would you mind opening a PR?

@ronag
Member

ronag commented Feb 3, 2023

Actually, I think this might already work?

setGlobalDispatcher(new Agent({
  connect: {
    keepAlive: true
  }
}))

@ronag
Member

ronag commented Feb 3, 2023

We just need to override the defaults in connect.js and document that we have a different default than net/tls.connect.
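
A hypothetical sketch of what "different defaults" could mean, not the actual undici connect.js code: keepalive on by default, with any user-supplied connect options still taking precedence.

// Hypothetical helper; the 60 s initial delay is an assumption, not undici's value.
function withKeepAliveDefaults (userOptions = {}) {
  return {
    keepAlive: true,             // net/tls.connect defaults this to false
    keepAliveInitialDelay: 60e3,
    ...userOptions               // explicit user options win over the defaults
  };
}

// e.g. net.connect(withKeepAliveDefaults({ host, port }))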

@xconverge
Contributor Author

xconverge commented Feb 3, 2023

@ronag I made the latest changes in the PR, but yes I now see exactly what you are saying

Let me run a test with the current version of undici with:

setGlobalDispatcher(new Agent({
  connect: {
    keepAlive: true,
    keepAliveInitialDelay: 60000
  }
}))

If this is true, then should this just be a documentation exercise, or do we still want to continue with some logic/parsing of an additional option where we set both of these by default if not defined explicitly by the user? I feel like having 2 flags controlled by 1 undici-specific parameter might be kind of weird, but doable (i.e. a user sets keepAlive: false and tcpKeepAliveInterval to non-zero); it just needs a decent-sized validation before overriding them, but I am ok with that. A rough sketch of the mapping I have in mind is below.
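
A rough sketch, hypothetical and not the actual PR, of mapping a single undici-level option onto the two net/tls connect flags without silently overriding values the user set explicitly (the function name and error text are made up):

function resolveSocketKeepAlive ({ tcpKeepAliveInterval, connect = {} }) {
  const opts = { ...connect };
  if (tcpKeepAliveInterval != null) {
    if (opts.keepAlive === false) {
      // The user explicitly disabled keepalive but also asked for a probe interval.
      throw new Error("tcpKeepAliveInterval was set but connect.keepAlive is explicitly false");
    }
    if (opts.keepAlive === undefined) {
      opts.keepAlive = true;
    }
    if (opts.keepAliveInitialDelay === undefined) {
      opts.keepAliveInitialDelay = tcpKeepAliveInterval;
    }
  }
  return opts;
}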

@xconverge
Contributor Author

xconverge commented Feb 3, 2023

Setting these in socket = tls.connect({}) doesn't work because we are upgrading/assigning the socket, but manually calling socket.setKeepAlive with them after connect() works, so I think this will still require a code change (a tiny sketch of the two approaches is below the quoted docs):

keepAlive <boolean> If set to true, it enables keep-alive functionality on the socket immediately after the connection is established, similarly on what is done in socket.setKeepAlive([enable][, initialDelay]). Default: false.
keepAliveInitialDelay <number> If set to a positive number, it sets the initial delay before the first keepalive probe is sent on an idle socket. Default: 0.
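
A tiny sketch of the two approaches (illustrative host/port, not undici's connect.js):

const tls = require("tls");

// Passing the flags through tls.connect() options did not seem to survive the
// socket upgrade in my setup:
const socket = tls.connect({
  host: "example.com",
  port: 443,
  keepAlive: true,
  keepAliveInitialDelay: 60e3
});

// ...whereas setting keepalive on the established socket does work:
socket.on("secureConnect", () => {
  socket.setKeepAlive(true, 60e3);
});
socket.on("error", console.error);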
