Regression issue with keep alive connections #27363

Closed
OrKoN opened this issue Apr 23, 2019 · 27 comments
Labels
http Issues or PRs related to the http subsystem.

Comments

@OrKoN
Contributor

OrKoN commented Apr 23, 2019

  • Version: 10.15.3
  • Platform: Linux
  • Subsystem:

Hi,

We updated Node.js from 10.15.0 to 10.15.3 for a service which runs behind an AWS Application Load Balancer. After that, our test suite revealed an issue we didn't see before the update: HTTP 502 errors thrown by the load balancer. Previously, this happened when the Node.js server closed a connection before the load balancer did. We solved that by setting server.keepAliveTimeout = X, where X is higher than the keep-alive timeout on the load balancer side.

With version 10.15.3, setting server.keepAliveTimeout = X no longer works and we see regular 502 errors from the load balancer. I have checked the Node.js changelog, and it seems there was a change related to keep-alive connections in 10.15.2 (1a7302bd48) which might have caused the issue we are seeing.

Does anyone know if the mentioned change can cause the issue we are seeing? In particular, I believe the problem is that the connection is closed before the specified keep-alive timeout.
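
For reference, a minimal sketch of the kind of setup described above; the 65-second value and the port are hypothetical, the only requirement being that the server's keep-alive timeout exceeds the load balancer's idle timeout:

// Minimal sketch (hypothetical values): keep the Node.js side of the
// connection open longer than the load balancer's idle timeout so the
// server never closes an idle keep-alive connection first.
const http = require('http');

const server = http.createServer((req, res) => {
  res.end('ok');
});

// Assumes the ALB idle timeout is 60 s.
server.keepAliveTimeout = 65 * 1000;

server.listen(3000);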

@BridgeAR BridgeAR added the http Issues or PRs related to the http subsystem. label Apr 24, 2019
@BridgeAR
Member

// cc @nodejs/http

@bnoordhuis
Member

The Slowloris mitigations only apply to the HTTP header parsing stage. Past that stage the normal timeouts apply (barring bugs, of course).

Is it an option for you to try out 10.15.1 and 10.15.2, to see if they exhibit the same behavior?

@OrKoN
Contributor Author

OrKoN commented Apr 24, 2019

In our test suite, there are about 250 HTTP requests. I have run the test suite four times for each of the following Node.js versions: 10.15.0, 10.15.1, 10.15.2. For 10.15.0 and 10.15.1 there were zero HTTP failures. For 10.15.2 there were on average two failures (HTTP 502) per test suite run. In every run a different test case fails, so the failures are not deterministic.

I tried to build a simple Node.js server and reproduce the issue with it, but so far without success. We will try to figure out the exact pattern and volume of requests needed to reproduce the issue. Timing and the speed of the client might matter.

@shuhei
Contributor

shuhei commented Apr 27, 2019

I guess that headersTimeout should be longer than keepAliveTimeout, because after the first request of a keep-alive connection, headersTimeout is applied to the period between the end of the previous request (even before its response is sent) and the first parsing of the next request.

@OrKoN What happens with your test suite if you put a headersTimeout longer than keepAliveTimeout?
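
For illustration, a minimal sketch of the configuration proposed above (the values are hypothetical):

const http = require('http');

const server = http.createServer((req, res) => res.end('ok'));

server.keepAliveTimeout = 65 * 1000;
// Slightly longer than keepAliveTimeout, so the headers timer cannot
// fire before the keep-alive timer on a reused connection.
server.headersTimeout = 66 * 1000;

server.listen(3000);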

@shuhei
Contributor

shuhei commented Apr 28, 2019

Created a test case that reproduces the issue. It fails on 10.15.2 and 10.15.3. (Somehow headersTimeout seems to work only when headers are sent in multiple packets.)

To illustrate the issue with an example of two requests on a keep-alive connection:

  1. A connection is made
  2. The server receives the first packet of the first request's headers
  3. The server receives the second packet of the first request's headers
  4. The server sends the response for the first request
  5. (...idle time...)
  6. The server receives the first packet of the second request's headers
  7. The server receives the second packet of the second request's headers

keepAliveTimeout covers 4-6 (the period between steps 4 and 6). headersTimeout covers 3-7. So headersTimeout should be longer than keepAliveTimeout in order to keep connections alive for the full keepAliveTimeout.

I wonder whether headersTimeout should include 3-6 at all. 6-7 seems more intuitive for the name and should be enough to mitigate Slowloris DoS, because 3-4 is up to the server and 4-6 is covered by keepAliveTimeout.
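
A rough sketch of the two-request scenario in the list above (the timeout values are hypothetical, and this is not the actual linked test case, just an illustration of the timing):

const http = require('http');
const net = require('net');

const server = http.createServer((req, res) => res.end('ok'));
server.keepAliveTimeout = 3000; // covers 4-6 in the list
server.headersTimeout = 1000;   // deliberately shorter than keepAliveTimeout

server.listen(0, () => {
  const { port } = server.address();
  const socket = net.connect(port, '127.0.0.1');

  // Steps 2-3: the first request's headers arrive in two packets.
  socket.write('GET / HTTP/1.1\r\nHost: localhost\r\n');
  setTimeout(() => socket.write('Connection: keep-alive\r\n\r\n'), 100);

  // Step 5: idle for less than keepAliveTimeout but more than headersTimeout.
  // Steps 6-7: the second request's headers arrive in two packets. On the
  // affected versions the server aborts the connection here instead of
  // answering the second request.
  setTimeout(() => socket.write('GET / HTTP/1.1\r\nHost: localhost\r\n'), 2000);
  setTimeout(() => socket.write('Connection: keep-alive\r\n\r\n'), 2100);

  socket.on('data', (d) => process.stdout.write(d));
  socket.on('close', () => server.close());
});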

@OrKoN
Contributor Author

OrKoN commented Apr 29, 2019

@shuhei so you mean that headersTimeout spans multiple requests on the same connection? I have not tried changing headersTimeout because I expected it to apply to a single request only, and we have no long requests in our test suite. It looks like the headers timer should be reset when a new request arrives, but it is set up based on the first request on a connection.

@shuhei
Contributor

shuhei commented Apr 29, 2019

@OrKoN Yes, headersTimeout spans parts of two requests on the same connection, including the interval between the two requests. Before 1a7302bd48, it was only applied to the first request. The commit started resetting the headers timer when a request is done, in order to apply headersTimeout to subsequent requests on the same connection.

@OrKoN
Contributor Author

OrKoN commented Apr 29, 2019

I see. So it looks like an additional place to reset the timer would be the beginning of a new request? And parserOnIncoming is only called once the headers are parsed, so it needs to be some other place then.

P.S. I will run our tests with an increased headersTimeout today to see if it helps.

@OrKoN
Contributor Author

OrKoN commented Apr 29, 2019

So we have applied the workaround (headersTimeout > keepAliveTimeout) and the errors are gone. 🎉

@alaz

alaz commented May 1, 2019

I faced this issue too. I configured my Nginx load balancer to use keepalive when connecting to Node upstreams. I had already seen it dropping connections once and had found the reason. I switched to Node 10 after that and was surprised to see it happening again: Nginx reports that Node closed the connection unexpectedly, and then Nginx disables that upstream for a while.

I have not seen this problem after tweaking header timeouts yesterday as proposed by @OrKoN above. I think this is a serious bug, since it results in load balancers switching nodes off and on.

Why doesn't anybody else find this bug alarming? My guess is that:

  1. there are no traces of it on the Node instances themselves: no log messages, nothing.
  2. web users connecting to Node services directly may simply ignore that a few connections are dropped. The rate was not high in my case (maybe a couple of dozen per day while we serve millions of connections daily), so the chance of a particular visitor experiencing this is relatively small.
  3. I found the bug indirectly, based on the load balancer's logs, and not everyone keeps a close eye on those logs.

@yoavain
Contributor

yoavain commented May 23, 2019

We're having the same problem after upgrading from 8.x to 10.15.3.
However, I don't think it's a regression from 10.15.2 to 10.15.3.
This discussion goes way back to issue #13391.
I forked the example code there and created a new test case that fails on all the following versions:
12.3.1, 10.15.3, 10.15.2, 10.15.1, 10.15.0, 8.11.2

The original code did not fail in a consistent way, which led me to believe there's some kind of race condition where, between the keepAliveTimeout check and the connection termination, a new request can try to reuse the connection.

So I tweaked the test so that:

  1. The server's time to answer a request is 3 * keepAliveTimeout minus a few (random) milliseconds (keeping 2-3 requests alive).
  2. The client fires not just one request, but a request every (exactly) keepAliveTimeout. This makes sure that the client requests are aligned with the server connection's keepAliveTimeout (a rough sketch of this pattern follows at the end of this comment).

The results are pretty consistent:

Error: socket hang up
    at createHangUpError (_http_client.js:343:17)
    at Socket.socketOnEnd (_http_client.js:444:23)
    at Socket.emit (events.js:205:15)
    at endReadableNT (_stream_readable.js:1137:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:9) {
  code: 'ECONNRESET'
}

You can clone the code from yoavain/node8keepAliveTimeout

npm install
npm run test -- --keepAliveTimeout 5000
(Note that keepAliveTimeout is also the client request interval.)

When setting keepAliveTimeout to 0, the problem is gone.

npm run test -- --keepAliveTimeout 5000 --keepAliveDisabled
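
For illustration, a rough sketch of the request pattern described above (hypothetical port and values; the linked repository contains the full test):

const http = require('http');

const keepAliveTimeout = 5000;

const server = http.createServer((req, res) => {
  // Answer just under 3 * keepAliveTimeout after the request arrives,
  // keeping a few requests in flight on the same agent.
  setTimeout(() => res.end('ok'), 3 * keepAliveTimeout - Math.floor(Math.random() * 50));
});
server.keepAliveTimeout = keepAliveTimeout;

server.listen(3000, () => {
  const agent = new http.Agent({ keepAlive: true });
  // Fire a request exactly every keepAliveTimeout, so the client's reuse of
  // an idle socket races the server's decision to close that socket.
  setInterval(() => {
    http
      .get({ host: 'localhost', port: 3000, agent }, (res) => res.resume())
      .on('error', (err) => console.error(err.code)); // ECONNRESET on a lost race
  }, keepAliveTimeout);
});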

@dansantner

Thanks for the info, guys! This is a nasty issue that reared its head when we went straight from 10.14 to 12. Node kept dropping our connections before the AWS Load Balancer knew about it. Once I set ELB timeout < keepAliveTimeout < headersTimeout (we weren't even setting that one), the problem went away.

@markfermor

markfermor commented Oct 16, 2019

The original code did not fail in a consistent way, which led me to believe there's some kind of race condition where, between the keepAliveTimeout check and the connection termination, a new request can try to reuse the connection.

I can confirm we're seeing this as well (v10.13.0). We have Nginx in front of Node.js within K8s. We were seeing random "connection reset by peer" or "upstream prematurely closed connection" errors for requests Nginx was sending to Node.js apps. On all these occasions the problem occurred on connections established by Nginx to Node. Right at the default 5-second keepAliveTimeout on the Node.js side, Nginx decided to reuse its open/established connection to the Node process and send another request (technically just outside the 5-second timeout limit on the Node side, by <2 ms). Node.js accepted this new request over the existing connection and responded with an ACK packet, then <2 ms later followed up with an RST packet closing the connection. However, stracing the Node.js process I could see that the app code had received the request and was processing it, but before the response could be sent, Node had already closed the connection. I would second the thought that there is a slight race condition between the point where the connection is about to be closed by Node.js and the point where it is still accepting an incoming request.

To avoid it, we simply increased the Node.js keepAliveTimeout to be higher than Nginx's, thus giving Nginx control over the keep-alive connections. http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive_timeout

A screenshot of a packet capture taken on the Node.js side of the connection is attached:
[image: packet capture]

@kirillgroshkov

Wow, very interesting thread. I have a suspicion that we're facing a similar issue on App Engine Node.js Standard: ~100 502 errors a day out of ~1M requests per day in total (~0.01% of all requests).

@michalschott

Can confirm this is still the case in the v10.19.0 release.

OrKoN added a commit to OrKoN/node that referenced this issue Mar 17, 2020
For keep-alive connections, the headersTimeout may fire during
subsequent request because the measurement was reset after
a request and not before a request.

Fixes: nodejs#27363
@sroze

sroze commented Mar 24, 2020

We have investigated an issue affecting a very small subset of requests as well, and this was the root cause. The behaviour we see is exactly what @markfermor described; you can read more in our investigation details. The following configuration lines did indeed solve the issue:

server.keepAliveTimeout = 76 * 1000;
server.headersTimeout = 77 * 1000;

@kirillgroshkov

This is what solved it for us in AppEngine:

this.server.keepAliveTimeout = 600 * 1000

BethGriggs pushed a commit that referenced this issue Apr 7, 2020
For keep-alive connections, the headersTimeout may fire during
subsequent request because the measurement was reset after
a request and not before a request.

PR-URL: #32329
Fixes: #27363
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
@vrozov

vrozov commented Apr 9, 2020

@mcollina @OrKoN Is there a plan to release 10.15.4 with the fix?

@mcollina
Member

mcollina commented Apr 9, 2020

Node 10.15.4 is not happening. At best, it might be backported to Node 10.20.1 or a similar release.

@wyardley

wyardley commented Apr 9, 2020

Would love to see it in 10.20.x or similar if it’s at all feasible to port it back.

targos pushed a commit that referenced this issue Apr 12, 2020
For keep-alive connections, the headersTimeout may fire during
subsequent request because the measurement was reset after
a request and not before a request.

PR-URL: #32329
Fixes: #27363
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
@glasser
Contributor

glasser commented May 14, 2020

Note that this also appears to be broken in v12, not just the recent v10s.

@yokomotod

Since #32329 was merged, I no longer need to set headersTimeout bigger than keepAliveTimeout, right?

MylesBorins pushed a commit that referenced this issue Aug 18, 2020
For keep-alive connections, the headersTimeout may fire during
subsequent request because the measurement was reset after
a request and not before a request.

Backport-PR-URL: #34131
PR-URL: #32329
Fixes: #27363
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
@Xilis

Xilis commented Oct 7, 2020

@yokomotod if you're asking for v12, v12.19.0 contains the fix

v12.18.4...v12.19.0#diff-feaf3339998a19f0baf3f82414762c22

https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V12.md#12.19.0

@Samjin

Samjin commented Dec 17, 2021

@yokomotod if you're asking for v12, v12.19.0 contains the fix

v12.18.4...v12.19.0#diff-feaf3339998a19f0baf3f82414762c22

https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V12.md#12.19.0

If the default is 60 s, then we still need to override that if the ELB has a longer timeout, right?

@Xilis

Xilis commented Dec 17, 2021

@yokomotod if you're asking for v12, v12.19.0 contains the fix
v12.18.4...v12.19.0#diff-feaf3339998a19f0baf3f82414762c22
https://github.com/nodejs/node/blob/master/doc/changelogs/CHANGELOG_V12.md#12.19.0

If the default is 60 s, then we still need to override that if the ELB has a longer timeout, right?

Yes, the application's keep-alive timeout should be higher than whatever is in front of it (nginx, ELB, ...). Whether to change the timeout on the application side or the load balancer side depends on your setup (e.g. the Google Cloud HTTP load balancer does not allow changing the timeout value). I'd suggest making your application agnostic to how it is deployed and making this value configurable, for example through an environment variable (see the sketch at the end of this comment).

The Node.js default depends on the version you're running (I think it is 5 seconds for all non-obsolete versions); you can check for your version in the docs here.
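
A minimal sketch of making the timeout configurable through an environment variable, as suggested above (the variable names are hypothetical):

const http = require('http');

// Hypothetical environment variable; defaults to the documented 5 s.
const keepAliveTimeout = Number(process.env.KEEP_ALIVE_TIMEOUT_MS || 5000);

const server = http.createServer((req, res) => res.end('ok'));
server.keepAliveTimeout = keepAliveTimeout;
// On versions without the fix, keep headersTimeout above keepAliveTimeout.
server.headersTimeout = keepAliveTimeout + 1000;

server.listen(Number(process.env.PORT) || 3000);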

ijjk added a commit to vercel/next.js that referenced this issue Mar 17, 2023
…46052)

Resolves #39689, partially resolves #28642 (see notes below)
Inspired by #44627

In #28642 it was also asked to expose `server.headersTimeout`, but it is
probably not needed for most use cases and not implemented even in `next
start`. It was needed to change this option before
nodejs/node#27363.
There also exists a rare bug that is described here
nodejs/node#32329 (comment). To fix
this exposing `server.headersTimeout` might be required both in
`server.js` and in `next start`.

Co-authored-by: JJ Kasper <jj@jjsweb.site>
@pankaz

pankaz commented Feb 7, 2024

Hi everyone. Thank you for sharing these insights. Is there a chance that the fix is missing in v20.11.0?

This issue does not occur in 16.16.0 or 18.19.0. However, I'm noticing 502 errors when I use v20.11.0. I would love to get feedback on whether other members of the community are facing this issue.

@hjr3

hjr3 commented Apr 3, 2024

However, I'm noticing 502 errors when I use v20.11.0.

I am seeing the same issue on v20.9.0 using Fastify as my server.

abhishekumar-tyagi pushed a commit to abhishekumar-tyagi/node that referenced this issue May 5, 2024
For keep-alive connections, the headersTimeout may fire during
subsequent request because the measurement was reset after
a request and not before a request.

PR-URL: nodejs/node#32329
Fixes: nodejs/node#27363
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>