
Envoy (re)uses connection after receiving FIN from upstream #6815

Closed
lvgilmore opened this issue May 6, 2019 · 14 comments
Labels: question (Questions that are neither investigations, bugs, nor enhancements), stale (stalebot believes this issue/PR has not been touched recently)

Comments

@lvgilmore

Description:
With Envoy serving as an HTTP/1.1 proxy, Envoy sometimes tries to reuse a connection even after receiving a FIN from the upstream. In production I saw this issue even with a couple of seconds between the FIN and the next request, and Envoy never returned a FIN ACK (just a FIN from the upstream to Envoy, then a PUSH with a new HTTP request from Envoy to the upstream). Envoy then returns a 503 UC even though the upstream is up and operational.

The repro steps below show a slightly modified version of this issue: to encourage Envoy to reuse a connection, I set the circuit breakers to low max_connections and max_requests. Instead of returning UO, Envoy tries to reuse a connection that has received a FIN, so we get a 503 UC.

Repro steps:
The upstream is a simple web server that sleeps 2s and then returns a small JSON response, running as a Docker container on the same network as Envoy. Envoy runs as a container as well, using version
8c8b068/1.11.0-dev/Clean/RELEASE/BoringSSL
The client is just cURL, run roughly 7 times concurrently.
We can see the error:
[screenshot: curl output showing the error]
In logs it looks like

[2019-05-05T08:44:21.926Z] "GET /6 HTTP/1.1" 503 UO 0 81 1 - "-" "curl/7.54.0" "bd5c20e6-10a4-4654-81fb-c2ed46d175fe" "localhost:9080" "-"
[2019-05-05T08:44:21.909Z] "GET /shit HTTP/1.1" 200 - 0 16 2007 2004 "-" "curl/7.54.0" "1bbcaca7-8b33-43fb-9cf2-ef26841b5321" "localhost:9080" "172.22.0.3:8080"
[2019-05-05T08:44:21.915Z] "GET /3 HTTP/1.1" 200 - 0 16 2004 2004 "-" "curl/7.54.0" "7d19bc5f-a267-43c3-b3c0-6a042b24a9f5" "localhost:9080" "172.22.0.3:8080"
[2019-05-05T08:44:21.917Z] "GET /1 HTTP/1.1" 503 UC 0 95 2003 - "-" "curl/7.54.0" "23764aa2-2f3e-4ea0-bfed-dd0bb9b977b3" "localhost:9080" "172.22.0.3:8080"
[2019-05-05T08:44:21.920Z] "GET /2 HTTP/1.1" 200 - 0 16 2005 2004 "-" "curl/7.54.0" "10a76e91-cdf0-4560-892c-4b64c8462202" "localhost:9080" "172.22.0.3:8080"
[2019-05-05T08:44:21.925Z] "GET /5 HTTP/1.1" 200 - 0 16 4010 4008 "-" "curl/7.54.0" "214df7d2-da2b-4f71-ad4e-f4c02c30d7d8" "localhost:9080" "172.22.0.3:8080"
[2019-05-05T08:44:21.926Z] "GET /4 HTTP/1.1" 200 - 0 16 6017 6017 "-" "curl/7.54.0" "a98e7bdd-6de7-4165-bea1-0f248cf55f9f" "localhost:9080" "172.22.0.3:8080"

Of most interest is the 4th line: "GET /1 HTTP/1.1" 503 UC.

Config:

admin:
  access_log_path: /dev/stdout
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9000

node:
  cluster: mycluster
  id: test-id

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 9080 }
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          access_log:
            config: {path: /dev/stdout}
            name: envoy.file_access_log
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: upstream }
          http_filters:
          - name: envoy.router

  clusters:
  - name: upstream
    type: STRICT_DNS
    connect_timeout: 0.25s
    hosts: [{ socket_address: { address: 'upstream' , port_value: 8080 }}]
    circuit_breakers:
      thresholds:
      - {max_connections: 2, max_pending_requests: 3, max_requests: 4,
        max_retries: 0, priority: default}

Logs:
trace.log

@alyssawilk
Contributor

I'm a bit confused by this trace. I'm reading "Upstream sends Envoy a GET" and "Envoy sends the upstream 1 200/OK". Do you have the names inverted in your trace?

Assuming so, I believe there's always going to be a race condition (helped by but not entirely fixed by #2871) where Envoy may reuse the connection before processing the FIN. In your case it looks like you're sending HTTP/1.0 responses where the connection should not by default be reused. You could file a feature request for better upstream HTTP/1.0 handling, or you could configure retries to handle the race.
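For reference, a minimal sketch of the retry approach applied to the route config from the original report; the retry_on conditions and num_retries here are illustrative, not a confirmed fix, and the circuit breaker's max_retries: 0 in that config would likely also need to be raised for retries to be admitted:

routes:
- match: { prefix: "/" }
  route:
    cluster: upstream
    retry_policy:
      # retry when the upstream connection is reset or the connect fails,
      # which covers the 503 UC race described above
      retry_on: "reset,connect-failure"
      num_retries: 1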

@alyssawilk added the "question" label on May 6, 2019
@stale

stale bot commented Jun 5, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

The stale bot added the "stale" label on Jun 5, 2019
@stale

stale bot commented Jun 12, 2019

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

The stale bot closed this issue as completed on Jun 12, 2019
@winguse
Contributor

winguse commented Mar 16, 2020

I observed this issue in our production recently.

[screenshot: packet capture of the Envoy–node.js connection]

We are using an older version, v1.9.0, and the backend is a node.js service.

As you can see from the packets, node.js sends a FIN and Envoy acks the FIN, but Envoy still sends a new HTTP request on that connection, and node.js responds with an RST, terminating the request.

Envoy's downstream then sees "upstream connect error or disconnect/reset before headers" (HTTP 503).

Our cluster config is simple:

name: local_cluster
connect_timeout: 15s
type: STATIC
lb_policy: ROUND_ROBIN
hosts:
- socket_address:
    address: 127.0.0.1
    port_value: 15225
health_checks:
- timeout: 5s
  interval: 10s
  healthy_threshold: 1
  unhealthy_threshold: 2
  reuse_connection: true
  http_health_check:
    path: /api/health

I believe this is a race condition.

@winguse
Contributor

winguse commented Mar 16, 2020

We worked around this by adding:

common_http_protocol_options:
  # node.js is going to close the connection when idle 5s,
  # we set it to 4s to let envoy close the conn early
  idle_timeout: 4s 
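
For context, a sketch of where this sits in the cluster definition from my earlier comment (only the common_http_protocol_options block is new; health_checks omitted for brevity):

name: local_cluster
connect_timeout: 15s
type: STATIC
lb_policy: ROUND_ROBIN
common_http_protocol_options:
  # node.js closes idle keep-alive connections after 5s,
  # so keep Envoy's idle timeout below that
  idle_timeout: 4s
hosts:
- socket_address:
    address: 127.0.0.1
    port_value: 15225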

But I believe this should be fixed permanently in Envoy.

@winguse
Contributor

winguse commented Mar 17, 2020

Hi @alyssawilk, could you review whether this issue still looks valid to you?

@alyssawilk
Contributor

Hey @winguse
Assuming I'm reading this right, I see your node.js sending a FIN and Envoy using that connection. There's no indication that Envoy had gotten around to reading the FIN before it posted to the open connection, though, so I'm not sure why this would be anything but the race previously discussed, unless I'm missing something?
Any time your upstream might time out idle connections before your downstream does, you can run into race conditions like this. I think in this case adjusting the timeouts to avoid the race makes sense, and I'm glad it works for you.

@winguse
Contributor

winguse commented Mar 23, 2020

Thanks @alyssawilk

Any time your upstream might time out idle connections before your downstream you can run into race conditions like this

As we can see in the screenshot, Envoy had acked the FIN and then sent the request again 1.4 seconds later, so I guess this race condition should be avoidable. Does that seem valid to you?

On the other hand, in our case, we find this happens quite frequently.

@alyssawilk
Contributor

Acked what FIN? I don't see the upstream sending a FIN. I had assumed Envoy initiated the connection close - it's possible for a close to be FIN + ack-of-prior-data. From the trace I'm not seeing this being [FIN + ack-of-peer-fin].

What I see is Envoy sending an HTTP/1.0 response, without keep-alive, which means the connection will be implicitly closed. It then closes the connection with a FIN and the peer attempts to send an illegal follow-up request (/1) and gets RST.

It's not clear to me why Envoy is sending a 1.0 response to a 1.1 request, or why your Envoy is acting as a server to an upstream client (did you mean downstream?), but I don't see anything problematic at the TCP level. I think your peer needs to understand not to reuse a TCP connection if it gets an HTTP/1.0 response without "connection: keep-alive".

Are you using an older version of Envoy? There was a bug with HTTP/1.0 processing where it used to latch "speaking HTTP/1.0" and that caused some confusion, which might be causing the downgrade behavior I observe, and might clear up your problem if your peer isn't paying close attention to the version it's getting back.

@winguse
Contributor

winguse commented Mar 25, 2020

@alyssawilk sorry, my bad for not describing my situation clearly.

See my #6815 (comment), the last 4 packets:

[screenshot: last four packets of the capture]

Envoy did ack the FIN but started to send another request 1.4 seconds later.

And in my case, Envoy <--> Node.js are speaking HTTP/1.1 to each other.

This issue was found in 1.9; I am not sure if it still happens with the latest build of Envoy.

@alyssawilk
Contributor

I see Envoy sending [FIN ACK]. That does not mean it's acking a fin. It can be sending a fin and acking prior data. I do not see node.js sending a FIN, so I believe the [FIN, ACK] is [FIN+prior-data-ack] not [FIN + ack-of-peer-fin] as I said above.

If you're running 1.9 you are almost certainly encountering this:
#7464
Envoy is closing the connection (as I read the trace) because it is speaking HTTP/1.0. If you update, it should fix your problem!

@winguse
Contributor

winguse commented Mar 26, 2020

A few notes:

I see Envoy sending [FIN ACK]. That does not mean it's acking a fin. It can be sending a fin and acking prior data. I do not see node.js sending a FIN, so I believe the [FIN, ACK] is [FIN+prior-data-ack] not [FIN + ack-of-peer-fin] as I said above.

In #6815 (comment), Node.js is listening on port 15225 and Envoy initiated the connection from port 45146, so the [FIN, ACK] was sent by Node.js.

In any case, once we have upgraded to the latest Envoy, I will let you know if we are still seeing this problem.

Thank you!

@discovery-NukulSharma

We are facing this issue on Envoy 1.15, where the server is sending a FIN but the sidecar/Envoy is not passing it back to the downstream/client.
Any ideas on how to work around this? We would like to avoid setting timeouts smaller than the service's.
@alyssawilk

@alyssawilk
Contributor

Sorry, need a bit more detail to be able to help here.
I assume upstream is sending a TCP FIN, and you want the downstream connection to be FIN'd as well?
That will depend on which versions of the protocols you're using. If the upstream is framing a response by connection close, for example, the downstream connection may use chunked framing, in which case Envoy would send a terminal chunk, not a FIN.
If you can give a bit more detail I can help you sort out whether this is a bug or a "feature", and maybe how you might work around it (though FYI I may be laggy this week, as I'm partially out).
