
Envoy (re)uses connection after receiving FIN from upstream #6815

Closed
lvgilmore opened this issue May 6, 2019 · 14 comments
Labels: question (Questions that are neither investigations, bugs, nor enhancements), stale (stalebot believes this issue/PR has not been touched recently)

Comments

@lvgilmore

Description:
With Envoy serving as an HTTP/1.1 proxy, Envoy sometimes tries to reuse a connection even after receiving a FIN from the upstream. In production I saw this issue even with a couple of seconds between the FIN and the next request, and Envoy never returned a FIN ACK (just a FIN from the upstream to Envoy, then a PUSH with a new HTTP request from Envoy to the upstream). Envoy then returns a 503 UC even though the upstream is up and operational.

The repro steps below show a slightly modified version of this issue: to encourage Envoy to reuse a connection, I set the circuit breakers to low max_connections and max_requests. Instead of returning UO, Envoy tries to reuse a connection that has received a FIN, so we get a 503 UC.

Repro steps:
The upstream is a simple web server that sleeps 2s and then returns a small JSON response, running as a Docker container on the same network as Envoy. Envoy runs as a container as well, using version
8c8b068/1.11.0-dev/Clean/RELEASE/BoringSSL
The client is just cURL, run roughly 7 times concurrently.
We can see the error:
[screenshot: curl output showing the error]
In logs it looks like

[2019-05-05T08:44:21.926Z] "GET /6 HTTP/1.1" 503 UO 0 81 1 - "-" "curl/7.54.0" "bd5c20e6-10a4-4654-81fb-c2ed46d175fe" "localhost:9080" "-"
[2019-05-05T08:44:21.909Z] "GET /shit HTTP/1.1" 200 - 0 16 2007 2004 "-" "curl/7.54.0" "1bbcaca7-8b33-43fb-9cf2-ef26841b5321" "localhost:9080" "172.22.0.3:8080"
[2019-05-05T08:44:21.915Z] "GET /3 HTTP/1.1" 200 - 0 16 2004 2004 "-" "curl/7.54.0" "7d19bc5f-a267-43c3-b3c0-6a042b24a9f5" "localhost:9080" "172.22.0.3:8080"
[2019-05-05T08:44:21.917Z] "GET /1 HTTP/1.1" 503 UC 0 95 2003 - "-" "curl/7.54.0" "23764aa2-2f3e-4ea0-bfed-dd0bb9b977b3" "localhost:9080" "172.22.0.3:8080"
[2019-05-05T08:44:21.920Z] "GET /2 HTTP/1.1" 200 - 0 16 2005 2004 "-" "curl/7.54.0" "10a76e91-cdf0-4560-892c-4b64c8462202" "localhost:9080" "172.22.0.3:8080"
[2019-05-05T08:44:21.925Z] "GET /5 HTTP/1.1" 200 - 0 16 4010 4008 "-" "curl/7.54.0" "214df7d2-da2b-4f71-ad4e-f4c02c30d7d8" "localhost:9080" "172.22.0.3:8080"
[2019-05-05T08:44:21.926Z] "GET /4 HTTP/1.1" 200 - 0 16 6017 6017 "-" "curl/7.54.0" "a98e7bdd-6de7-4165-bea1-0f248cf55f9f" "localhost:9080" "172.22.0.3:8080"

Of most interest is the 4th line: "GET /1 HTTP/1.1" 503 UC.

Config:

admin:
  access_log_path: /dev/stdout
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9000

node:
  cluster: mycluster
  id: test-id

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 9080 }
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          access_log:
            config: {path: /dev/stdout}
            name: envoy.file_access_log
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: upstream }
          http_filters:
          - name: envoy.router

  clusters:
  - name: upstream
    type: STRICT_DNS
    connect_timeout: 0.25s
    hosts: [{ socket_address: { address: 'upstream' , port_value: 8080 }}]
    circuit_breakers:
      thresholds:
      - {max_connections: 2, max_pending_requests: 3, max_requests: 4,
        max_retries: 0, priority: default}

Logs:
trace.log

@alyssawilk
Contributor

I'm a bit confused by this trace. I'm reading "Upstream sends Envoy a GET" and "Envoy sends the upstream 1 200/OK". Do you have the names inverted in your trace?

Assuming so, I believe there's always going to be a race condition (helped by but not entirely fixed by #2871) where Envoy may reuse the connection before processing the FIN. In your case it looks like you're sending HTTP/1.0 responses where the connection should not by default be reused. You could file a feature request for better upstream HTTP/1.0 handling, or you could configure retries to handle the race.
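For reference, a minimal sketch of the retry approach applied to the route config from the original report; the retry_on conditions and num_retries here are illustrative, not a confirmed fix, and the circuit breaker's max_retries: 0 in that config would likely also need to be raised for retries to be admitted:

routes:
- match: { prefix: "/" }
  route:
    cluster: upstream
    retry_policy:
      # retry when the upstream connection is reset or the connect fails,
      # which covers the 503 UC race described above
      retry_on: "reset,connect-failure"
      num_retries: 1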

@alyssawilk added the "question" label on May 6, 2019
@stale

stale bot commented Jun 5, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

The stale bot added the "stale" label on Jun 5, 2019
@stale

stale bot commented Jun 12, 2019

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

The stale bot closed this issue as completed on Jun 12, 2019
@winguse
Contributor

winguse commented Mar 16, 2020

I observed this issue in our production recently.

[screenshot: packet capture of the Envoy–node.js connection]

We are using an older version, v1.9.0, and the backend is a node.js service.

As you can see from the packets, node.js sends a FIN and Envoy acks the FIN, but Envoy still sends a new HTTP request on that connection, and node.js responds with an RST, terminating the request.

Envoy's downstream then sees "upstream connect error or disconnect/reset before headers" (HTTP 503).

Our cluster config is simple:

name: local_cluster
connect_timeout: 15s
type: STATIC
lb_policy: ROUND_ROBIN
hosts:
- socket_address:
    address: 127.0.0.1
    port_value: 15225
health_checks:
- timeout: 5s
  interval: 10s
  healthy_threshold: 1
  unhealthy_threshold: 2
  reuse_connection: true
  http_health_check:
    path: /api/health

I believe this is a race condition.

@winguse
Contributor

winguse commented Mar 16, 2020

We worked around this by adding:

common_http_protocol_options:
  # node.js is going to close the connection when idle 5s,
  # we set it to 4s to let envoy close the conn early
  idle_timeout: 4s 
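
For context, a sketch of where this sits in the cluster definition from my earlier comment (only the common_http_protocol_options block is new; health_checks omitted for brevity):

name: local_cluster
connect_timeout: 15s
type: STATIC
lb_policy: ROUND_ROBIN
common_http_protocol_options:
  # node.js closes idle keep-alive connections after 5s,
  # so keep Envoy's idle timeout below that
  idle_timeout: 4s
hosts:
- socket_address:
    address: 127.0.0.1
    port_value: 15225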

But I believe this should be fixed permanently in Envoy.

@winguse
Contributor

winguse commented Mar 17, 2020

Hi @alyssawilk, could you review whether this issue still looks valid to you?

@alyssawilk
Contributor

Hey @winguse
Assuming I'm reading this right, I see your node.js sending a FIN and Envoy using that connection. There's no indication that Envoy had gotten around to reading the FIN before it posted to the open connection, though, so I'm not sure why this would be anything but the race previously discussed, unless I'm missing something?
Any time your upstream might time out idle connections before your downstream does, you can run into race conditions like this. I think in this case adjusting the timeouts to avoid the race makes sense, and I'm glad it works for you.

@winguse
Contributor

winguse commented Mar 23, 2020

Thanks @alyssawilk

Any time your upstream might time out idle connections before your downstream you can run into race conditions like this

As we can see in the screenshot, Envoy had acked the FIN and then sent the request again 1.4 seconds later, so I guess this race condition should be avoidable. Does that seem valid to you?

On the other hand, in our case, we find this happens quite frequently.

@alyssawilk
Contributor

Acked what FIN? I don't see the upstream sending a FIN. I had assumed Envoy initiated the connection close - it's possible for a close to be FIN + ack-of-prior-data. From the trace I'm not seeing this being [FIN + ack-of-peer-fin].

What I see is Envoy sending an HTTP/1.0 response, without keep-alive, which means the connection will be implicitly closed. It then closes the connection with a FIN and the peer attempts to send an illegal follow-up request (/1) and gets RST.

It's not clear to me why Envoy is sending a 1.0 response to a 1.1 request, or why your Envoy is acting as a server to an upstream client (did you mean downstream?), but I don't see anything problematic at the TCP level. I think your peer needs to understand not to reuse a TCP connection if it gets an HTTP/1.0 response without "connection: keep-alive".

Are you using an older version of Envoy? There was a bug with HTTP/1.0 processing where it used to latch "speaking HTTP/1.0" and that caused some confusion, which might be causing the downgrade behavior I observe, and might clear up your problem if your peer isn't paying close attention to the version it's getting back.

@winguse
Contributor

winguse commented Mar 25, 2020

@alyssawilk sorry, my bad for not describing my situation clearly.

See my #6815 (comment), the last 4 packets:

[screenshot: last four packets of the capture]

Envoy did ack the FIN but started to send another request 1.4 seconds later.

And in my case, Envoy <--> Node.js are speaking HTTP/1.1 to each other.

This issue was found in 1.9; I am not sure if it still happens with the latest build of Envoy.

@alyssawilk
Contributor

I see Envoy sending [FIN ACK]. That does not mean it's acking a fin. It can be sending a fin and acking prior data. I do not see node.js sending a FIN, so I believe the [FIN, ACK] is [FIN+prior-data-ack] not [FIN + ack-of-peer-fin] as I said above.

If you're running 1.9 you are almost certainly encountering this:
#7464
Envoy is closing the connection (as I read the trace) because it is speaking HTTP/1.0. If you update, it should fix your problem!

@winguse
Contributor

winguse commented Mar 26, 2020

A few notes:

I see Envoy sending [FIN ACK]. That does not mean it's acking a fin. It can be sending a fin and acking prior data. I do not see node.js sending a FIN, so I believe the [FIN, ACK] is [FIN+prior-data-ack] not [FIN + ack-of-peer-fin] as I said above.

In #6815 (comment), Node.js is listening on port 15225 and Envoy initiated the connection from port 45146, so the [FIN, ACK] was sent by Node.js.

In any case, once we have upgraded to the latest Envoy, I will let you know if we are still seeing this problem.

Thank you!

@discovery-NukulSharma

We are facing this issue on Envoy 1.15, where the server is sending a FIN but the sidecar/Envoy is not passing it back to the downstream/client.
Any ideas on how to work around this? We would like to avoid setting timeouts smaller than the service's.
@alyssawilk

@alyssawilk
Contributor

Sorry, need a bit more detail to be able to help here.
I assume upstream is sending a TCP FIN, and you want the downstream connection to be FIN'd as well?
That will depend on which versions of the protocols you're using. If the upstream is framing a response by connection close, for example, the downstream connection may use chunked framing, in which case Envoy would send a terminal chunk, not a FIN.
If you can give a bit more detail I can help you sort out whether this is a bug or a "feature", and maybe how you might work around it (though FYI I may be laggy this week, as I'm partially out).
