
Envoy intermittently responds with 503 UC (upstream_reset_before_response_started{connection_termination}) #14981

Closed
aqua777 opened this issue Feb 8, 2021 · 29 comments
Labels
question Questions that are neither investigations, bugs, nor enhancements stale stalebot believes this issue/PR has not been touched recently

Comments

aqua777 commented Feb 8, 2021

Title: Envoy intermittently responds with 503 UC (upstream_reset_before_response_started{connection_termination})

Description:


We have a group of Envoy servers running as an edge proxy for a number of backend services. We configure these Envoy servers using an in-house-built xDS gRPC server. See the example configuration dump attached to this issue.

We started noticing intermittent 503 responses with response flag "UC" and upstream_reset_before_response_started{connection_termination} in the response code details. These responses appear to be produced by Envoy rather than returned by the backend services. We observed this issue while running Envoy v1.16 and v1.17.

We looked at several similar, previously reported GitHub issues.

We tried to change some configuration settings, namely:

  • enable TCP Keep-alives
  • adjust Idle Timeouts (for ex. use default values, disable idle timeout)

None of the changes above had any effect on the behavior of our Envoy fleet.
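
For reference, the kind of change we experimented with looks roughly like this (a sketch with illustrative names and values, not our exact xDS-generated configuration):

clusters:
  - name: some_backend
    connect_timeout: 5s
    type: strict_dns
    lb_policy: round_robin
    # TCP keep-alives on upstream connections
    upstream_connection_options:
      tcp_keepalive:
        keepalive_probes: 3
        keepalive_time: 30        # seconds of idle before the first probe
        keepalive_interval: 10    # seconds between probes
    # HTTP-level idle timeout for pooled upstream connections
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        common_http_protocol_options:
          idle_timeout: 60s
        explicit_http_config:
          http_protocol_options: {}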

Repro steps:
We have no repeatable method for reproducing these errors. They are very sporadic, i.e. 1-2 per 100,000 requests, and they are spread across various clusters. Some of these clusters receive a large, constant volume of requests, so it is very unlikely that their upstream connections ever become idle.

Config:
See included in tarball attached to this issue

Logs:
This issue occurs in our Production environment where we do not have debug level logging enabled.

Can you please advise:

  • why does Envoy terminate connections?
  • has this problem been reported by other Envoy users?
  • is there a possibility that requests are not distributed evenly across all backend servers, which could lead to Envoy resetting upstream connections?
  • what can we do on our side to eliminate this problem?
aqua777 added the bug and triage (Issue requires triage) labels on Feb 8, 2021

antoniovicente (Contributor) commented:

cc @alyssawilk

One possible explanation for this class of problems may be that the upstream server closes the connection as the proxy starts sending the request. It may be helpful to know how long the upstream connection was open prior to the first request being sent on it.


aqua777 commented Feb 8, 2021

> cc @alyssawilk
>
> One possible explanation for this class of problems may be that the upstream server closes the connection as the proxy starts sending the request. It may be helpful to know how long the upstream connection was open prior to the first request being sent on it.

Thanks for the possible explanation. We considered this, but there are factors that argue against it. Prior to using Envoy, our load balancing was based on hardware load balancers, and with those we never saw this behavior, i.e. connection terminations by backend servers. After switching to Envoy we started seeing the behavior described in this issue across multiple clusters, large and small, using different web servers and serving various amounts of traffic.

yanavlasov (Contributor) commented:

The error code indicates that it is not Envoy that terminated the connection but rather the upstream server (or intervening network infrastructure). Did you look at the counters? For instance, any of the upstream_cx_... error counters corresponding to disconnects?
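
For example, assuming the admin interface is enabled (port illustrative), something like:

curl -s http://localhost:9901/stats | grep upstream_cx_destroy

will show the per-cluster disconnect counters (upstream_cx_destroy, upstream_cx_destroy_remote, upstream_cx_destroy_remote_with_active_rq, ...).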


aqua777 commented Feb 9, 2021

@yanavlasov thank you for the suggestion. In our infrastructure we have restrictions on how many metrics we can send to our Prometheus servers, so we had to configure Envoy with inclusion patterns for metrics. The upstream_cx_destroy... metrics are not on that list; we will add them and provide more details shortly.
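
For anyone with a similar metrics restriction, the change amounts to roughly one more pattern in the bootstrap stats_matcher inclusion list (a sketch; our real config is generated, and the cluster name is illustrative):

stats_config:
  stats_matcher:
    inclusion_list:
      patterns:
        - prefix: "cluster.some_cluster.upstream_cx_destroy"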


aqua777 commented Feb 9, 2021

We've managed to capture one of the requests resulting in 503 response code, see the output below:

[2021-02-09 14:23:14.220][24][debug][conn_handler] [source/server/connection_handler_impl.cc:501] [C1876] new connection
[2021-02-09 14:23:14.222][24][debug][http] [source/common/http/conn_manager_impl.cc:254] [C1876] new stream
[2021-02-09 14:23:14.222][24][debug][http] [source/common/http/conn_manager_impl.cc:886] [C1876][S15748738241223205728] request headers complete (end_stream=true):
':authority', '...'
':path', '/...'
':method', 'GET'
'x-jwt', '***'
'accept', 'application/json'
'connection', 'close'

[2021-02-09 14:23:14.222][24][debug][http] [source/common/http/filter_manager.cc:755] [C1876][S15748738241223205728] request end stream
[2021-02-09 14:23:14.222][24][debug][router] [source/common/router/router.cc:425] [C1876][S15748738241223205728] cluster 'some_cluster' match for URL '/...'
[2021-02-09 14:23:14.222][24][debug][router] [source/common/router/router.cc:582] [C1876][S15748738241223205728] router decoding headers:
':authority', '...'
':path', '/...'
':method', 'GET'
':scheme', 'http'
'x-jwt', '***'
'accept', 'application/json'
'x-forwarded-for', 'x.x.x.x'
'x-forwarded-proto', 'https'
'x-envoy-internal', 'true'
'x-request-id', 'f3859bd9-2f6b-4e97-af6a-1ead24aef8a7'
'x-envoy-expected-rq-timeout-ms', '303000'

[2021-02-09 14:23:14.222][24][debug][router] [source/common/router/upstream_request.cc:354] [C1876][S15748738241223205728] pool ready
[2021-02-09 14:23:14.223][24][debug][router] [source/common/router/router.cc:1026] [C1876][S15748738241223205728] upstream reset: reset reason: connection termination, transport failure reason:
[2021-02-09 14:23:14.223][24][debug][http] [source/common/http/filter_manager.cc:839] [C1876][S15748738241223205728] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2021-02-09 14:23:14.223][24][debug][http] [source/common/http/conn_manager_impl.cc:1429] [C1876][S15748738241223205728] closing connection due to connection close header
[2021-02-09 14:23:14.223][24][debug][http] [source/common/http/conn_manager_impl.cc:1484] [C1876][S15748738241223205728] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '95'
'content-type', 'text/plain'
'date', 'Tue, 09 Feb 2021 14:23:13 GMT'
'server', 'envoy'
'connection', 'close'

[2021-02-09 14:23:14.223][24][debug][connection] [source/common/network/connection_impl.cc:132] [C1876] closing data_to_write=248 type=2
[2021-02-09 14:23:14.223][24][debug][connection] [source/common/network/connection_impl_base.cc:47] [C1876] setting delayed close timer with timeout 1000 ms
[2021-02-09 14:23:14.223][24][debug][connection] [source/common/network/connection_impl.cc:696] [C1876] write flush complete


alyssawilk commented Feb 9, 2021 via email


aqua777 commented Feb 9, 2021

@alyssawilk AFAIK, these requests are using HTTP/1.1 indeed. By upstream disconnect do you mean that in your interpretation the connection is being reset by the upstream side?

What would you recommend for capturing a packet trace? Something like tcpdump, or do you have other tools in mind?
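
If tcpdump is the way to go, I assume something like the following on the Envoy host, filtered to a single upstream endpoint (address and port are illustrative), would be enough:

tcpdump -i any -s 0 -w envoy-upstream.pcap host 10.0.0.5 and port 8080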


alyssawilk commented Feb 9, 2021 via email

antoniovicente added the area/http and triage (Issue requires triage) labels and removed the triage label on Feb 9, 2021
mattklein123 added the question (Questions that are neither investigations, bugs, nor enhancements) label and removed the area/http, bug, and triage labels on Feb 12, 2021
github-actions bot commented:

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale (stalebot believes this issue/PR has not been touched recently) label on Mar 14, 2021
github-actions bot commented:

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.


ngtuna commented May 9, 2021

@aqua777 So did you finally manage to figure out the root cause? I am hitting quite a similar issue today. It seems to be what @alyssawilk explained, but I don't know how to fix it. Any hints? Thanks!

[2021-05-10 00:37:31.782][3549452][debug][router] [source/common/router/router.cc:425] [C0][S15323405275991739010] cluster '...' match for URL '...'
[2021-05-10 00:37:31.783][3549452][debug][router] [source/common/router/router.cc:582] [C0][S15323405275991739010] router decoding headers:
':authority', '...:8443'
':path', '...'
':method', 'GET'
':scheme', 'http'
'user-agent', 'curl/7.64.1'
'accept', '*/*'
'x-forwarded-proto', 'https'
'x-request-id', 'a0fbb5a1-6680-4274-a213-2fc3184a0c71'
'x-envoy-expected-rq-timeout-ms', '15000'

[2021-05-10 00:37:31.783][3549452][debug][pool] [source/common/http/conn_pool_base.cc:79] queueing stream due to no available connections
[2021-05-10 00:37:31.783][3549452][debug][pool] [source/common/conn_pool/conn_pool_base.cc:106] creating a new connection
[2021-05-10 00:37:31.783][3549452][debug][client] [source/common/http/codec_client.cc:41] [C1] connecting
[2021-05-10 00:37:31.783][3549452][debug][connection] [source/common/network/connection_impl.cc:860] [C1] connecting to 10.28.1.4:8443
[2021-05-10 00:37:31.783][3549452][debug][connection] [source/common/network/connection_impl.cc:876] [C1] connection in progress
[2021-05-10 00:37:31.784][3549452][trace][http] [source/common/http/filter_manager.cc:498] [C0][S15323405275991739010] decode headers called: filter=0xca6da00 status=1
[2021-05-10 00:37:31.784][3549452][trace][misc] [source/common/event/scaled_range_timer_manager_impl.cc:60] enableTimer called on 0xca6d980 for 300000ms, min is 300000ms
[2021-05-10 00:37:31.784][3549452][trace][http] [source/common/http/http1/codec_impl.cc:620] [C0] parsed 155 bytes
[2021-05-10 00:37:31.784][3549452][trace][connection] [source/common/network/connection_impl.cc:548] [C0] socket event: 2
[2021-05-10 00:37:31.784][3549452][trace][connection] [source/common/network/connection_impl.cc:657] [C0] write ready
[2021-05-10 00:37:32.062][3549452][trace][connection] [source/common/network/connection_impl.cc:548] [C1] socket event: 2
[2021-05-10 00:37:32.062][3549452][trace][connection] [source/common/network/connection_impl.cc:657] [C1] write ready
[2021-05-10 00:37:32.062][3549452][debug][connection] [source/common/network/connection_impl.cc:666] [C1] connected
[2021-05-10 00:37:32.062][3549452][trace][connection] [source/common/network/connection_impl.cc:407] [C1] raising connection event 2
[2021-05-10 00:37:32.062][3549452][debug][client] [source/common/http/codec_client.cc:80] [C1] connected
[2021-05-10 00:37:32.063][3549452][debug][pool] [source/common/conn_pool/conn_pool_base.cc:225] [C1] attaching to next stream
[2021-05-10 00:37:32.063][3549452][debug][pool] [source/common/conn_pool/conn_pool_base.cc:130] [C1] creating stream
[2021-05-10 00:37:32.063][3549452][debug][router] [source/common/router/upstream_request.cc:354] [C0][S15323405275991739010] pool ready
[2021-05-10 00:37:32.063][3549452][trace][connection] [source/common/network/connection_impl.cc:465] [C1] writing 291 bytes, end_stream false
[2021-05-10 00:37:32.063][3549452][trace][connection] [source/common/network/connection_impl.cc:657] [C1] write ready
[2021-05-10 00:37:32.063][3549452][trace][connection] [source/common/network/raw_buffer_socket.cc:68] [C1] write returns: 291
[2021-05-10 00:37:32.063][3549452][trace][connection] [source/common/network/connection_impl.cc:548] [C1] socket event: 2
[2021-05-10 00:37:32.063][3549452][trace][connection] [source/common/network/connection_impl.cc:657] [C1] write ready
[2021-05-10 00:37:32.472][3549452][trace][connection] [source/common/network/connection_impl.cc:548] [C1] socket event: 3
[2021-05-10 00:37:32.472][3549452][trace][connection] [source/common/network/connection_impl.cc:657] [C1] write ready
[2021-05-10 00:37:32.472][3549452][trace][connection] [source/common/network/connection_impl.cc:586] [C1] read ready. dispatch_buffered_data=false
[2021-05-10 00:37:32.472][3549452][trace][connection] [source/common/network/raw_buffer_socket.cc:25] [C1] read returns: 0
[2021-05-10 00:37:32.472][3549452][debug][connection] [source/common/network/connection_impl.cc:634] [C1] remote close
[2021-05-10 00:37:32.472][3549452][debug][connection] [source/common/network/connection_impl.cc:241] [C1] closing socket: 0
[2021-05-10 00:37:32.472][3549452][trace][connection] [source/common/network/connection_impl.cc:407] [C1] raising connection event 0
[2021-05-10 00:37:32.472][3549452][trace][http] [source/common/http/http1/codec_impl.cc:571] [C1] parsing 0 bytes
[2021-05-10 00:37:32.472][3549452][trace][http] [source/common/http/http1/codec_impl.cc:620] [C1] parsed 0 bytes
[2021-05-10 00:37:32.472][3549452][debug][client] [source/common/http/codec_client.cc:99] [C1] disconnect. resetting 1 pending requests
[2021-05-10 00:37:32.472][3549452][debug][client] [source/common/http/codec_client.cc:125] [C1] request reset
[2021-05-10 00:37:32.472][3549452][trace][main] [source/common/event/dispatcher_impl.cc:213] item added to deferred deletion list (size=1)
[2021-05-10 00:37:32.472][3549452][debug][router] [source/common/router/router.cc:1026] [C0][S15323405275991739010] upstream reset: reset reason: connection termination, transport failure reason:
[2021-05-10 00:37:32.476][3549452][debug][http] [source/common/http/filter_manager.cc:839] [C0][S15323405275991739010] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2021-05-10 00:37:32.476][3549452][trace][misc] [source/common/event/scaled_range_timer_manager_impl.cc:60] enableTimer called on 0xca6d980 for 300000ms, min is 300000ms
[2021-05-10 00:37:32.476][3549452][debug][http] [source/common/http/conn_manager_impl.cc:1484] [C0][S15323405275991739010] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '95'
'content-type', 'text/plain'
'date', 'Sun, 09 May 2021 17:37:32 GMT'
'server', 'envoy'


ngtuna commented May 9, 2021

This is the line that confuses me:

[2021-05-10 00:37:32.476][3549452][debug][http] [source/common/http/filter_manager.cc:839] [C0][S15323405275991739010] Sending local reply with details upstream_reset_before_response_started{connection termination}

alyssawilk (Contributor) commented:

From https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_conn_man/response_code_details:

The upstream connection was reset before a response was started. This may include further details about the cause of the disconnect.

So theoretically your upstream is closing the connection before it sends any part of the response.

If there's areas we can improve docs here to make it easier to understand, please let us know!

elias4ty commented:

I encountered this problem too! My upstream server is nginx, and the nginx log entry for the request from envoy is: "PRI * HTTP/2.0" 400 157 "-" "-" "-". It is so weird: envoy sends the request to nginx over HTTP/1.1, so why does nginx treat it as an HTTP/2.0 or gRPC request?


ivpr commented Apr 24, 2022

@aqua777 Were you able to identify the problem? We are facing a similar issue in our production.
@alyssawilk Is configuring envoy to retry on reset the right way to work around the issue?


rajetta commented May 5, 2022

@ivpr we are getting a similar error after updating envoy to 1.17 - in our case it happens for every request. Did configuring retry on reset help?

jbilliau-rcd commented:

We are seeing this as well: 3 requests out of 24,000+ had a 503 UC error, and we are not sure why. Pods are all healthy, have been up for days, and resource usage is low... anyone figure this out?

benedikt-roth commented:

We are also facing this issue and have no idea why. All pods are up and worked fine in the past. After the Envoy integration this changed, and we now see the errors described above.


aqua777 commented Jul 11, 2022

We ended up configuring Envoy to retry a request once when the connection to the backend server is reset. This eliminated the majority of the errors, although we still get some.


cysnet commented Jul 26, 2022

@aqua777 why not paste your fix here to help other users?


aqua777 commented Jul 27, 2022

We just followed Envoy documentation on Retry policies:

match:
  prefix: /some/url/prefix
route:
  retry_policy:
    retry_on: reset

You can also specify num_retries, which defaults to 1; that is what we wanted, so we didn't add it explicitly to the configuration.
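
For completeness, the same policy with the optional fields spelled out might look roughly like this (values illustrative, not what we run in production):

route:
  retry_policy:
    retry_on: reset,connect-failure
    num_retries: 2
    per_try_timeout: 5s

Keep in mind that retrying on reset will also re-send non-idempotent requests (e.g. POST) whose connection was dropped mid-request, so consider which routes you apply it to.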

xiexianbin commented:

I also hit this error, and found expired-certificate TLS logs in istio-ingressgateway:

2022-08-31T12:40:11.890424Z	debug	envoy pool	[C18719] client disconnected, failure reason: TLS error: 268436501:SSL routines:OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_EXPIRED 33554464:system library:OPENSSL_internal:Broken pipe

So I deleted the ingressgateway pod (kubectl -n istio-system delete pod/istio-ingressgateway-xxx), and retrying afterwards works.
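
If someone wants to confirm an expired certificate on an upstream endpoint before restarting anything, a quick check might be (hostname and port are illustrative):

openssl s_client -connect upstream.example.com:443 -servername upstream.example.com </dev/null 2>/dev/null | openssl x509 -noout -dates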

chuongnguyen5397 commented:

I'm facing the same issue, do we have any conclusion about the reason?

Here is my config:

  static_resources:
    listeners:
      - name: listener
        address:
          socket_address:
            address: 0.0.0.0
            port_value: 10000
        filter_chains:
          - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config: 
                '@type': "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager"
                stat_prefix: http_proxy
                access_log:
                  - name: envoy.access_loggers.stdout
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
                route_config: 
                  name: all
                  virtual_hosts:
                    - name: local_service
                      domains: 
                        - '*'
                      routes:
                        - match: { prefix: "/api/v1/intranet"}
                          route:
                            prefix_rewrite: "/api/v1"
                            host_rewrite_literal: "random.execute-api.ap-southeast-1.amazonaws.com"
                            cluster: allbackend_cluster
                http_filters:
                    - name: envoy.filters.http.router
    clusters:
      - name: allbackend_cluster
        connect_timeout: 10s
        type: strict_dns
        lb_policy: round_robin
        load_assignment:
          cluster_name: allbackend_cluster
          endpoints:
            - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: random.execute-api.ap-southeast-1.amazonaws.com
                      port_value: 443
        transport_socket:
          name: envoy.transport_sockets.tls
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext

And the logs are showing

[2022-09-05 14:28:52.796][19][debug][router] [source/common/router/router.cc:1091] [C20][S2573947341148914084] upstream reset: reset reason: connection termination, transport failure reason: 
[2022-09-05 14:28:52.796][19][debug][http] [source/common/http/filter_manager.cc:883] [C20][S2573947341148914084] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2022-09-05 14:28:52.796][19][debug][http] [source/common/http/conn_manager_impl.cc:1455] [C20][S2573947341148914084] encoding headers via codec (end_stream=false):


cyanogen12 commented Nov 30, 2022

Hi All,

I am also facing a similar issue with Envoy version 1.24.
I am running envoy on my local machine (MacBook Pro).

I have created three upstream clusters:
Clusters 1 & 2 are hosted locally on the same machine as https endpoints, and they work well.

For the third cluster, "httpbin", I am using the public service https://httpbin.org/ip, and this also works well.

There is another public https API which I want to consume; it always fails through Envoy but works well when invoked directly from the terminal/Postman. Cluster: apigee_cluster.

Scenario 1: When I force http1 on the cluster using this configuration, I get a 404 response although the endpoint actually exists:
explicit_http_config:
  http_protocol_options: {}
Endpoint I am trying to invoke: https://a623954-eval-test.apigee.net/ip
Refer to log file: Envoy_Failure_http1.log

Scenario 2: When I force http2 on the cluster using this configuration, I get a 502 UPE response although the endpoint actually exists:
explicit_http_config:
  http2_protocol_options: {}
Endpoint I am trying to invoke: https://a623954-eval-test.apigee.net/ip
Refer to log file: Envoy_Failure_http2.log

Scenario 3: When I invoke the endpoint I am really interested in, https://stage.api.fil.com/ip, with the http1 option, I get a 503 UC error.
Refer to log file: Envoy_Failure_stage-api-fil-com_http1.log

Scenario 4: When I invoke the endpoint https://stage.api.fil.com/ip with the http2 option, I get a 502 UPE error.
Refer to log file: Envoy_Failure_stage-api-fil-com_http2.log

Envoy Config File:
# This envoy setup TLS at listener as well as Cluster
# command to generate self signed certificate:
# openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 365 -nodes -subj "/C=IN/ST=Haryana/L=Gurugram/O=Vivek/OU=Envoy/CN=localhost"
static_resources:
  listeners:
    - address:
        socket_address:
          address: 0.0.0.0
          port_value: 443
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                access_log:
                  - name: envoy.access_loggers.stdout
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
                      log_format:
                        text_format: "[%START_TIME%] %REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL% %UPSTREAM_CLUSTER% %UPSTREAM_PROTOCOL% %UPSTREAM_TLS_VERSION% %UPSTREAM_TLS_CIPHER% %UPSTREAM_TRANSPORT_FAILURE_REASON% %RESPONSE_CODE% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% %REQ(X-FORWARDED-FOR)% %REQ(USER-AGENT)% %REQ(X-REQUEST-ID)% %REQ(:AUTHORITY)% %UPSTREAM_REQUEST_ATTEMPT_COUNT% %UPSTREAM_LOCAL_ADDRESS% %UPSTREAM_REMOTE_ADDRESS% \n"
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                route_config:
                  name: all
                  virtual_hosts:
                    - name: allbackend_cluster
                      domains: ["localhost"]
                      routes:
                        - match:
                            prefix: /ip
                            headers:
                              - name: "Route"
                                exact_match: "httpbin"
                          route:
                            cluster: httpbin_cluster
                        - match:
                            prefix: /ip
                            headers:
                              - name: "Route"
                                exact_match: "apigee"
                          route:
                            cluster: apigee_cluster
                        - match:
                            prefix: /app1
                          route:
                            cluster: app1_cluster
                        - match:
                            prefix: /app2
                          route:
                            cluster: app2_cluster
                        - match:
                            prefix: /
                          route:
                            cluster: allbackend_cluster
          transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              common_tls_context:
                tls_certificates:
                  - certificate_chain: {filename: "vivek_cert.pem"}
                    private_key: {filename: "vivek_key.pem"}

  clusters:
    - name: app1_cluster
      connect_timeout: 1s
      type: strict_dns
      lb_policy: round_robin
      load_assignment:
        cluster_name: app1_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: localhost
                      port_value: 8000
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext

    - name: app2_cluster
      connect_timeout: 1s
      type: strict_dns
      lb_policy: round_robin
      load_assignment:
        cluster_name: app2_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: localhost
                      port_value: 8100
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext

    - name: httpbin_cluster
      connect_timeout: 1s
      type: strict_dns
      lb_policy: round_robin
      load_assignment:
        cluster_name: httpbin_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: httpbin.org
                      port_value: 443
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext

    - name: apigee_cluster
      connect_timeout: 10s
      type: strict_dns
      lb_policy: round_robin
      max_requests_per_connection: 1
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          upstream_http_protocol_options:
            auto_sni: true
          common_http_protocol_options:
            idle_timeout: 10s
          explicit_http_config:
            http2_protocol_options: {}
      load_assignment:
        cluster_name: apigee_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: a623954-eval-test.apigee.net
                      port_value: 443
      transport_socket:
        name: envoy.transport_sockets.tap
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tap.v3.Tap
          common_config:
            static_config:
              match_config:
                any_match: true
              output_config:
                sinks:
                  - format: PROTO_BINARY
                    file_per_tap:
                      path_prefix: /Users/admin/Work/envoy_wrkspace/Envoy_Tutorial
          transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
              common_tls_context:
                tls_params:
                  tls_maximum_protocol_version: TLSv1_3
                  tls_minimum_protocol_version: TLSv1_2

    - name: allbackend_cluster
      connect_timeout: 1s
      type: strict_dns
      lb_policy: round_robin
      load_assignment:
        cluster_name: allbackend_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: localhost
                      port_value: 3000
              - endpoint:
                  address:
                    socket_address:
                      address: localhost
                      port_value: 3111
              - endpoint:
                  address:
                    socket_address:
                      address: localhost
                      port_value: 3222
              - endpoint:
                  address:
                    socket_address:
                      address: localhost
                      port_value: 3333

admin:
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901


marniks7 commented Dec 10, 2022

Hi, we are also facing this 503 UC issue during high load in production, and can reproduce it in our load-testing environment at a rate of about 13 out of 100k requests.

  • Envoy v1.20.7
  • Downstream and Upstream are both spring boot java apps

We are going to implement the retry policy suggested above, but maybe this information helps with finding the root cause or an alternative solution.

Logs

no tcpdump gathered so far

[2022-12-02 18:08:52.760][23][debug][http] [source/common/http/conn_manager_impl.cc:274] [C4381] new stream
[2022-12-02 18:08:52.760][23][debug][http] [source/common/http/conn_manager_impl.cc:867] [C4381][S2002065306070302618] request headers complete (end_stream=false):
':authority', 'xxx:8080'
':path', '/api/v1/xxx'
':method', 'POST'
'accept', 'application/json'
'content-type', 'application/json'
'authorization', 'Bearer xxx'
'tenant', '76356188-e2de-4e7f-9628-3b637dab3960'
'x-request-id', '69b7c8f56d0dce335c07c6e6f4b01e12'
'x-b3-traceid', 'a63dd122acf1982e'
'x-b3-spanid', '620c330fe2028275'
'x-b3-parentspanid', '07dec65c20d7d27e'
'x-b3-sampled', '1'
'accept-encoding', 'gzip, deflate'
'content-length', '3301'
'connection', 'Keep-Alive'
'user-agent', 'Apache-HttpClient/4.5.13 (Java/17.0.4.1)'

[2022-12-02 18:08:52.760][23][debug][router] [source/common/router/router.cc:457] [C4381][S2002065306070302618] cluster 'xxx' match for URL '/api/v1/xxx'
[2022-12-02 18:08:52.761][23][debug][router] [source/common/router/router.cc:673] [C4381][S2002065306070302618] router decoding headers:
':authority', 'xxx:8080'
':path', '/api/v1/xxx'
':method', 'POST'
':scheme', 'http'
'accept', 'application/json'
'content-type', 'application/json'
'authorization', 'Bearer xxx'
'tenant', '76356188-e2de-4e7f-9628-3b637dab3960'
'x-request-id', '69b7c8f56d0dce335c07c6e6f4b01e12'
'x-b3-traceid', 'a63dd122acf1982e'
'x-b3-spanid', '620c330fe2028275'
'x-b3-parentspanid', '07dec65c20d7d27e'
'x-b3-sampled', '1'
'accept-encoding', 'gzip, deflate'
'content-length', '3301'
'user-agent', 'Apache-HttpClient/4.5.13 (Java/17.0.4.1)'
'x-forwarded-proto', 'http'
'x-envoy-expected-rq-timeout-ms', '120000'

[2022-12-02 18:08:52.761][23][debug][pool] [source/common/http/conn_pool_base.cc:74] queueing stream due to no available connections
[2022-12-02 18:08:52.761][23][debug][pool] [source/common/conn_pool/conn_pool_base.cc:256] trying to create new connection
[2022-12-02 18:08:52.761][23][debug][pool] [source/common/conn_pool/conn_pool_base.cc:144] creating a new connection
[2022-12-02 18:08:52.761][23][debug][client] [source/common/http/codec_client.cc:60] [C4599] connecting
[2022-12-02 18:08:52.761][23][debug][connection] [source/common/network/connection_impl.cc:890] [C4599] connecting to 10.101.11.122:8080
[2022-12-02 18:08:52.761][23][debug][connection] [source/common/network/connection_impl.cc:909] [C4599] connection in progress
[2022-12-02 18:08:52.761][23][debug][http] [source/common/http/filter_manager.cc:841] [C4381][S2002065306070302618] request end stream
[2022-12-02 18:08:52.761][23][debug][connection] [source/common/network/connection_impl.cc:672] [C4599] connected
[2022-12-02 18:08:52.761][23][debug][client] [source/common/http/codec_client.cc:88] [C4599] connected
[2022-12-02 18:08:52.761][23][debug][pool] [source/common/conn_pool/conn_pool_base.cc:294] [C4599] attaching to next stream
[2022-12-02 18:08:52.761][23][debug][pool] [source/common/conn_pool/conn_pool_base.cc:177] [C4599] creating stream
[2022-12-02 18:08:52.761][23][debug][router] [source/common/router/upstream_request.cc:416] [C4381][S2002065306070302618] pool ready
[2022-12-02 18:08:52.936][25][debug][connection] [source/common/network/connection_impl.cc:640] [C4578] remote close
[2022-12-02 18:08:52.936][25][debug][connection] [source/common/network/connection_impl.cc:249] [C4578] closing socket: 0
[2022-12-02 18:08:52.936][25][debug][client] [source/common/http/codec_client.cc:106] [C4578] disconnect. resetting 1 pending requests
[2022-12-02 18:08:52.936][25][debug][client] [source/common/http/codec_client.cc:142] [C4578] request reset
[2022-12-02 18:08:52.936][25][debug][router] [source/common/router/router.cc:1130] [C4250][S14424083748194906660] upstream reset: reset reason: connection termination, transport failure reason:
[2022-12-02 18:08:52.936][28][debug][connection] [source/common/network/connection_impl.cc:640] [C4587] remote close
[2022-12-02 18:08:52.936][28][debug][connection] [source/common/network/connection_impl.cc:249] [C4587] closing socket: 0
[2022-12-02 18:08:52.936][23][debug][connection] [source/common/network/connection_impl.cc:640] [C4561] remote close
[2022-12-02 18:08:52.936][23][debug][connection] [source/common/network/connection_impl.cc:249] [C4561] closing socket: 0
[2022-12-02 18:08:52.936][28][debug][client] [source/common/http/codec_client.cc:106] [C4587] disconnect. resetting 1 pending requests
[2022-12-02 18:08:52.936][28][debug][client] [source/common/http/codec_client.cc:142] [C4587] request reset
[2022-12-02 18:08:52.936][28][debug][router] [source/common/router/router.cc:1130] [C4166][S8149778245575148947] upstream reset: reset reason: connection termination, transport failure reason:
[2022-12-02 18:08:52.936][25][debug][http] [source/common/http/filter_manager.cc:953] [C4250][S14424083748194906660] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2022-12-02 18:08:52.936][23][debug][client] [source/common/http/codec_client.cc:106] [C4561] disconnect. resetting 1 pending requests
[2022-12-02 18:08:52.936][23][debug][client] [source/common/http/codec_client.cc:142] [C4561] request reset
[2022-12-02 18:08:52.936][23][debug][router] [source/common/router/router.cc:1130] [C4242][S16814319887871412706] upstream reset: reset reason: connection termination, transport failure reason:
[2022-12-02 18:08:52.936][25][debug][http] [source/common/http/conn_manager_impl.cc:1467] [C4250][S14424083748194906660] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '95'
'content-type', 'text/plain'
'date', 'Fri, 02 Dec 2022 18:08:52 GMT'

[2022-12-02 18:08:52.936][23][debug][http] [source/common/http/filter_manager.cc:953] [C4242][S16814319887871412706] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2022-12-02 18:08:52.936][28][debug][http] [source/common/http/filter_manager.cc:953] [C4166][S8149778245575148947] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2022-12-02 18:08:52.936][23][debug][http] [source/common/http/conn_manager_impl.cc:1467] [C4242][S16814319887871412706] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '95'
'content-type', 'text/plain'
'date', 'Fri, 02 Dec 2022 18:08:52 GMT'

Observation

It looks like the connection sits idle in the connection pool; sometimes it stays unused for 60s, and when it is then picked up for a request, the request fails with this error.
The question is: where is this 60s timeout configured? On the upstream side? I was not able to find it there yet...
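
One thing we may try (a sketch, assuming the 60s comes from the upstream application's keep-alive/idle timeout rather than from Envoy itself): set the cluster's upstream idle timeout below the backend's value, so Envoy retires pooled connections before the server can close them while a request is in flight:

typed_extension_protocol_options:
  envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
    "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
    common_http_protocol_options:
      idle_timeout: 50s   # keep this below the backend's own keep-alive timeout
    explicit_http_config:
      http_protocol_options: {}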

kurktchiev commented:

anyone find anything on this issue?

iangelov commented:

It's been a while since I've last dealt with this, but my case ultimately had to do with mismatched timeout settings, with Envoy having a higher default idle timeout value than one of the backend applications in an Istio mesh. Envoy would thus try and keep the connection open for a longer amount of time than the backend application's settings allow, and the backend thus severed the connection once the timeout value was reached. The solution was to add DestinationRule resources which overrode the default idle timeout on the Envoy side of things. You could, of course, also adjust the backend application's settings if it's practical/appropriate.

In case it might be useful, I figured this out by running tcpdump and noticing that the 503s were returned after exactly n seconds on every instance. From there on, it was just a matter of checking the backend and Envoy's timeout values, confirming the match and ultimately fixing the issue.
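
For the Istio case, the override we added looked roughly like this (a sketch; the host name and timeout value are illustrative):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: backend-idle-timeout
spec:
  host: backend.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        idleTimeout: 30s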


triplewy commented Aug 15, 2023

@marniks7 Did you ever find out where the 60 second timeout comes from? I am seeing that same 60 second timeout error even though I set idle-timeout to 30 seconds and request timeout to 15 seconds. I also found that this issue only occurs with TCP downstream and not QUIC downstream despite the issue being caused by upstream.
