
Envoy intermittently responds with 503 UC (upstream_reset_before_response_started{connection_termination}) #14981

Closed
aqua777 opened this issue Feb 8, 2021 · 29 comments
Labels
question Questions that are neither investigations, bugs, nor enhancements stale stalebot believes this issue/PR has not been touched recently

Comments

aqua777 commented Feb 8, 2021

Title: Envoy intermittently responds with 503 UC (upstream_reset_before_response_started{connection_termination})

Description:


We have a group of Envoy servers running as an edge proxy for a number of backend services. We configure these Envoy servers using an in-house-built xDS gRPC server. See the example configuration dump attached to this issue.

We started noticing intermittent 503 responses with response flag "UC" and upstream_reset_before_response_started{connection_termination} in the response code details. These responses appear to be produced by Envoy rather than returned by the backend services. We observed this issue while running Envoy v1.16 and v1.17.

We looked at several similar, previously reported GitHub issues.

We tried to change some configuration settings, namely:

  • enable TCP Keep-alives
  • adjust Idle Timeouts (for ex. use default values, disable idle timeout)

None of the changes above had any effect on the behavior of our Envoy fleet.
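
For reference, the kind of change we experimented with looks roughly like this (a sketch with illustrative names and values, not our exact xDS-generated configuration):

clusters:
  - name: some_backend
    connect_timeout: 5s
    type: strict_dns
    lb_policy: round_robin
    # TCP keep-alives on upstream connections
    upstream_connection_options:
      tcp_keepalive:
        keepalive_probes: 3
        keepalive_time: 30        # seconds of idle before the first probe
        keepalive_interval: 10    # seconds between probes
    # HTTP-level idle timeout for pooled upstream connections
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        common_http_protocol_options:
          idle_timeout: 60s
        explicit_http_config:
          http_protocol_options: {}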

Repro steps:
We have no repeatable method for reproducing these errors. They are very sporadic, i.e. 1-2 per 100,000 requests, and they are spread across various clusters. Some of these clusters receive a large, constant volume of requests, so it is very unlikely that their upstream connections ever become idle.

Config:
See included in tarball attached to this issue

Logs:
This issue occurs in our Production environment where we do not have debug level logging enabled.

Can you please advise:

  • why does Envoy terminate connections?
  • has this problem been reported by other Envoy users?
  • is there a possibility that requests are not distributed evenly across all backend servers, which could lead to Envoy resetting upstream connections?
  • what can we do on our side to eliminate this problem?
aqua777 added the bug and triage (Issue requires triage) labels on Feb 8, 2021

antoniovicente (Contributor) commented:

cc @alyssawilk

One possible explanation for this class of problems may be that the upstream server closes the connection as the proxy starts sending the request. It may be helpful to know how long the upstream connection was open prior to the first request being sent on it.


aqua777 commented Feb 8, 2021

> cc @alyssawilk
>
> One possible explanation for this class of problems may be that the upstream server closes the connection as the proxy starts sending the request. It may be helpful to know how long the upstream connection was open prior to the first request being sent on it.

Thanks for the possible explanation. We considered this, but there are factors that argue against it. Prior to using Envoy, our load balancing was based on hardware load balancers, and with those we never saw this behavior, i.e. connection terminations by backend servers. After switching to Envoy we started seeing the behavior described in this issue across multiple clusters, large and small, using different web servers and serving various amounts of traffic.

yanavlasov (Contributor) commented:

The error code indicates that it is not Envoy that terminated the connection but rather the upstream server (or intervening network infrastructure). Did you look at the counters? For instance, any of the upstream_cx_... error counters corresponding to disconnects?
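
For example, assuming the admin interface is enabled (port illustrative), something like:

curl -s http://localhost:9901/stats | grep upstream_cx_destroy

will show the per-cluster disconnect counters (upstream_cx_destroy, upstream_cx_destroy_remote, upstream_cx_destroy_remote_with_active_rq, ...).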


aqua777 commented Feb 9, 2021

@yanavlasov thank you for the suggestion. In our infrastructure we have restrictions on how many metrics we can send to our Prometheus servers, so we had to configure Envoy with inclusion patterns for metrics. The upstream_cx_destroy... metrics are not on that list; we will add them and provide more details shortly.
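
For anyone with a similar metrics restriction, the change amounts to roughly one more pattern in the bootstrap stats_matcher inclusion list (a sketch; our real config is generated, and the cluster name is illustrative):

stats_config:
  stats_matcher:
    inclusion_list:
      patterns:
        - prefix: "cluster.some_cluster.upstream_cx_destroy"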


aqua777 commented Feb 9, 2021

We've managed to capture one of the requests resulting in 503 response code, see the output below:

[2021-02-09 14:23:14.220][24][debug][conn_handler] [source/server/connection_handler_impl.cc:501] [C1876] new connection
[2021-02-09 14:23:14.222][24][debug][http] [source/common/http/conn_manager_impl.cc:254] [C1876] new stream
[2021-02-09 14:23:14.222][24][debug][http] [source/common/http/conn_manager_impl.cc:886] [C1876][S15748738241223205728] request headers complete (end_stream=true):
':authority', '...'
':path', '/...'
':method', 'GET'
'x-jwt', '***'
'accept', 'application/json'
'connection', 'close'

[2021-02-09 14:23:14.222][24][debug][http] [source/common/http/filter_manager.cc:755] [C1876][S15748738241223205728] request end stream
[2021-02-09 14:23:14.222][24][debug][router] [source/common/router/router.cc:425] [C1876][S15748738241223205728] cluster 'some_cluster' match for URL '/...'
[2021-02-09 14:23:14.222][24][debug][router] [source/common/router/router.cc:582] [C1876][S15748738241223205728] router decoding headers:
':authority', '...'
':path', '/...'
':method', 'GET'
':scheme', 'http'
'x-jwt', '***'
'accept', 'application/json'
'x-forwarded-for', 'x.x.x.x'
'x-forwarded-proto', 'https'
'x-envoy-internal', 'true'
'x-request-id', 'f3859bd9-2f6b-4e97-af6a-1ead24aef8a7'
'x-envoy-expected-rq-timeout-ms', '303000'

[2021-02-09 14:23:14.222][24][debug][router] [source/common/router/upstream_request.cc:354] [C1876][S15748738241223205728] pool ready
[2021-02-09 14:23:14.223][24][debug][router] [source/common/router/router.cc:1026] [C1876][S15748738241223205728] upstream reset: reset reason: connection termination, transport failure reason:
[2021-02-09 14:23:14.223][24][debug][http] [source/common/http/filter_manager.cc:839] [C1876][S15748738241223205728] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2021-02-09 14:23:14.223][24][debug][http] [source/common/http/conn_manager_impl.cc:1429] [C1876][S15748738241223205728] closing connection due to connection close header
[2021-02-09 14:23:14.223][24][debug][http] [source/common/http/conn_manager_impl.cc:1484] [C1876][S15748738241223205728] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '95'
'content-type', 'text/plain'
'date', 'Tue, 09 Feb 2021 14:23:13 GMT'
'server', 'envoy'
'connection', 'close'

[2021-02-09 14:23:14.223][24][debug][connection] [source/common/network/connection_impl.cc:132] [C1876] closing data_to_write=248 type=2
[2021-02-09 14:23:14.223][24][debug][connection] [source/common/network/connection_impl_base.cc:47] [C1876] setting delayed close timer with timeout 1000 ms
[2021-02-09 14:23:14.223][24][debug][connection] [source/common/network/connection_impl.cc:696] [C1876] write flush complete


alyssawilk commented Feb 9, 2021 via email


aqua777 commented Feb 9, 2021

@alyssawilk AFAIK, these requests are using HTTP/1.1 indeed. By upstream disconnect do you mean that in your interpretation the connection is being reset by the upstream side?

What would you recommend for capturing a packet trace? Something like tcpdump, or do you have other tools in mind?
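
If tcpdump is the way to go, I assume something like the following on the Envoy host, filtered to a single upstream endpoint (address and port are illustrative), would be enough:

tcpdump -i any -s 0 -w envoy-upstream.pcap host 10.0.0.5 and port 8080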


alyssawilk commented Feb 9, 2021 via email

antoniovicente added the area/http and triage (Issue requires triage) labels and removed the triage label on Feb 9, 2021
mattklein123 added the question (Questions that are neither investigations, bugs, nor enhancements) label and removed the area/http, bug, and triage labels on Feb 12, 2021
github-actions bot commented:

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale (stalebot believes this issue/PR has not been touched recently) label on Mar 14, 2021
github-actions bot commented:

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.


ngtuna commented May 9, 2021

@aqua777 So did you finally manage to figure out the root cause? I am hitting quite a similar issue today. It seems to be what @alyssawilk explained, but I don't know how to fix it. Any hints? Thanks!

[2021-05-10 00:37:31.782][3549452][debug][router] [source/common/router/router.cc:425] [C0][S15323405275991739010] cluster '...' match for URL '...'
[2021-05-10 00:37:31.783][3549452][debug][router] [source/common/router/router.cc:582] [C0][S15323405275991739010] router decoding headers:
':authority', '...:8443'
':path', '...'
':method', 'GET'
':scheme', 'http'
'user-agent', 'curl/7.64.1'
'accept', '*/*'
'x-forwarded-proto', 'https'
'x-request-id', 'a0fbb5a1-6680-4274-a213-2fc3184a0c71'
'x-envoy-expected-rq-timeout-ms', '15000'

[2021-05-10 00:37:31.783][3549452][debug][pool] [source/common/http/conn_pool_base.cc:79] queueing stream due to no available connections
[2021-05-10 00:37:31.783][3549452][debug][pool] [source/common/conn_pool/conn_pool_base.cc:106] creating a new connection
[2021-05-10 00:37:31.783][3549452][debug][client] [source/common/http/codec_client.cc:41] [C1] connecting
[2021-05-10 00:37:31.783][3549452][debug][connection] [source/common/network/connection_impl.cc:860] [C1] connecting to 10.28.1.4:8443
[2021-05-10 00:37:31.783][3549452][debug][connection] [source/common/network/connection_impl.cc:876] [C1] connection in progress
[2021-05-10 00:37:31.784][3549452][trace][http] [source/common/http/filter_manager.cc:498] [C0][S15323405275991739010] decode headers called: filter=0xca6da00 status=1
[2021-05-10 00:37:31.784][3549452][trace][misc] [source/common/event/scaled_range_timer_manager_impl.cc:60] enableTimer called on 0xca6d980 for 300000ms, min is 300000ms
[2021-05-10 00:37:31.784][3549452][trace][http] [source/common/http/http1/codec_impl.cc:620] [C0] parsed 155 bytes
[2021-05-10 00:37:31.784][3549452][trace][connection] [source/common/network/connection_impl.cc:548] [C0] socket event: 2
[2021-05-10 00:37:31.784][3549452][trace][connection] [source/common/network/connection_impl.cc:657] [C0] write ready
[2021-05-10 00:37:32.062][3549452][trace][connection] [source/common/network/connection_impl.cc:548] [C1] socket event: 2
[2021-05-10 00:37:32.062][3549452][trace][connection] [source/common/network/connection_impl.cc:657] [C1] write ready
[2021-05-10 00:37:32.062][3549452][debug][connection] [source/common/network/connection_impl.cc:666] [C1] connected
[2021-05-10 00:37:32.062][3549452][trace][connection] [source/common/network/connection_impl.cc:407] [C1] raising connection event 2
[2021-05-10 00:37:32.062][3549452][debug][client] [source/common/http/codec_client.cc:80] [C1] connected
[2021-05-10 00:37:32.063][3549452][debug][pool] [source/common/conn_pool/conn_pool_base.cc:225] [C1] attaching to next stream
[2021-05-10 00:37:32.063][3549452][debug][pool] [source/common/conn_pool/conn_pool_base.cc:130] [C1] creating stream
[2021-05-10 00:37:32.063][3549452][debug][router] [source/common/router/upstream_request.cc:354] [C0][S15323405275991739010] pool ready
[2021-05-10 00:37:32.063][3549452][trace][connection] [source/common/network/connection_impl.cc:465] [C1] writing 291 bytes, end_stream false
[2021-05-10 00:37:32.063][3549452][trace][connection] [source/common/network/connection_impl.cc:657] [C1] write ready
[2021-05-10 00:37:32.063][3549452][trace][connection] [source/common/network/raw_buffer_socket.cc:68] [C1] write returns: 291
[2021-05-10 00:37:32.063][3549452][trace][connection] [source/common/network/connection_impl.cc:548] [C1] socket event: 2
[2021-05-10 00:37:32.063][3549452][trace][connection] [source/common/network/connection_impl.cc:657] [C1] write ready
[2021-05-10 00:37:32.472][3549452][trace][connection] [source/common/network/connection_impl.cc:548] [C1] socket event: 3
[2021-05-10 00:37:32.472][3549452][trace][connection] [source/common/network/connection_impl.cc:657] [C1] write ready
[2021-05-10 00:37:32.472][3549452][trace][connection] [source/common/network/connection_impl.cc:586] [C1] read ready. dispatch_buffered_data=false
[2021-05-10 00:37:32.472][3549452][trace][connection] [source/common/network/raw_buffer_socket.cc:25] [C1] read returns: 0
[2021-05-10 00:37:32.472][3549452][debug][connection] [source/common/network/connection_impl.cc:634] [C1] remote close
[2021-05-10 00:37:32.472][3549452][debug][connection] [source/common/network/connection_impl.cc:241] [C1] closing socket: 0
[2021-05-10 00:37:32.472][3549452][trace][connection] [source/common/network/connection_impl.cc:407] [C1] raising connection event 0
[2021-05-10 00:37:32.472][3549452][trace][http] [source/common/http/http1/codec_impl.cc:571] [C1] parsing 0 bytes
[2021-05-10 00:37:32.472][3549452][trace][http] [source/common/http/http1/codec_impl.cc:620] [C1] parsed 0 bytes
[2021-05-10 00:37:32.472][3549452][debug][client] [source/common/http/codec_client.cc:99] [C1] disconnect. resetting 1 pending requests
[2021-05-10 00:37:32.472][3549452][debug][client] [source/common/http/codec_client.cc:125] [C1] request reset
[2021-05-10 00:37:32.472][3549452][trace][main] [source/common/event/dispatcher_impl.cc:213] item added to deferred deletion list (size=1)
[2021-05-10 00:37:32.472][3549452][debug][router] [source/common/router/router.cc:1026] [C0][S15323405275991739010] upstream reset: reset reason: connection termination, transport failure reason:
[2021-05-10 00:37:32.476][3549452][debug][http] [source/common/http/filter_manager.cc:839] [C0][S15323405275991739010] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2021-05-10 00:37:32.476][3549452][trace][misc] [source/common/event/scaled_range_timer_manager_impl.cc:60] enableTimer called on 0xca6d980 for 300000ms, min is 300000ms
[2021-05-10 00:37:32.476][3549452][debug][http] [source/common/http/conn_manager_impl.cc:1484] [C0][S15323405275991739010] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '95'
'content-type', 'text/plain'
'date', 'Sun, 09 May 2021 17:37:32 GMT'
'server', 'envoy'


ngtuna commented May 9, 2021

This is the line that confuses me:

[2021-05-10 00:37:32.476][3549452][debug][http] [source/common/http/filter_manager.cc:839] [C0][S15323405275991739010] Sending local reply with details upstream_reset_before_response_started{connection termination}

alyssawilk (Contributor) commented:

From https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_conn_man/response_code_details:

The upstream connection was reset before a response was started. This may include further details about the cause of the disconnect.

So theoretically your upstream is closing the connection before it sends any part of the response.

If there's areas we can improve docs here to make it easier to understand, please let us know!

elias4ty commented:

I encountered this problem too! My upstream server is nginx, and the nginx log entry for the request from envoy is: "PRI * HTTP/2.0" 400 157 "-" "-" "-". It is so weird: envoy sends the request to nginx over HTTP/1.1, so why does nginx treat it as an HTTP/2.0 or gRPC request?


ivpr commented Apr 24, 2022

@aqua777 Were you able to identify the problem? We are facing a similar issue in our production.
@alyssawilk Is configuring envoy to retry on reset the right way to work around the issue?


rajetta commented May 5, 2022

@ivpr we are getting a similar error after updating envoy to 1.17 - in our case it happens for every request. Did configuring retry on reset help?

jbilliau-rcd commented:

We are seeing this as well: 3 requests out of 24,000+ had a 503 UC error, and we are not sure why. Pods are all healthy, have been up for days, and resource usage is low... anyone figure this out?

benedikt-roth commented:

We are also facing this issue and have no idea why. All pods are up and worked fine in the past. After the Envoy integration this changed, and we now see the errors described above.


aqua777 commented Jul 11, 2022

We ended up configuring Envoy to retry a request once when the connection to the backend server is reset. This eliminated the majority of the errors, although we still get some.


cysnet commented Jul 26, 2022

@aqua777 why not paste your fix here to help other users?


aqua777 commented Jul 27, 2022

We just followed Envoy documentation on Retry policies:

match:
  prefix: /some/url/prefix
route:
  retry_policy:
    retry_on: reset

You can also specify num_retries, which defaults to 1; that is what we wanted, so we didn't add it explicitly to the configuration.
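
For completeness, the same policy with the optional fields spelled out might look roughly like this (values illustrative, not what we run in production):

route:
  retry_policy:
    retry_on: reset,connect-failure
    num_retries: 2
    per_try_timeout: 5s

Keep in mind that retrying on reset will also re-send non-idempotent requests (e.g. POST) whose connection was dropped mid-request, so consider which routes you apply it to.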

xiexianbin commented:

I also hit this error, and found expired-certificate TLS logs in istio-ingressgateway:

2022-08-31T12:40:11.890424Z	debug	envoy pool	[C18719] client disconnected, failure reason: TLS error: 268436501:SSL routines:OPENSSL_internal:SSLV3_ALERT_CERTIFICATE_EXPIRED 33554464:system library:OPENSSL_internal:Broken pipe

So I deleted the ingressgateway pod (kubectl -n istio-system delete pod/istio-ingressgateway-xxx), and retrying afterwards works.
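
If someone wants to confirm an expired certificate on an upstream endpoint before restarting anything, a quick check might be (hostname and port are illustrative):

openssl s_client -connect upstream.example.com:443 -servername upstream.example.com </dev/null 2>/dev/null | openssl x509 -noout -dates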

chuongnguyen5397 commented:

I'm facing the same issue, do we have any conclusion about the reason?

Here is my config:

  static_resources:
    listeners:
      - name: listener
        address:
          socket_address:
            address: 0.0.0.0
            port_value: 10000
        filter_chains:
          - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config: 
                '@type': "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager"
                stat_prefix: http_proxy
                access_log:
                  - name: envoy.access_loggers.stdout
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
                route_config: 
                  name: all
                  virtual_hosts:
                    - name: local_service
                      domains: 
                        - '*'
                      routes:
                        - match: { prefix: "/api/v1/intranet"}
                          route:
                            prefix_rewrite: "/api/v1"
                            host_rewrite_literal: "random.execute-api.ap-southeast-1.amazonaws.com"
                            cluster: allbackend_cluster
                http_filters:
                    - name: envoy.filters.http.router
    clusters:
      - name: allbackend_cluster
        connect_timeout: 10s
        type: strict_dns
        lb_policy: round_robin
        load_assignment:
          cluster_name: allbackend_cluster
          endpoints:
            - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: random.execute-api.ap-southeast-1.amazonaws.com
                      port_value: 443
        transport_socket:
          name: envoy.transport_sockets.tls
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext

And the logs are showing

[2022-09-05 14:28:52.796][19][debug][router] [source/common/router/router.cc:1091] [C20][S2573947341148914084] upstream reset: reset reason: connection termination, transport failure reason: 
[2022-09-05 14:28:52.796][19][debug][http] [source/common/http/filter_manager.cc:883] [C20][S2573947341148914084] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2022-09-05 14:28:52.796][19][debug][http] [source/common/http/conn_manager_impl.cc:1455] [C20][S2573947341148914084] encoding headers via codec (end_stream=false):


cyanogen12 commented Nov 30, 2022

Hi All,

I am also facing a similar issue with Envoy version 1.24.
I am running envoy on my local machine (MacBook Pro).

I have created three upstream clusters:
Clusters 1 & 2 are hosted locally on the same machine as https endpoints, and they work well.

For the third cluster, "httpbin", I am using the public service https://httpbin.org/ip, and this also works well.

There is another public https API which I want to consume; it always fails through Envoy but works well when invoked directly from the terminal/Postman. Cluster: apigee_cluster.

Scenario 1: When I force http1 on the cluster using this configuration, I get a 404 response although the endpoint actually exists:
explicit_http_config:
  http_protocol_options: {}
Endpoint I am trying to invoke: https://a623954-eval-test.apigee.net/ip
Refer to log file: Envoy_Failure_http1.log

Scenario 2: When I force http2 on the cluster using this configuration, I get a 502 UPE response although the endpoint actually exists:
explicit_http_config:
  http2_protocol_options: {}
Endpoint I am trying to invoke: https://a623954-eval-test.apigee.net/ip
Refer to log file: Envoy_Failure_http2.log

Scenario 3: When I invoke the endpoint I am really interested in, https://stage.api.fil.com/ip, with the http1 option, I get a 503 UC error.
Refer to log file: Envoy_Failure_stage-api-fil-com_http1.log

Scenario 4: When I invoke the endpoint https://stage.api.fil.com/ip with the http2 option, I get a 502 UPE error.
Refer to log file: Envoy_Failure_stage-api-fil-com_http2.log

Envoy Config File:
# This envoy setup TLS at listener as well as Cluster
# command to generate self signed certificate:
# openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -sha256 -days 365 -nodes -subj "/C=IN/ST=Haryana/L=Gurugram/O=Vivek/OU=Envoy/CN=localhost"
static_resources:
  listeners:
    - address:
        socket_address:
          address: 0.0.0.0
          port_value: 443
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                access_log:
                  - name: envoy.access_loggers.stdout
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
                      log_format:
                        text_format: "[%START_TIME%] %REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL% %UPSTREAM_CLUSTER% %UPSTREAM_PROTOCOL% %UPSTREAM_TLS_VERSION% %UPSTREAM_TLS_CIPHER% %UPSTREAM_TRANSPORT_FAILURE_REASON% %RESPONSE_CODE% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% %REQ(X-FORWARDED-FOR)% %REQ(USER-AGENT)% %REQ(X-REQUEST-ID)% %REQ(:AUTHORITY)% %UPSTREAM_REQUEST_ATTEMPT_COUNT% %UPSTREAM_LOCAL_ADDRESS% %UPSTREAM_REMOTE_ADDRESS% \n"
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                route_config:
                  name: all
                  virtual_hosts:
                    - name: allbackend_cluster
                      domains: ["localhost"]
                      routes:
                        - match:
                            prefix: /ip
                            headers:
                              - name: "Route"
                                exact_match: "httpbin"
                          route:
                            cluster: httpbin_cluster
                        - match:
                            prefix: /ip
                            headers:
                              - name: "Route"
                                exact_match: "apigee"
                          route:
                            cluster: apigee_cluster
                        - match:
                            prefix: /app1
                          route:
                            cluster: app1_cluster
                        - match:
                            prefix: /app2
                          route:
                            cluster: app2_cluster
                        - match:
                            prefix: /
                          route:
                            cluster: allbackend_cluster
          transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
              common_tls_context:
                tls_certificates:
                  - certificate_chain: {filename: "vivek_cert.pem"}
                    private_key: {filename: "vivek_key.pem"}

  clusters:
    - name: app1_cluster
      connect_timeout: 1s
      type: strict_dns
      lb_policy: round_robin
      load_assignment:
        cluster_name: app1_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: localhost
                      port_value: 8000
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext

    - name: app2_cluster
      connect_timeout: 1s
      type: strict_dns
      lb_policy: round_robin
      load_assignment:
        cluster_name: app2_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: localhost
                      port_value: 8100
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext

    - name: httpbin_cluster
      connect_timeout: 1s
      type: strict_dns
      lb_policy: round_robin
      load_assignment:
        cluster_name: httpbin_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: httpbin.org
                      port_value: 443
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext

    - name: apigee_cluster
      connect_timeout: 10s
      type: strict_dns
      lb_policy: round_robin
      max_requests_per_connection: 1
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          upstream_http_protocol_options:
            auto_sni: true
          common_http_protocol_options:
            idle_timeout: 10s
          explicit_http_config:
            http2_protocol_options: {}
      load_assignment:
        cluster_name: apigee_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: a623954-eval-test.apigee.net
                      port_value: 443
      transport_socket:
        name: envoy.transport_sockets.tap
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tap.v3.Tap
          common_config:
            static_config:
              match_config:
                any_match: true
              output_config:
                sinks:
                  - format: PROTO_BINARY
                    file_per_tap:
                      path_prefix: /Users/admin/Work/envoy_wrkspace/Envoy_Tutorial
          transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
              common_tls_context:
                tls_params:
                  tls_maximum_protocol_version: TLSv1_3
                  tls_minimum_protocol_version: TLSv1_2

    - name: allbackend_cluster
      connect_timeout: 1s
      type: strict_dns
      lb_policy: round_robin
      load_assignment:
        cluster_name: allbackend_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: localhost
                      port_value: 3000
              - endpoint:
                  address:
                    socket_address:
                      address: localhost
                      port_value: 3111
              - endpoint:
                  address:
                    socket_address:
                      address: localhost
                      port_value: 3222
              - endpoint:
                  address:
                    socket_address:
                      address: localhost
                      port_value: 3333

admin:
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901


marniks7 commented Dec 10, 2022

Hi, we are also facing this 503 UC issue during high load in production, and can reproduce it in our load-testing environment at a rate of about 13 out of 100k requests.

  • Envoy v1.20.7
  • Downstream and Upstream are both spring boot java apps

We are going to implement the retry policy suggested above, but maybe this information helps with finding the root cause or an alternative solution.

Logs

no tcpdump gathered so far

[2022-12-02 18:08:52.760][23][debug][http] [source/common/http/conn_manager_impl.cc:274] [C4381] new stream
[2022-12-02 18:08:52.760][23][debug][http] [source/common/http/conn_manager_impl.cc:867] [C4381][S2002065306070302618] request headers complete (end_stream=false):
':authority', 'xxx:8080'
':path', '/api/v1/xxx'
':method', 'POST'
'accept', 'application/json'
'content-type', 'application/json'
'authorization', 'Bearer xxx'
'tenant', '76356188-e2de-4e7f-9628-3b637dab3960'
'x-request-id', '69b7c8f56d0dce335c07c6e6f4b01e12'
'x-b3-traceid', 'a63dd122acf1982e'
'x-b3-spanid', '620c330fe2028275'
'x-b3-parentspanid', '07dec65c20d7d27e'
'x-b3-sampled', '1'
'accept-encoding', 'gzip, deflate'
'content-length', '3301'
'connection', 'Keep-Alive'
'user-agent', 'Apache-HttpClient/4.5.13 (Java/17.0.4.1)'

[2022-12-02 18:08:52.760][23][debug][router] [source/common/router/router.cc:457] [C4381][S2002065306070302618] cluster 'xxx' match for URL '/api/v1/xxx'
[2022-12-02 18:08:52.761][23][debug][router] [source/common/router/router.cc:673] [C4381][S2002065306070302618] router decoding headers:
':authority', 'xxx:8080'
':path', '/api/v1/xxx'
':method', 'POST'
':scheme', 'http'
'accept', 'application/json'
'content-type', 'application/json'
'authorization', 'Bearer xxx'
'tenant', '76356188-e2de-4e7f-9628-3b637dab3960'
'x-request-id', '69b7c8f56d0dce335c07c6e6f4b01e12'
'x-b3-traceid', 'a63dd122acf1982e'
'x-b3-spanid', '620c330fe2028275'
'x-b3-parentspanid', '07dec65c20d7d27e'
'x-b3-sampled', '1'
'accept-encoding', 'gzip, deflate'
'content-length', '3301'
'user-agent', 'Apache-HttpClient/4.5.13 (Java/17.0.4.1)'
'x-forwarded-proto', 'http'
'x-envoy-expected-rq-timeout-ms', '120000'

[2022-12-02 18:08:52.761][23][debug][pool] [source/common/http/conn_pool_base.cc:74] queueing stream due to no available connections
[2022-12-02 18:08:52.761][23][debug][pool] [source/common/conn_pool/conn_pool_base.cc:256] trying to create new connection
[2022-12-02 18:08:52.761][23][debug][pool] [source/common/conn_pool/conn_pool_base.cc:144] creating a new connection
[2022-12-02 18:08:52.761][23][debug][client] [source/common/http/codec_client.cc:60] [C4599] connecting
[2022-12-02 18:08:52.761][23][debug][connection] [source/common/network/connection_impl.cc:890] [C4599] connecting to 10.101.11.122:8080
[2022-12-02 18:08:52.761][23][debug][connection] [source/common/network/connection_impl.cc:909] [C4599] connection in progress
[2022-12-02 18:08:52.761][23][debug][http] [source/common/http/filter_manager.cc:841] [C4381][S2002065306070302618] request end stream
[2022-12-02 18:08:52.761][23][debug][connection] [source/common/network/connection_impl.cc:672] [C4599] connected
[2022-12-02 18:08:52.761][23][debug][client] [source/common/http/codec_client.cc:88] [C4599] connected
[2022-12-02 18:08:52.761][23][debug][pool] [source/common/conn_pool/conn_pool_base.cc:294] [C4599] attaching to next stream
[2022-12-02 18:08:52.761][23][debug][pool] [source/common/conn_pool/conn_pool_base.cc:177] [C4599] creating stream
[2022-12-02 18:08:52.761][23][debug][router] [source/common/router/upstream_request.cc:416] [C4381][S2002065306070302618] pool ready
[2022-12-02 18:08:52.936][25][debug][connection] [source/common/network/connection_impl.cc:640] [C4578] remote close
[2022-12-02 18:08:52.936][25][debug][connection] [source/common/network/connection_impl.cc:249] [C4578] closing socket: 0
[2022-12-02 18:08:52.936][25][debug][client] [source/common/http/codec_client.cc:106] [C4578] disconnect. resetting 1 pending requests
[2022-12-02 18:08:52.936][25][debug][client] [source/common/http/codec_client.cc:142] [C4578] request reset
[2022-12-02 18:08:52.936][25][debug][router] [source/common/router/router.cc:1130] [C4250][S14424083748194906660] upstream reset: reset reason: connection termination, transport failure reason:
[2022-12-02 18:08:52.936][28][debug][connection] [source/common/network/connection_impl.cc:640] [C4587] remote close
[2022-12-02 18:08:52.936][28][debug][connection] [source/common/network/connection_impl.cc:249] [C4587] closing socket: 0
[2022-12-02 18:08:52.936][23][debug][connection] [source/common/network/connection_impl.cc:640] [C4561] remote close
[2022-12-02 18:08:52.936][23][debug][connection] [source/common/network/connection_impl.cc:249] [C4561] closing socket: 0
[2022-12-02 18:08:52.936][28][debug][client] [source/common/http/codec_client.cc:106] [C4587] disconnect. resetting 1 pending requests
[2022-12-02 18:08:52.936][28][debug][client] [source/common/http/codec_client.cc:142] [C4587] request reset
[2022-12-02 18:08:52.936][28][debug][router] [source/common/router/router.cc:1130] [C4166][S8149778245575148947] upstream reset: reset reason: connection termination, transport failure reason:
[2022-12-02 18:08:52.936][25][debug][http] [source/common/http/filter_manager.cc:953] [C4250][S14424083748194906660] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2022-12-02 18:08:52.936][23][debug][client] [source/common/http/codec_client.cc:106] [C4561] disconnect. resetting 1 pending requests
[2022-12-02 18:08:52.936][23][debug][client] [source/common/http/codec_client.cc:142] [C4561] request reset
[2022-12-02 18:08:52.936][23][debug][router] [source/common/router/router.cc:1130] [C4242][S16814319887871412706] upstream reset: reset reason: connection termination, transport failure reason:
[2022-12-02 18:08:52.936][25][debug][http] [source/common/http/conn_manager_impl.cc:1467] [C4250][S14424083748194906660] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '95'
'content-type', 'text/plain'
'date', 'Fri, 02 Dec 2022 18:08:52 GMT'

[2022-12-02 18:08:52.936][23][debug][http] [source/common/http/filter_manager.cc:953] [C4242][S16814319887871412706] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2022-12-02 18:08:52.936][28][debug][http] [source/common/http/filter_manager.cc:953] [C4166][S8149778245575148947] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2022-12-02 18:08:52.936][23][debug][http] [source/common/http/conn_manager_impl.cc:1467] [C4242][S16814319887871412706] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '95'
'content-type', 'text/plain'
'date', 'Fri, 02 Dec 2022 18:08:52 GMT'

Observation

It looks like the connection sits idle in the connection pool; sometimes it stays unused for 60s, and when it is then picked up for a request, the request fails with this error.
The question is: where is this 60s timeout configured? On the upstream side? I was not able to find it there yet...
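
One thing we may try (a sketch, assuming the 60s comes from the upstream application's keep-alive/idle timeout rather than from Envoy itself): set the cluster's upstream idle timeout below the backend's value, so Envoy retires pooled connections before the server can close them while a request is in flight:

typed_extension_protocol_options:
  envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
    "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
    common_http_protocol_options:
      idle_timeout: 50s   # keep this below the backend's own keep-alive timeout
    explicit_http_config:
      http_protocol_options: {}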

kurktchiev commented:

anyone find anything on this issue?

iangelov commented:

It's been a while since I've last dealt with this, but my case ultimately had to do with mismatched timeout settings, with Envoy having a higher default idle timeout value than one of the backend applications in an Istio mesh. Envoy would thus try and keep the connection open for a longer amount of time than the backend application's settings allow, and the backend thus severed the connection once the timeout value was reached. The solution was to add DestinationRule resources which overrode the default idle timeout on the Envoy side of things. You could, of course, also adjust the backend application's settings if it's practical/appropriate.

In case it might be useful, I figured this out by running tcpdump and noticing that the 503s were returned after exactly n seconds on every instance. From there on, it was just a matter of checking the backend and Envoy's timeout values, confirming the match and ultimately fixing the issue.
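
For the Istio case, the override we added looked roughly like this (a sketch; the host name and timeout value are illustrative):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: backend-idle-timeout
spec:
  host: backend.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        idleTimeout: 30s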


triplewy commented Aug 15, 2023

@marniks7 Did you ever find out where the 60 second timeout comes from? I am seeing that same 60 second timeout error even though I set idle-timeout to 30 seconds and request timeout to 15 seconds. I also found that this issue only occurs with TCP downstream and not QUIC downstream despite the issue being caused by upstream.
