Envoy intermittently responds with 503 UC (upstream_reset_before_response_started{connection_termination}) #14981
Comments
cc @alyssawilk One possible explanation for this class of problems may be that the upstream server closes the connection as the proxy starts sending the request. It may be helpful to know how long the upstream connection was open prior to the first request being sent on it. |
Thanks for the possible explanation. We considered this, but several factors argue against that assumption. Prior to using Envoy, our load balancing was based on hardware load balancers, and with those we never saw this behavior, i.e. connection terminations by backend servers. After switching to Envoy we started seeing the behavior described in this issue across multiple clusters, large and small, using different web servers and serving varying amounts of traffic. |
The error code indicates that it was not Envoy that terminated the connection but rather the upstream server (or intervening network infrastructure). Did you look at the counters? For instance any of the |
@yanavlasov thank you for the suggestion. In our infrastructure we have restrictions on how many metrics we can send to our Prometheus servers, so we had to configure Envoy with inclusion patterns for metrics. |
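For reference, a minimal sketch of such a bootstrap-level stats inclusion filter (the cluster name, stat prefixes, and listener stat_prefix are placeholders, not the reporter's actual configuration). An inclusion list like this can easily hide the upstream_cx_destroy_* counters that would show who closed the connection:

stats_config:
  stats_matcher:
    inclusion_list:
      patterns:
      # Only stats matching these prefixes are emitted; everything else is dropped.
      - prefix: "cluster.some_cluster.upstream_rq_"
      # Covers upstream_cx_destroy, upstream_cx_destroy_remote and
      # upstream_cx_destroy_remote_with_active_rq, which help distinguish
      # whether Envoy or the upstream tore down the connection.
      - prefix: "cluster.some_cluster.upstream_cx_destroy"
      - prefix: "http.ingress_http.downstream_rq_"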
We've managed to capture one of the requests resulting in a 503 response code; see the output below:
|
So offhand this still looks like an upstream disconnect. Any chance you can capture a packet trace to see if that's what is happening?
Are you using HTTP/1.1 by any chance? There's a known race where, if upstream sends response1 and closes the connection _without_ sending the normal connection close headers, Envoy will read the response and can reassign the connection to a new stream before picking up the TCP FIN, resulting in a near-immediate reset.
On Tue, Feb 9, 2021 at 10:50 AM, aqua777 wrote:
[2021-02-09 14:23:14.220][24][debug][conn_handler] [source/server/connection_handler_impl.cc:501] [C1876] new connection
[2021-02-09 14:23:14.222][24][debug][http] [source/common/http/conn_manager_impl.cc:254] [C1876] new stream
[2021-02-09 14:23:14.222][24][debug][http] [source/common/http/conn_manager_impl.cc:886] [C1876][S15748738241223205728] request headers complete (end_stream=true):
':authority', '...'
':path', '/...'
':method', 'GET'
'x-jwt', '***'
'accept', 'application/json'
'connection', 'close'
[2021-02-09 14:23:14.222][24][debug][http] [source/common/http/filter_manager.cc:755] [C1876][S15748738241223205728] request end stream
[2021-02-09 14:23:14.222][24][debug][router] [source/common/router/router.cc:425] [C1876][S15748738241223205728] cluster 'some_cluster' match for URL '/...'
[2021-02-09 14:23:14.222][24][debug][router] [source/common/router/router.cc:582] [C1876][S15748738241223205728] router decoding headers:
':authority', '...'
':path', '/...'
':method', 'GET'
':scheme', 'http'
'x-jwt', '***'
'accept', 'application/json'
'x-forwarded-for', '10.64.x.x'
'x-forwarded-proto', 'https'
'x-envoy-internal', 'true'
'x-request-id', 'f3859bd9-2f6b-4e97-af6a-1ead24aef8a7'
'x-envoy-expected-rq-timeout-ms', '303000'
[2021-02-09 14:23:14.222][24][debug][router] [source/common/router/upstream_request.cc:354] [C1876][S15748738241223205728] pool ready
[2021-02-09 14:23:14.223][24][debug][router] [source/common/router/router.cc:1026] [C1876][S15748738241223205728] upstream reset: reset reason: connection termination, transport failure reason:
[2021-02-09 14:23:14.223][24][debug][http] [source/common/http/filter_manager.cc:839] [C1876][S15748738241223205728] Sending local reply with details upstream_reset_before_response_started{connection termination}
[2021-02-09 14:23:14.223][24][debug][http] [source/common/http/conn_manager_impl.cc:1429] [C1876][S15748738241223205728] closing connection due to connection close header
[2021-02-09 14:23:14.223][24][debug][http] [source/common/http/conn_manager_impl.cc:1484] [C1876][S15748738241223205728] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '95'
'content-type', 'text/plain'
'date', 'Tue, 09 Feb 2021 14:23:13 GMT'
'server', 'envoy'
'connection', 'close'
[2021-02-09 14:23:14.223][24][debug][connection] [source/common/network/connection_impl.cc:132] [C1876] closing data_to_write=248 type=2
[2021-02-09 14:23:14.223][24][debug][connection] [source/common/network/connection_impl_base.cc:47] [C1876] setting delayed close timer with timeout 1000 ms
[2021-02-09 14:23:14.223][24][debug][connection] [source/common/network/connection_impl.cc:696] [C1876] write flush complete
|
@alyssawilk AFAIK, these requests are indeed using HTTP/1.1. By upstream disconnect do you mean that in your interpretation the connection is being reset by the upstream side? What would you recommend for capturing a packet trace? Something like tcpdump, or do you have other tools in mind? |
Yeah, so if I had to bet I'd guess your upstream is sending a response without a connection close header, then closing the connection.
A TCP dump will tell you if this is indeed what's happening (if upstream is sending that TCP FIN), and confirm or deny that you're hitting the race I mentioned. Unfortunately there's not a ton Envoy can do here today; generally we encourage folks to make sure that if the upstream is closing HTTP/1.1 connections, it uses the connection: close header (which will cause Envoy to not reuse the connection and avoid the race entirely).
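A minimal sketch of cluster-side settings along these lines, assuming a plain DNS cluster (the cluster name, endpoint, and timeout values are placeholders, not configuration taken from this thread):

# Either stop reusing upstream HTTP/1.1 connections altogether, or close
# idle connections before the backend's own keep-alive timeout fires.
clusters:
- name: some_cluster
  connect_timeout: 1s
  type: STRICT_DNS
  max_requests_per_connection: 1        # one request per upstream connection: no reuse, no race (at the cost of more connection churn)
  common_http_protocol_options:
    idle_timeout: 5s                    # keep this below the backend's keep-alive timeout
  load_assignment:
    cluster_name: some_cluster
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: backend.internal, port_value: 8080 }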
|
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions. |
@aqua777 So did you finally manage to figure out the root cause? I am getting quite a similar issue today. It seems to be what @alyssawilk explained, but I don't know how to fix it. Any hint? Thanks!
|
This line makes me very confused
|
From the docs: The upstream connection was reset before a response was started. This may include further details about the cause of the disconnect. So theoretically your upstream is closing the connection before it sends any part of the response. If there are areas where we can improve the docs to make this easier to understand, please let us know! |
I encountered this problem too! My upstream server is nginx, and the nginx access log for the request coming from Envoy shows: "PRI * HTTP/2.0" 400 157 "-" "-" "-". That is so weird: Envoy sends the request to nginx over HTTP/1.1, so why does nginx treat it as an HTTP/2 or gRPC request? |
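The "PRI * HTTP/2.0" line in the nginx log is the HTTP/2 connection preface, which usually means the Envoy cluster is configured to speak HTTP/2 to an upstream that only accepts HTTP/1.1. A sketch of the cluster-level protocol selection involved (cluster name and address are placeholders; the typed_extension_protocol_options form shown applies to recent Envoy versions):

# Selecting explicit HTTP/1.1 for the upstream. If the cluster instead
# enables HTTP/2 (e.g. via http2_protocol_options or an explicit HTTP/2
# config), Envoy opens the connection with the "PRI * HTTP/2.0" preface,
# which a plain-HTTP/1.1 nginx rejects with a 400.
clusters:
- name: nginx_backend
  connect_timeout: 1s
  type: STRICT_DNS
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      explicit_http_config:
        http_protocol_options: {}       # plain HTTP/1.1 upstream
  load_assignment:
    cluster_name: nginx_backend
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: nginx.internal, port_value: 80 }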
@aqua777 Were you able to identify the problem? We are facing a similar issue in our production |
@ivpr we are getting a similar error after updating Envoy to 1.17; in our case it happens for every request. Did configuring retry on reset help? |
We are seeing this as well: 3 requests out of 24,000+ had a 503 UC error, and we're not sure why. Pods are all healthy, have been up for days, and resource usage is low... has anyone figured this out? |
We are also facing this issue and have no idea why. All pods are up and worked fine in the past. After integrating Envoy this changed, and we now see the errors described above. |
We ended up configuring Envoy to retry a request once when the connection to the backend server is reset. This eliminated the majority of the errors, although we still see some. |
@aqua777 why not paste your fix here to help other users? |
We just followed the Envoy documentation on retry policies:
match:
  prefix: /some/url/prefix
route:
  retry_policy:
    retry_on: reset
You can also specify |
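For completeness, a slightly fuller sketch of such a route entry (the prefix, cluster name, retry count, and timeout are placeholders; field names follow the v3 route API):

routes:
- match:
    prefix: "/some/url/prefix"
  route:
    cluster: some_cluster
    retry_policy:
      retry_on: reset            # retry when the upstream resets the connection
      num_retries: 1             # a single retry, as described in the comment above
      per_try_timeout: 5s        # bound each attempt separately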
I also ran into this error and found SSL-certificate-expired logs in istio-ingressgateway, so I deleted the ingressgateway pod. |
I'm facing the same issue; do we have any conclusion about the cause? Here is my config:
And the logs show:
|
Hi All, I am also facing a similar issue with Envoy version 1.24. I have created three upstream clusters. For the third cluster, "httpbin", I am using the public service https://httpbin.org/ip and it works well. There is another public HTTPS API which I want to consume; it always fails through Envoy, while it works fine when invoked directly from the terminal or Postman.
Cluster: apigee_cluster
Scenario 3: I invoke the endpoint https://stage.api.fil.com/ip, which is the one I am really interested in, with the http1 option and get a 503 UC error.
Scenario 4: I invoke the same endpoint with the http2 option and get a 502 UPE error.
Envoy Config File:
admin: |
anyone find anything on this issue? |
It's been a while since I last dealt with this, but my case ultimately came down to mismatched timeout settings: Envoy had a higher default idle timeout than one of the backend applications in an Istio mesh. Envoy would thus try to keep the connection open longer than the backend application's settings allowed, and the backend severed the connection once its timeout was reached. The solution was to add DestinationRule resources that overrode the default idle timeout on the Envoy side. You could, of course, also adjust the backend application's settings if that is practical/appropriate. In case it might be useful, I figured this out by running tcpdump and noticing that the 503s were returned after exactly n seconds every time. From there it was just a matter of checking the backend's and Envoy's timeout values, confirming the mismatch, and fixing the issue. |
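As an illustration of the kind of DestinationRule described here (the host name and timeout value are placeholders, not the commenter's actual resource):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: backend-idle-timeout
spec:
  host: backend.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        # Close idle upstream connections before the backend's own
        # keep-alive/idle timeout expires.
        idleTimeout: 30s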
@marniks7 Did you ever find out where the 60 second timeout comes from? I am seeing that same 60 second timeout error even though I set idle-timeout to 30 seconds and request timeout to 15 seconds. I also found that this issue only occurs with TCP downstream and not QUIC downstream despite the issue being caused by upstream. |
Title: Envoy intermittently responds with 503 UC (upstream_reset_before_response_started{connection_termination})
Description:
We have a group of Envoy servers running as an edge proxy for a number of backend services. We configure these Envoy servers using an in-house-built xDS gRPC server. See the example configuration dump attached to this issue.
We started noticing intermittent responses with error code 503, response flag "UC", and
upstream_reset_before_response_started{connection_termination}
appearing in the response code details. It seems that these responses are produced by Envoy and not returned by the backend services. We observed this issue while running Envoy versions v1.16 and v1.17. We looked at some previously reported, similar GitHub issues, such as:
We tried to change some configuration settings, namely:
None of the actions mentioned above had any influence on the behavior of our Envoy fleet.
Repro steps:
We have no repeatable method for reproducing these errors. They are very sporadic, i.e. 1-2 per 100,000 requests, and spread across various clusters. Some of these clusters receive a large and constant volume of requests, so it is very unlikely that the upstream connections become idle.
Config:
See the configuration included in the tarball attached to this issue.
Logs:
This issue occurs in our production environment, where we do not have debug-level logging enabled.
Can you please advise: