Description
As old pods spin down during a rolling update, some requests return 503. Istio is hiding the more detailed error message; when bypassing Istio with a Service of type: LoadBalancer, the error is:
Post http://a2b34b2aa4ec411eaab8a0a465d27bdf-e1fddee29350e253.elb.us-west-2.amazonaws.com/predict: read tcp 172.31.1.222:42654->54.71.144.207:80: read: connection reset by peer
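To make the two failure modes concrete, here is a minimal Go sketch (not project code; the helper name and the empty request body are made up, and the ELB hostname is simply taken from the error above). It distinguishes an HTTP 503 returned through Istio from a transport-level connection reset seen when bypassing it:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
	"syscall"
)

// classifyRequest sends one POST to the prediction endpoint and reports whether
// a failure surfaced as an HTTP 503 response (what Istio returns) or as a raw
// TCP connection reset (what the client sees when bypassing Istio).
func classifyRequest(url string) {
	resp, err := http.Post(url, "application/json", strings.NewReader(`{}`))
	if err != nil {
		if errors.Is(err, syscall.ECONNRESET) {
			fmt.Println("transport error: connection reset by peer")
		} else {
			fmt.Println("transport error:", err)
		}
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusServiceUnavailable {
		fmt.Println("HTTP 503 returned (Istio upstream error)")
	} else {
		fmt.Println("status:", resp.Status)
	}
}

func main() {
	// The ELB hostname comes from the error message above; the empty JSON body
	// is a placeholder, not the actual iris payload.
	classifyRequest("http://a2b34b2aa4ec411eaab8a0a465d27bdf-e1fddee29350e253.elb.us-west-2.amazonaws.com/predict")
}
```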
Reproduction
Create an iris deployment with min and max replicas both set to e.g. 2. Run dev/load.go with at least 100 concurrent threads and no delays. Perform a rolling update, and watch for 503 errors (or the connection reset by peer error if using the load balancer service) as the old pods are terminating.
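For context, a minimal sketch of a load generator in the spirit of dev/load.go (this is not the actual script; the concurrency, request count, endpoint, and payload are placeholders):

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"sync"
	"sync/atomic"
)

func main() {
	const (
		concurrency       = 100                                       // >= 100 threads, per the repro steps
		requestsPerWorker = 1000                                      // arbitrary
		url               = "http://<load-balancer-hostname>/predict" // placeholder endpoint
	)

	var ok, unavailable, transportErrs int64
	var wg sync.WaitGroup

	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < requestsPerWorker; j++ {
				// No delay between requests.
				resp, err := http.Post(url, "application/json", strings.NewReader(`{}`))
				if err != nil {
					// e.g. "read: connection reset by peer" when bypassing Istio
					atomic.AddInt64(&transportErrs, 1)
					continue
				}
				if resp.StatusCode == http.StatusServiceUnavailable {
					atomic.AddInt64(&unavailable, 1)
				} else {
					atomic.AddInt64(&ok, 1)
				}
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()

	fmt.Printf("ok=%d  503=%d  transport errors=%d\n", ok, unavailable, transportErrs)
}
```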
Relevant info
Possibly related issue
Also, @vishalbollu reported that during large scale-ups that require many new nodes (e.g. 1 -> 200 nodes), some 503 errors are also seen. This may or may not share the same root cause; a separate ticket should be created to track it if the fix for this issue doesn't resolve it.
Possibly related issue 2
It seems that deleting an API, deploying it again, waiting for the previous pod to terminate, and then creating a high volume of parallel requests also results in 503 errors. See the networking-debugging branch.