Intermittent failed requests during rolling updates #814

Closed
deliahu opened this issue Feb 14, 2020 · 0 comments · Fixed by #1526
Labels
bug Something isn't working
Milestone
v0.22
Comments


deliahu commented Feb 14, 2020

Description

As old pods spin down during a rolling update, some requests return 503. Istio hides the more detailed error message; when bypassing Istio with a Service of type: LoadBalancer, the error is:

Post http://a2b34b2aa4ec411eaab8a0a465d27bdf-e1fddee29350e253.elb.us-west-2.amazonaws.com/predict: read tcp 172.31.1.222:42654->54.71.144.207:80: read: connection reset by peer

Reproduction

Create an iris deployment with min and max replicas set to e.g. 2. Run dev/load.go with at least 100 concurrent threads and no delays. Perform a rolling update, and watch for 503 errors (or the connection reset by peer error if using the load balancer service) as the old pods terminate.
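
For reference, a minimal load generator along these lines might look like the sketch below (this is not the actual dev/load.go; the endpoint URL, payload, and request counts are illustrative placeholders):

```go
// Hypothetical load generator: hammer the /predict endpoint with many
// concurrent workers and count 503s / connection resets while a rolling
// update is in progress.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
)

func main() {
	url := "http://<api-endpoint>/predict" // placeholder: replace with the API's endpoint
	payload := []byte(`{"sepal_length": 5.2, "sepal_width": 3.6, "petal_length": 1.4, "petal_width": 0.3}`)

	const workers = 100          // at least 100 concurrent threads
	const requestsPerWorker = 1000

	var total, failed int64
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < requestsPerWorker; j++ {
				atomic.AddInt64(&total, 1)
				resp, err := http.Post(url, "application/json", bytes.NewReader(payload))
				if err != nil {
					// e.g. "read: connection reset by peer" when bypassing Istio
					atomic.AddInt64(&failed, 1)
					fmt.Println("request error:", err)
					continue
				}
				if resp.StatusCode == http.StatusServiceUnavailable {
					atomic.AddInt64(&failed, 1)
				}
				resp.Body.Close()
			}
		}()
	}

	wg.Wait()
	fmt.Printf("total=%d failed=%d\n", total, failed)
}
```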

Relevant info

Possibly related issue

Also, @vishalbollu reported that during large scale-ups that require many new nodes (e.g. 1 -> 200 nodes), some 503 errors are seen. This may or may not share the same root cause; a separate ticket should be created to track it if the fix for this issue doesn't resolve it.

Possibly related issue 2

It seems that deleting an API, deploying it again, waiting for the previous pod to terminate, and then creating a high volume of parallel requests also results in 503 errors. See the networking-debugging branch.

deliahu added the bug (Something isn't working) and v0.15 labels on Feb 14, 2020
deliahu added the v0.16 label and removed the v0.15 label on Mar 9, 2020
deliahu changed the title from "Intermittent failed requests during rolling updates" to "Intermittent failed requests during rolling updates (and possibly scale ups)" on Mar 25, 2020
deliahu added the v0.17 label and removed the v0.16 label on Apr 14, 2020
deliahu removed the v0.17 label on May 21, 2020
deliahu added the v0.20 label and removed the v0.19 label on Aug 27, 2020
deliahu self-assigned this on Aug 27, 2020
deliahu removed their assignment on Sep 17, 2020
deliahu removed the v0.20 label on Sep 18, 2020
deliahu closed this as completed on Nov 10, 2020
deliahu changed the title from "Intermittent failed requests during rolling updates (and possibly scale ups)" to "Intermittent failed requests during rolling updates" on Nov 10, 2020
deliahu added the v0.22 label on Nov 26, 2020
deliahu added this to the v0.22 milestone on Nov 26, 2020