Intermittent failed requests during rolling updates #814

Closed
deliahu opened this issue Feb 14, 2020 · 0 comments · Fixed by #1526
Labels
bug Something isn't working
Milestone
v0.22
Comments


deliahu commented Feb 14, 2020

Description

As old pods spin down during a rolling update, some requests return 503. Istio hides the more detailed error message; when bypassing Istio with a Service of type: LoadBalancer, the error is:

Post http://a2b34b2aa4ec411eaab8a0a465d27bdf-e1fddee29350e253.elb.us-west-2.amazonaws.com/predict: read tcp 172.31.1.222:42654->54.71.144.207:80: read: connection reset by peer

Reproduction

Create an iris deployment with min and max replicas set to e.g. 2. Run dev/load.go with at least 100 concurrent threads and no delays. Perform a rolling update, and watch for 503 errors (or the connection reset by peer error if using the load balancer service) as the old pods terminate.
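
For reference, a minimal load generator along these lines might look like the sketch below (this is not the actual dev/load.go; the endpoint URL, payload, and request counts are illustrative placeholders):

```go
// Hypothetical load generator: hammer the /predict endpoint with many
// concurrent workers and count 503s / connection resets while a rolling
// update is in progress.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
)

func main() {
	url := "http://<api-endpoint>/predict" // placeholder: replace with the API's endpoint
	payload := []byte(`{"sepal_length": 5.2, "sepal_width": 3.6, "petal_length": 1.4, "petal_width": 0.3}`)

	const workers = 100          // at least 100 concurrent threads
	const requestsPerWorker = 1000

	var total, failed int64
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < requestsPerWorker; j++ {
				atomic.AddInt64(&total, 1)
				resp, err := http.Post(url, "application/json", bytes.NewReader(payload))
				if err != nil {
					// e.g. "read: connection reset by peer" when bypassing Istio
					atomic.AddInt64(&failed, 1)
					fmt.Println("request error:", err)
					continue
				}
				if resp.StatusCode == http.StatusServiceUnavailable {
					atomic.AddInt64(&failed, 1)
				}
				resp.Body.Close()
			}
		}()
	}

	wg.Wait()
	fmt.Printf("total=%d failed=%d\n", total, failed)
}
```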

Relevant info

Possibly related issue

Also, @vishalbollu reported that during large scale-ups that require many new nodes (e.g. 1 -> 200 nodes), some 503 errors are seen. This may or may not share the same root cause; a separate ticket should be created to track it if the fix for this issue doesn't resolve it.

Possibly related issue 2

It seems that deleting an API, deploying it again, waiting for the previous pod to terminate, and then creating a high volume of parallel requests also results in 503 errors. See the networking-debugging branch.

deliahu added the bug (Something isn't working) and v0.15 labels on Feb 14, 2020
deliahu added the v0.16 label and removed the v0.15 label on Mar 9, 2020
deliahu changed the title from "Intermittent failed requests during rolling updates" to "Intermittent failed requests during rolling updates (and possibly scale ups)" on Mar 25, 2020
deliahu added the v0.17 label and removed the v0.16 label on Apr 14, 2020
deliahu removed the v0.17 label on May 21, 2020
deliahu added the v0.20 label and removed the v0.19 label on Aug 27, 2020
deliahu self-assigned this on Aug 27, 2020
deliahu removed their assignment on Sep 17, 2020
deliahu removed the v0.20 label on Sep 18, 2020
deliahu closed this as completed on Nov 10, 2020
deliahu changed the title from "Intermittent failed requests during rolling updates (and possibly scale ups)" to "Intermittent failed requests during rolling updates" on Nov 10, 2020
deliahu added the v0.22 label on Nov 26, 2020
deliahu added this to the v0.22 milestone on Nov 26, 2020