
Client retries for API-server errors #72

Closed
gregjones opened this issue Nov 5, 2019 · 3 comments · Fixed by #73

Comments

@gregjones
Contributor

We've been seeing issues where the API-server returns errors when fiaas-deploy-daemon is deploying things:

  1. Rate-limiting errors when bulk-deploying new fiaas-deploy-daemon configs for a number of namespaces
  2. Errors deploying ingresses during deploys of ingress-controllers by the cluster operators

It seems like having some retry-with-backoff in these situations would be helpful. One open question is what level this should live at, but I think the simplest implementation is in the HTTP client used here: with requests, it can be as simple as supplying a retry config that enumerates the status codes to retry on. Something like this, in k8s.client:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry_statuses = [requests.codes.too_many_requests,
                  requests.codes.internal_server_error,
                  requests.codes.bad_gateway,
                  requests.codes.service_unavailable,
                  requests.codes.gateway_timeout]
retries = Retry(total=10, backoff_factor=1, status_forcelist=retry_statuses, method_whitelist=False)
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

This will retry the listed statuses, on all HTTP methods, up to 10 times with exponential backoff:
the first retry is immediate, then 2s, 4s, 8s, and so on, capped at the library's default maximum of 120s.
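For reference, the delay schedule described above can be reproduced from urllib3's documented backoff formula (backoff_factor * 2 ** (n - 1), with the first retry immediate and delays capped at the backoff maximum). This is a standalone sketch of the formula, not the library code itself:

```python
# Sketch of urllib3's Retry backoff schedule (pre-2.x behaviour): the
# first retry is immediate, retry n waits backoff_factor * 2**(n - 1)
# seconds, and delays are capped at the library's maximum (120s default).
def backoff_delays(backoff_factor, retries, max_backoff=120):
    delays = []
    for n in range(1, retries + 1):
        if n == 1:
            delays.append(0)  # first retry happens immediately
        else:
            delays.append(min(max_backoff, backoff_factor * 2 ** (n - 1)))
    return delays

print(backoff_delays(1, 10))
# -> [0, 2, 4, 8, 16, 32, 64, 120, 120, 120]
```

With backoff_factor=1 and total=10 this gives roughly six minutes of total waiting before the request finally fails, which seems a reasonable upper bound for a deploy.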

Does this seem reasonable?

@oyvindio
Member

oyvindio commented Nov 5, 2019

I think this makes a lot of sense, and would be useful to have in the client. There is a document essentially suggesting this behavior (exponential backoff) for many 500 range / server failure modes: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#error-codes

For the error types you've included in the example above, I think it is fine to implement the retry behavior in the client.
As a side note, there is already some retry behaviour for 409 Conflict responses in fiaas-deploy-daemon, but it is implemented on top of the client calls. That is a client/concurrency error, though, so it may make sense to keep handling it explicitly.
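The shape of that layered 409 handling can be sketched as a wrapper around the client call. This is purely illustrative (ConflictError, the attempt count, and the delay are hypothetical names, not fiaas-deploy-daemon's actual implementation):

```python
import time

# Hypothetical sketch of retrying 409 Conflict above the HTTP client.
# A real caller would typically re-read the resource (to pick up the
# current resourceVersion) before each retry.
class ConflictError(Exception):
    pass

def retry_on_conflict(fn, attempts=3, delay=0.0):
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except ConflictError:
                if attempt == attempts - 1:
                    raise  # give up after the final attempt
                time.sleep(delay)
    return wrapper
```

Keeping this above the client makes sense because a 409 usually requires the caller to refresh its view of the resource, which a blind HTTP-level retry cannot do.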

@oyvindio
Member

oyvindio commented Nov 5, 2019

Rate-limiting errors when bulk-deploying new fiaas-deploy-daemon configs for a number of namespaces

The document I linked above suggests that clients "Read the Retry-After HTTP header from the response, and wait at least that long before retrying" when being rate-limited with 429 Too Many Requests.

@gregjones
Contributor Author

Ok. From the docs, it looks like it respects Retry-After by default for 429 statuses; I'll double-check that the other options don't interfere with that default behaviour.
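A quick way to confirm that default (assuming urllib3 >= 1.25, where these attributes exist) is to inspect the Retry object directly:

```python
from urllib3.util.retry import Retry

# Sanity check: Retry honours the Retry-After header by default,
# and 429 is among the statuses it applies that header to.
retry = Retry(total=10, backoff_factor=1)
print(retry.respect_retry_after_header)       # defaults to True
print(429 in Retry.RETRY_AFTER_STATUS_CODES)  # Retry-After applies to 429
```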

gregjones pushed a commit that referenced this issue Nov 6, 2019
This uses the urllib3 Retry to add retries with back-off to requests
to the API-server that error. It will retry errors with the listed
statuses, for all HTTP methods. If the response contains a Retry-After
header (which we expect to be the case for 429/Too Many Requests), the
delay it specifies will be respected. For other cases, it will back
off exponentially (after one immediate retry, doubling) up to 10 times,
with the library's maximum delay (120s).

Fixes #72
gregjones added a commit that referenced this issue Nov 6, 2019
gregjones added a commit that referenced this issue Nov 6, 2019