
Client retries for API-server errors #72

Closed
gregjones opened this issue Nov 5, 2019 · 3 comments · Fixed by #73

Comments

@gregjones
Contributor

We've been seeing issues where the API-server returns errors when fiaas-deploy-daemon is deploying things:

  1. Rate-limiting errors when bulk-deploying new fiaas-deploy-daemon configs for a number of namespaces
  2. Errors deploying ingresses during deploys of ingress-controllers by the cluster operators

It seems like having some retry-with-backoff in these situations would be helpful. One open question is what level this should live at, but I think the simplest implementation is in the HTTP client used here: with requests, it can be as simple as supplying a retry config that enumerates the status codes to retry on. Something like this, in k8s.client:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry_statuses = [requests.codes.too_many_requests,
                  requests.codes.internal_server_error,
                  requests.codes.bad_gateway,
                  requests.codes.service_unavailable,
                  requests.codes.gateway_timeout]
retries = Retry(total=10, backoff_factor=1, status_forcelist=retry_statuses, method_whitelist=False)
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

This will retry the listed statuses, on all HTTP methods, up to 10 times with exponential backoff:
the first retry is immediate, then 2s, 4s, 8s, and so on, capped at the library's default maximum of 120s.
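For reference, the delay schedule described above can be reproduced from urllib3's documented backoff formula (backoff_factor * 2 ** (n - 1), with the first retry immediate and delays capped at the backoff maximum). This is a standalone sketch of the formula, not the library code itself:

```python
# Sketch of urllib3's Retry backoff schedule (pre-2.x behaviour): the
# first retry is immediate, retry n waits backoff_factor * 2**(n - 1)
# seconds, and delays are capped at the library's maximum (120s default).
def backoff_delays(backoff_factor, retries, max_backoff=120):
    delays = []
    for n in range(1, retries + 1):
        if n == 1:
            delays.append(0)  # first retry happens immediately
        else:
            delays.append(min(max_backoff, backoff_factor * 2 ** (n - 1)))
    return delays

print(backoff_delays(1, 10))
# -> [0, 2, 4, 8, 16, 32, 64, 120, 120, 120]
```

With backoff_factor=1 and total=10 this gives roughly six minutes of total waiting before the request finally fails, which seems a reasonable upper bound for a deploy.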

Does this seem reasonable?

@oyvindio
Member

oyvindio commented Nov 5, 2019

I think this makes a lot of sense, and would be useful to have in the client. There is a document essentially suggesting this behavior (exponential backoff) for many 500 range / server failure modes: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#error-codes

For the error types you've included in the example above, I think it is fine to implement the retry behavior in the client.
As a side note, there is already some retry behaviour for 409 Conflict responses in fiaas-deploy-daemon, but it is implemented on top of the client calls. That is a client/concurrency error, though, so it may make sense to keep handling it explicitly.
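The shape of that layered 409 handling can be sketched as a wrapper around the client call. This is purely illustrative (ConflictError, the attempt count, and the delay are hypothetical names, not fiaas-deploy-daemon's actual implementation):

```python
import time

# Hypothetical sketch of retrying 409 Conflict above the HTTP client.
# A real caller would typically re-read the resource (to pick up the
# current resourceVersion) before each retry.
class ConflictError(Exception):
    pass

def retry_on_conflict(fn, attempts=3, delay=0.0):
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except ConflictError:
                if attempt == attempts - 1:
                    raise  # give up after the final attempt
                time.sleep(delay)
    return wrapper
```

Keeping this above the client makes sense because a 409 usually requires the caller to refresh its view of the resource, which a blind HTTP-level retry cannot do.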

@oyvindio
Member

oyvindio commented Nov 5, 2019

Rate-limiting errors when bulk-deploying new fiaas-deploy-daemon configs for a number of namespaces

The document I linked above suggests that clients "Read the Retry-After HTTP header from the response, and wait at least that long before retrying" when being rate-limited with 429 Too Many Requests.

@gregjones
Contributor Author

Ok. From the docs, it looks like it respects Retry-After by default for 429 statuses; I'll double-check that the other options don't interfere with that default behaviour.
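A quick way to confirm that default (assuming urllib3 >= 1.25, where these attributes exist) is to inspect the Retry object directly:

```python
from urllib3.util.retry import Retry

# Sanity check: Retry honours the Retry-After header by default,
# and 429 is among the statuses it applies that header to.
retry = Retry(total=10, backoff_factor=1)
print(retry.respect_retry_after_header)       # defaults to True
print(429 in Retry.RETRY_AFTER_STATUS_CODES)  # Retry-After applies to 429
```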

gregjones pushed a commit that referenced this issue Nov 6, 2019
This uses the urllib3 Retry to add retries with back-off to requests
to the API-server that error. It will retry errors with the listed
statuses, for all HTTP methods. If the response contains a Retry-After
header (which we expect to be the case for 429/Too Many Requests), the
delay it specifies will be respected. For other cases, it will back
off exponentially (after one immediate retry, doubling) up to 10 times,
with the library's maximum delay (120s).

Fixes #72
gregjones added a commit that referenced this issue Nov 6, 2019
gregjones added a commit that referenced this issue Nov 6, 2019