Improve handling of Jira's retry-after handling #1825

matthias-bach-by · 2024-02-26T16:49:49Z

As reported in #1805, while 3.6.0 brought a beneficial change in evaluating Jira's retry-after header, in some scenarios it actually degraded the success rate of requests when hitting rate-limits due to two issues:

Jira intends the retry-after header to be a minimum backoff but the code treats it as a maximum backoff, and
Jira sometimes sends a retry-after value of 0 seconds which the existing code implicitly treated as a failed request.

To address this, this PR changes the back-off behaviour to treat the retry-after time as a minimum backoff, at least as long as our maximum back-off time does allow this, and clips the retry-after header, if specified, to a minimum of one second.

I have extended the existing tests for the retry logic to assert this behaviour. If you would prefer a separate set of tests or see other changes to the approach, let me know and I'll adjust the PR.

The time Jira sends in the Retry-After header is the minimum time Jira wants us to wait before retrying our request. However, the former implementation used this as a maximum waiting time for the next request. In result, there was a chance that we reached three retries without reaching the time that Jira expected us to wait and our request would fail. This implementation does also affect the other retry cases, as while previously we jittered our backoff between 0 and the target backoff, we now only jitter between 50% and 100% of the target backoff. However, this should still protect us from thundering herds and safes us from introducing a new minimum backoff variable for the retry-after case. This solves one of the issues reported in pycontribs#1805.

When rejecting request with a 429 response, Jira sometimes sends a Retry-after header asking for a backoff of 0 seconds. With the existing retry logic this would mark the request as non-retryable and thus fail the request. With this change, such requests are treated as if Jira had send a retry-after value of 1 second. This solves one of the issues reported in pycontribs#1805.

matthias-bach-by · 2024-02-27T10:36:39Z

The cloud tests are failing in set-up and this seems to be a wider issue unrelated to my changes as I am also seeing that on other PRs. Happy to rebase once that is fixed on main.

adehad · 2024-03-21T13:11:41Z

Thanks for the excellent PR and sorry for the delay in getting in merged and released!

matthias-bach-by added 2 commits February 26, 2024 17:07

matthias-bach-by had a problem deploying to ack February 26, 2024 16:49 — with GitHub Actions Failure

matthias-bach-by had a problem deploying to cloud February 26, 2024 17:00 — with GitHub Actions Failure

adehad added the bug label Mar 21, 2024

adehad temporarily deployed to ack March 21, 2024 12:59 — with GitHub Actions Inactive

adehad approved these changes Mar 21, 2024

View reviewed changes

adehad linked an issue Mar 21, 2024 that may be closed by this pull request

Regression of 429 response handling in 3.6.0 #1805

Closed

4 tasks

adehad merged commit 4999a76 into pycontribs:main Mar 21, 2024
10 of 12 checks passed

matthias-bach-by deleted the improve-rate-limit-resilience branch March 21, 2024 13:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve handling of Jira's retry-after handling #1825

Improve handling of Jira's retry-after handling #1825

matthias-bach-by commented Feb 26, 2024

matthias-bach-by commented Feb 27, 2024

adehad commented Mar 21, 2024

Improve handling of Jira's retry-after handling #1825

Improve handling of Jira's retry-after handling #1825

Conversation

matthias-bach-by commented Feb 26, 2024

matthias-bach-by commented Feb 27, 2024

adehad commented Mar 21, 2024