
429 Too Many Requests #367

Closed
dcroote opened this issue Oct 22, 2021 · 14 comments
Labels: enhancement (New feature or request) · good first issue (Good for newcomers) · help wanted (Extra attention is needed)

Comments

@dcroote
Contributor

dcroote commented Oct 22, 2021

We've intermittently seen "HTTP status client error (429 Too Many Requests) for url" errors for github.com URLs, even when using the GITHUB_TOKEN env variable and reducing concurrency to 10 threads. A few examples of Action runs where this occurred: 1, 2, 3, 4, and here is the workflow itself. Note the workflow uses a cached installation of lychee installed via cargo from a specific commit sha, rather than the v0.7.1 marketplace Action, because of additional functionality that has been added since that release.

I think the ideal solution would be as described in #348 to avoid hitting the same url repeatedly from different files (which is indeed the case with the urls causing failures in the Actions runs above), but I can see this is complex (#349).

Another theoretical idea could be running lychee twice, once with high concurrency using --exclude github.com and again with minimal concurrency and --include github.com.

A third might be identifying github.com URLs and applying different internal handling to how they're requested?

Would love to get your thoughts and also confirm it isn't a usage issue.

Overall though, great tool; it's helpful to be able to check both internal and external URLs + images / img tags quickly.

@MichaIng
Member

I personally simply use -a 429. The error is specific enough to apply that in general, and on every run different GitHub URLs are affected, so all of them still get checked sufficiently regularly 😄.

URL/request caching across files may indeed help, but it does not rule out this issue completely, as long as GitHub applies a rate limit even with a token.

@mre
Member

mre commented Oct 26, 2021

The only proper solution is to understand and respect rate-limiting mechanisms for different websites.
GitHub sends rate-limiting headers with every response:

$ curl -I https://api.github.com/users/octocat/orgs

> HTTP/2 200
> Server: nginx
> Date: Fri, 12 Oct 2012 23:33:14 GMT
> Content-Type: application/json; charset=utf-8
> ETag: "a00049ba79152d03380c34652f2cb612"
> X-GitHub-Media-Type: github.v3
> X-RateLimit-Limit: 5000
> X-RateLimit-Remaining: 4987
> X-RateLimit-Reset: 1350085394
> Content-Length: 5
> Cache-Control: max-age=0, private, must-revalidate
> X-Content-Type-Options: nosniff

This is not a standard, however. Other services might use different headers and formats. More info here (⚠️ Medium link).
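
Purely as a sketch (not lychee's actual code), turning those GitHub-specific headers into a back-off could look like this in Rust; the function name is made up for illustration:

use std::time::{Duration, SystemTime, UNIX_EPOCH};

// If X-RateLimit-Remaining has hit zero, wait until the epoch timestamp
// given in X-RateLimit-Reset before hitting that host again.
fn wait_until_reset(remaining: u64, reset_epoch_secs: u64) -> Option<Duration> {
    if remaining > 0 {
        return None; // budget left, no need to wait
    }
    let now = SystemTime::now().duration_since(UNIX_EPOCH).ok()?;
    Duration::from_secs(reset_epoch_secs).checked_sub(now)
}

fn main() {
    // Values from the example response above: 4987 requests remaining,
    // so no waiting is required yet.
    assert_eq!(wait_until_reset(4987, 1350085394), None);
}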

A simpler solution is to use a per-website rate-limiting factor. Something like --rate-limit github.com,1, where the first part is a domain (a regular expression really) and the second part is the max number of requests per second. It could be below 1, such as 0.1 for at most 1 request every 10 seconds.
This does not fully solve the problem of course, but it would be a start. There might be a global rate-limiting factor as well, which would be set to infinite in the beginning and could be overwritten. That would probably fix 80% of the problems, at the cost of slowing down execution when many links point to a website that could handle higher concurrency.

If we used tower for the request handling, this would be an optional feature: https://docs.rs/tower/0.4.10/tower/limit/rate/struct.RateLimit.html, which is really cool. However, that is still a long way off, and I'm not even sure if it provides everything else we'd need. Given that we need some way to rate-limit requests per domain at some point anyway, I'd say we could start with that.
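
A minimal sketch of the per-domain rate-limiting factor described above, assuming a hypothetical DomainLimiter that is consulted before every request; the names and the mapping from --rate-limit github.com,1 are illustrative, not lychee's actual internals:

use std::collections::HashMap;
use std::thread::sleep;
use std::time::{Duration, Instant};

// Enforces a minimum interval between requests to the same domain.
struct DomainLimiter {
    intervals: HashMap<String, Duration>, // e.g. "github.com" -> 1s for 1 req/s
    last_hit: HashMap<String, Instant>,   // when each domain was last requested
}

impl DomainLimiter {
    // `rules` holds (domain, max requests per second) pairs,
    // i.e. what "--rate-limit github.com,1" would parse into.
    fn new(rules: &[(&str, f64)]) -> Self {
        let intervals = rules
            .iter()
            .map(|(domain, per_sec)| (domain.to_string(), Duration::from_secs_f64(1.0 / per_sec)))
            .collect();
        Self { intervals, last_hit: HashMap::new() }
    }

    // Blocks until a request to `domain` is allowed again.
    fn wait(&mut self, domain: &str) {
        if let (Some(interval), Some(last)) = (self.intervals.get(domain), self.last_hit.get(domain)) {
            let elapsed = last.elapsed();
            if elapsed < *interval {
                sleep(*interval - elapsed);
            }
        }
        self.last_hit.insert(domain.to_string(), Instant::now());
    }
}

fn main() {
    let mut limiter = DomainLimiter::new(&[("github.com", 1.0)]);
    for url in ["https://github.com/a", "https://github.com/b"] {
        limiter.wait("github.com"); // at most one github.com request per second
        println!("checking {url}");
        // ... the actual HTTP request would go here ...
    }
}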

@mre
Member

mre commented Oct 26, 2021

This is some code showing how I envision rate limiting to look in the future: https://github.com/james7132/Hourai/blob/4a4b343f237f726ae1d2e2943e0bed58406f2cf6/logger/hourai-feeds/src/reddit/rate_limiter.rs
If anyone wants to give it a shot, post a comment here for help.

@mre
Member

mre commented Oct 26, 2021

See also seanmonstar/reqwest#491

@JAremko

JAremko commented Nov 3, 2021

I test files in batches with separate lychee actions. They run ~2 hours apart from each other. It would be cool if there were an arg that could set the root directory. --base doesn't work in the action for some reason. I ended up encoding it in the file path doc/**/*.html

I think gists should also be tested with tokens. See lycheeverse/lychee-action#54

On the other hand, there might be a way to fix it for good by splitting the link checking process into 2 actions.
The first one collects all the unique URLs (normalizing, removing # etc.) and stores them somehow. Maybe simply output them in a parsable form as part of the CI log. Or maybe in an issue.
The second action goes through this list and tests the links; if it hits a recoverable error, it generates another list and the cycle continues 😁 It would be much less "hacky" if there were a way to schedule an action from an action 🤔
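
For the normalization step mentioned above (dropping # fragments and deduplicating), a rough sketch in Rust using the url crate might look like this; it's just an illustration, not something lychee ships:

use std::collections::BTreeSet;
use url::Url; // external dependency: url = "2"

// Strip the fragment so that .../lychee#readme and .../lychee#install
// collapse into a single URL to check.
fn normalize(raw: &str) -> Option<String> {
    let mut url = Url::parse(raw).ok()?;
    url.set_fragment(None);
    Some(url.to_string())
}

fn main() {
    let found = [
        "https://github.com/lycheeverse/lychee#readme",
        "https://github.com/lycheeverse/lychee#install",
        "https://github.com/lycheeverse/lychee",
    ];
    // The first action would emit this deduplicated list (e.g. into the CI log),
    // and the second action would read it back and check each entry.
    let unique: BTreeSet<String> = found.iter().filter_map(|u| normalize(u)).collect();
    for url in &unique {
        println!("{url}");
    }
}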

@JAremko

JAremko commented Nov 3, 2021

Actually, there is the https://github.com/marketplace/actions/schedule-job-action thingy. But it seems to require a personal token. Wouldn't work for my use case.

@JAremko

JAremko commented Nov 3, 2021

Another solution would be to introduce "URL health". A "healthy" URL is one that has been successfully tested in the last 24 hours and doesn't need to be tested again for some time. A 🤒 URL is one that hit a recoverable error or timed out fewer than X times in a row. And a 💀 URL is everything else.
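
As a purely hypothetical sketch of how such "URL health" states could be modelled (none of these types exist in lychee today):

use std::time::{Duration, SystemTime};

// Hypothetical health states for a previously checked URL.
enum UrlHealth {
    // Successfully checked within the last 24 hours.
    Healthy { last_ok: SystemTime },
    // Recoverable error (e.g. 429) or timeout, fewer than the allowed
    // number of consecutive failures.
    Sick { consecutive_failures: u32 },
    // Exceeded the failure threshold; consider it gone.
    Dead,
}

impl UrlHealth {
    // Decide whether the URL should be re-checked on this run.
    fn needs_check(&self) -> bool {
        match self {
            UrlHealth::Healthy { last_ok } => last_ok
                .elapsed()
                .map(|age| age > Duration::from_secs(24 * 60 * 60))
                .unwrap_or(true),
            UrlHealth::Sick { .. } => true,
            UrlHealth::Dead => false,
        }
    }
}

fn main() {
    let fresh = UrlHealth::Healthy { last_ok: SystemTime::now() };
    assert!(!fresh.needs_check()); // checked just now, skip it
    assert!(UrlHealth::Sick { consecutive_failures: 2 }.needs_check());
    assert!(!UrlHealth::Dead.needs_check());
}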

@MichaIng
Member

MichaIng commented Nov 3, 2021

--base doesn't work in the action for some reason.

Note that this option does not define the root directory for input files but the root path or URL for the contained relative/internal links, like <a href="/subdir/page.html">. For the inputs you hence still need to define file names, a regex, or a URL you want to check links on.

I think gists should also be tested with tokens.

Makes sense, probably all subdomains of github.com should use the token as well.

@JAremko

JAremko commented Nov 3, 2021

Actually, if the link checker had some kind of state shared between runs, I like the idea of storing links and their statuses (health) in an issue for further inspection. Maybe even using such an issue as a config file of sorts.
The action could update that config issue with the list of ignored URLs, add the dates when each URL was last tested, which URLs are on the "temporary outage" list, and which links it will test on the next run. It would be much easier to see what's going on 🤔

@MichaIng Should've RTFM'd 🤕 🌴

@mre
Member

mre commented Nov 4, 2021

Another solution would be to introduce "URL health". A "healthy" URL is one that has been successfully tested in the last 24 hours and doesn't need to be tested again for some time. 🤒

Yeah that is planned. #163

@rigtorp
Contributor

rigtorp commented Nov 23, 2021

Honoring rate limit headers would be a good idea and not too difficult; here's how it's done in pytest-check-links:
https://github.com/jupyterlab/pytest-check-links/blob/master/pytest_check_links/plugin.py#L301

        if response.status_code >= 400:
            if retries and self.sleep(response.headers):
                self.uncache_url(url_no_anchor)
                return self.fetch_with_retries(url, retries=retries - 1)

and self.sleep tries to sleep if the Retry-After header is present (https://github.com/jupyterlab/pytest-check-links/blob/master/pytest_check_links/plugin.py#L243).

Here's the spec for the Retry-After header: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After
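
As an illustration (not lychee's implementation), turning a Retry-After value into a back-off might look like the sketch below; it only handles the delay-seconds form, while the HTTP-date form from the spec would additionally need a date parser:

use std::time::Duration;

// Parse the delay-seconds form of Retry-After, e.g. "retry-after: 60".
fn retry_after_delay(header: Option<&str>) -> Option<Duration> {
    header?.trim().parse::<u64>().ok().map(Duration::from_secs)
}

fn main() {
    // A retry loop would sleep for this duration before re-requesting the URL,
    // decrementing its retry budget like pytest-check-links does above.
    assert_eq!(retry_after_delay(Some("60")), Some(Duration::from_secs(60)));
    assert_eq!(retry_after_delay(None), None);
}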

@dcroote
Contributor Author

dcroote commented Nov 27, 2021

@rigtorp - thanks for the resources. This would be helpful for sites that include a Retry-After header, but unfortunately GitHub has its own set of headers. See the above comment by @mre.

@rigtorp
Contributor

rigtorp commented Dec 1, 2021

@dcroote They do support it; it's a standard header implemented by all the reverse proxies and caches, so it's the more generic solution. The rate limit headers are typically only implemented for APIs, where an API consumer can use them to throttle work items pulled from a queue, for example.

First, to trigger the rate limit:

while true; do curl --head "https://github.com/rigtorp/MPMCQueue"; done

After some time:

HTTP/2 429 
date: Wed, 01 Dec 2021 21:51:55 GMT
server: Varnish
strict-transport-security: max-age=31536000; includeSubdomains; preload
x-content-type-options: nosniff
x-frame-options: deny
x-xss-protection: 1; mode=block
content-security-policy: default-src 'none'; style-src 'unsafe-inline'
content-type: text/html; charset=utf-8
retry-after: 60
content-length: 1474
x-github-request-id: B2AC:C7D8:6E231:76832:61A7EE7B

Here GitHub is asking us to wait 60 seconds.

@mre
Member

mre commented Mar 28, 2022

Current status for people coming in just now:

  • Response caching was implemented in Response caching? #163. Enable it with --cache.
  • If you're getting 429s on GitHub, check that you have a GITHUB_TOKEN set. This can significantly improve the situation by allowing way more requests.
  • You can exclude domains with --exclude pattern, e.g. --exclude github.com.
  • You can reduce the number of concurrent requests to avoid running into the rate-limiter: --max-concurrency.
  • You can run lychee multiple times on a subset of the requests as @JAremko described above.
  • I wrote a crate for parsing rate limit headers: https://github.com/mre/rate-limits. Planning to use it in combination with Look into tower as a replacement for deadpool + channels #36 to respect the server settings.

With that, I guess we can close this and focus on the open issues for the missing functionality.
