429 Too Many Requests #367
Caching URLs/requests across files may indeed help, but it does not rule out this issue completely, as long as GitHub enforces a rate limit even with a token.
The only proper solution is to understand and respect the rate-limiting mechanisms of different websites.
GitHub, for example, reports its limits through its own `X-RateLimit-Limit`, `X-RateLimit-Remaining`, and `X-RateLimit-Reset` response headers. This is not a standard, however; other services might use different headers and formats. A simpler solution is to use a per-website rate-limiting factor (a sketch follows below). If we used tower for the request handling, this would be an optional feature: https://docs.rs/tower/0.4.10/tower/limit/rate/struct.RateLimit.html, which is really cool. However, that is still a long way off and I'm not even sure if it provides everything else we'd need. Given that we need some way to rate limit requests per domain at some point anyway, I'd say we could start with that.
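A minimal sketch of that per-domain idea on top of tower's RateLimit middleware (assumes tower's "limit" and "util" features plus tokio; the service closure is a stand-in, not lychee's actual request code):

```rust
// Sketch only, not lychee's implementation: a rate limit built from tower's
// RateLimit middleware. The closure stands in for a real HTTP call.
use std::time::Duration;
use tower::{limit::RateLimitLayer, service_fn, Service, ServiceBuilder, ServiceExt};

#[tokio::main]
async fn main() {
    // Allow at most 2 requests per second for this (hypothetical) domain.
    let mut svc = ServiceBuilder::new()
        .layer(RateLimitLayer::new(2, Duration::from_secs(1)))
        .service(service_fn(|url: &'static str| async move {
            // A real implementation would issue the request here, e.g. via reqwest.
            Ok::<_, std::convert::Infallible>(format!("checked {url}"))
        }));

    for url in ["https://github.com/a", "https://github.com/b", "https://github.com/c"] {
        // `ready()` suspends until the limiter admits another request,
        // so the third call waits for the next one-second window.
        let res = svc.ready().await.unwrap().call(url).await.unwrap();
        println!("{res}");
    }
}
```

Per-domain limiting would then just mean keeping one such service per host, e.g. in a map keyed by domain.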
This is some code showing how I envision rate limiting to look in the future: https://github.com/james7132/Hourai/blob/4a4b343f237f726ae1d2e2943e0bed58406f2cf6/logger/hourai-feeds/src/reddit/rate_limiter.rs
See also seanmonstar/reqwest#491
I test files in batches with separate lychee actions; they run ~2 hours apart from each other. It would be cool if there was an arg to set the root directory. I think gists should also be tested with tokens, see lycheeverse/lychee-action#54. On the other hand, there might be a way to fix it for good by splitting the link checking process into two actions.
Actually, there is the https://github.com/marketplace/actions/schedule-job-action thingy, but it seems to require a personal token, so it wouldn't work for my use case.
Another solution would be the introduction of "URL health". A "healthy" URL is one that has been successfully tested in the last 24 hours and doesn't need to be tested again for some time. A 🤒 URL is one that has had a recoverable error or timed out fewer than X times in a row. And a 💀 URL is everything else.
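As a hypothetical sketch (not lychee's API), those three states could be modeled like this, with the healthy window and failure threshold as assumed parameters:

```rust
// Hypothetical model of the "URL health" states described above; not part
// of lychee. The 24-hour window comes from the comment.
use std::time::{Duration, SystemTime};

enum Health {
    /// Successfully tested within the "healthy" window; skip for now.
    Healthy { last_ok: SystemTime },
    /// 🤒 Recoverable error or timeout, fewer than X times in a row.
    Sick { consecutive_failures: u32 },
    /// 💀 Everything else.
    Dead,
}

/// A URL only needs re-checking once its healthy window has expired.
fn needs_check(health: &Health, healthy_for: Duration) -> bool {
    match health {
        Health::Healthy { last_ok } => {
            last_ok.elapsed().map(|age| age > healthy_for).unwrap_or(true)
        }
        Health::Sick { .. } | Health::Dead => true,
    }
}

fn main() {
    let just_checked = Health::Healthy { last_ok: SystemTime::now() };
    assert!(!needs_check(&just_checked, Duration::from_secs(24 * 60 * 60)));
    assert!(needs_check(&Health::Sick { consecutive_failures: 1 }, Duration::from_secs(0)));
}
```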
Note that this option does not define the root directory for input files but the root path or URL for the contained relative/internal links, like …
Makes sense, probably all subdomains of …
Actually, if the link checker had some kind of state shared between runs, I like the idea of storing links and their statuses (health) in an issue for further reinspection. Maybe even using such an issue as a config file of sorts. @MichaIng Should've RTFM 🤕 🌴
Yeah, that is planned: #163
Honoring rate limit headers would be a good idea and not too difficult; here's how it's done in pytest-check-links:
Here's the spec for the Retry-After header: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After
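The pytest-check-links snippet itself is elided above; as a rough equivalent, here is a sketch in Rust with reqwest (not lychee's actual code) that honors only the numeric delay-seconds form of the header:

```rust
// Sketch: retry once after honoring a numeric Retry-After value on a 429.
// The header may also carry an HTTP date, and a real implementation would
// cap the wait time and limit the number of retries.
use std::time::Duration;

async fn get_with_retry(
    client: &reqwest::Client,
    url: &str,
) -> reqwest::Result<reqwest::Response> {
    let resp = client.get(url).send().await?;
    if resp.status() != reqwest::StatusCode::TOO_MANY_REQUESTS {
        return Ok(resp);
    }
    // Fall back to 60 seconds (the value GitHub sends below) if the header
    // is missing or not a plain number.
    let wait = resp
        .headers()
        .get(reqwest::header::RETRY_AFTER)
        .and_then(|v| v.to_str().ok())
        .and_then(|s| s.parse::<u64>().ok())
        .unwrap_or(60);
    tokio::time::sleep(Duration::from_secs(wait)).await;
    client.get(url).send().await
}
```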
@rigtorp - thanks for the resources. This would be helpful for sites that include a Retry-After header, but unfortunately GitHub has its own set of headers. See the above comment by @mre.
@dcroote They do support it; it's a standard header implemented by all the reverse proxies and caches, so it's the more generic solution. The rate limit headers are typically only implemented for APIs, where an API consumer can use them to throttle work items pulled from a queue, for example. First, generate some rate limiting:

```sh
while true; do curl --head "https://github.com/rigtorp/MPMCQueue"; done
```

After some time:
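The curl output is elided above; reconstructed from the 60-second wait mentioned next, the relevant part of GitHub's 429 response would look roughly like:

```
HTTP/1.1 429 Too Many Requests
retry-after: 60
```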
Here GitHub is asking to wait 60 seconds.
Current status for people coming in just now: …
With that, I guess we can close this and focus on the open issues for the missing functionality.
We've seen

```
HTTP status client error (429 Too Many Requests) for url
```

for github.com urls intermittently, even when using the `GITHUB_TOKEN` env variable and reducing concurrency to 10 threads. A few examples of Action runs where this occurred: 1, 2, 3, 4, and here is the workflow itself. Note the workflow uses a cached installation of lychee installed via cargo from a specific commit sha rather than the v0.7.1 marketplace Action, because of additional functionality that's been added since that release. I think the ideal solution would be as described in #348, to avoid hitting the same url repeatedly from different files (which is indeed the case with the urls causing failures in the Actions runs above), but I can see this is complex (#349).
Another theoretical idea could be running lychee twice: once with high concurrency using `--exclude github.com`, and again with minimal concurrency using `--include github.com` (a sketch follows below). A third might be identifying github.com urls and applying different internal handling of how they're requested?
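For concreteness, that two-pass run might look like the following sketch; the flags are from lychee's CLI, but their exact semantics here are assumed, not verified:

```sh
# Pass 1: everything except github.com, at high concurrency.
lychee --max-concurrency 128 --exclude 'github\.com' ./**/*.md
# Pass 2: only github.com urls, one at a time.
lychee --max-concurrency 1 --include 'github\.com' --exclude '.*' ./**/*.md
```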
Would love to get your thoughts and also confirm it isn't a usage issue.
Overall though, great tool. It's helpful for quickly checking both internal and external urls plus images / img tags.