Handling of rate-limiting websites #6
429 is a rate-limit error, meaning too many requests. You can configure a delay between requests for that domain:

```rust
let config = CrawlerConfig::default().allow_domain_with_delay(
    "news.ycombinator.com",
    // add a delay between requests
    RequestDelay::Fixed(std::time::Duration::from_millis(3_000)),
);
```
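As a rough illustration of what such a fixed inter-request delay amounts to without the crawler config, here is a sketch using plain `reqwest` and `tokio::time::sleep` (the URL list and function name are placeholders, not part of the crate):

```rust
use std::time::Duration;
use tokio::time::sleep;

// Sketch only: fetch URLs sequentially with a fixed 3-second pause between requests,
// mirroring RequestDelay::Fixed above. Error handling is reduced to `?`.
async fn fetch_with_fixed_delay(urls: &[&str]) -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();
    for url in urls {
        let resp = client.get(*url).send().await?;
        println!("{} -> {}", url, resp.status());
        // wait before issuing the next request
        sleep(Duration::from_millis(3_000)).await;
    }
    Ok(())
}
```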
Unfortunately it is not that simple. Through trial and error I found that some websites have multiple levels of rate-limiting. For example, in my case I managed to make ~60 requests with a 2-second delay before getting a 429, and after increasing the delay to 5 seconds I managed to make ~400 requests before getting a 429 again. What if the crawler had pause/resume functionality, so that users can implement their own rate-limiting? Would that be acceptable?
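A rough sketch of the kind of pause/resume gate meant here, built on `tokio::sync::watch`; the `PauseGate` type and its methods are hypothetical names for illustration, not anything the crawler currently exposes:

```rust
use tokio::sync::watch;

/// Hypothetical pause/resume gate: requests call `wait_if_paused` before sending,
/// and a controller flips the flag when a 429 (or other signal) is observed.
pub struct PauseGate {
    tx: watch::Sender<bool>,
    rx: watch::Receiver<bool>,
}

impl PauseGate {
    pub fn new() -> Self {
        let (tx, rx) = watch::channel(false);
        PauseGate { tx, rx }
    }

    pub fn pause(&self) {
        let _ = self.tx.send(true);
    }

    pub fn resume(&self) {
        let _ = self.tx.send(false);
    }

    /// Waits (asynchronously) while the gate is paused.
    pub async fn wait_if_paused(&self) {
        let mut rx = self.rx.clone();
        while *rx.borrow() {
            // wait for the next state change, then re-check the flag
            if rx.changed().await.is_err() {
                break;
            }
        }
    }
}
```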
I see, the
I ended up using `governor` and implemented my own rate limiting.

This is my crawl async function:

```rust
// `Lazy` is `once_cell::sync::Lazy` (or equivalent); `Client` is `reqwest::Client`,
// and `State` is an application-defined enum carried through the crawler.
async fn crawl_url(
    client: Arc<Client>,
    url: &str,
    state: State,
) -> Result<(reqwest::Response, Option<State>)> {
    // one shared limiter for all requests
    static LIMITER: Lazy<Limiter> = Lazy::new(|| Limiter::new());
    loop {
        // wait until the limiter allows another request
        LIMITER.until_ready().await;
        let resp = client.get(url).send().await?;
        if resp.status().as_u16() == 429 {
            // honor the Retry-After header (delta-seconds form) if present
            if let Some(retry_after) = resp
                .headers()
                .get("retry-after")
                .and_then(|v| v.to_str().ok())
                .and_then(|s| s.parse().ok())
            {
                let dur = Duration::from_secs(retry_after);
                LIMITER.update_sleep_for(dur).await;
            }
            continue;
        }
        return Ok((resp, Some(state)));
    }
}
```

This is how I inject it:

```rust
crawler.crawl(|client| {
    crawl_url(
        client,
        &format!("https://example.com/?p={}", page + 1),
        State::Page(page + 1),
    )
});
```

And this is my rate-limiter that handles 429:

```rust
use governor::clock::{Clock, MonotonicClock};
use governor::state::direct::NotKeyed;
use governor::state::InMemoryState;
use governor::{Quota, RateLimiter};
use std::time::Duration;
use tokio::sync::RwLock;
use tokio::time::{sleep, sleep_until, Instant};

pub struct Limiter {
    rate_limiter: RateLimiter<NotKeyed, InMemoryState, MonotonicClock>,
    // earliest point in time at which the next request may be sent
    sleep_until: RwLock<Instant>,
    clock: MonotonicClock,
}

impl Limiter {
    pub fn new() -> Self {
        let clock = MonotonicClock::default();
        // at most one request per second
        let quota = Quota::with_period(Duration::from_secs(1)).unwrap();
        let rate_limiter = RateLimiter::direct_with_clock(quota, &clock);
        Limiter {
            rate_limiter,
            sleep_until: RwLock::new(Instant::now()),
            clock,
        }
    }

    pub async fn update_sleep_until(&self, tm: Instant) {
        let mut sleep_until = self.sleep_until.write().await;
        if *sleep_until < tm {
            *sleep_until = tm;
        }
    }

    pub async fn update_sleep_for(&self, dur: Duration) {
        let tm = Instant::now() + dur;
        self.update_sleep_until(tm).await;
    }

    pub async fn until_ready(&self) {
        loop {
            // first honor any Retry-After deadline...
            let tm = self.sleep_until.read().await.clone();
            if Instant::now() < tm {
                sleep_until(tm).await;
            }
            // ...then wait for the regular quota
            match self.rate_limiter.check() {
                Ok(()) => break,
                Err(not_until) => {
                    let dur = not_until.wait_time_from(self.clock.now());
                    sleep(dur).await;
                }
            }
        }
    }
}
```
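For completeness, a minimal sketch of driving the `Limiter` above by hand, outside the crawler; the URL and the 30-second fallback back-off are placeholders:

```rust
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let limiter = Limiter::new();
    let client = reqwest::Client::new();

    for page in 1..=5 {
        // wait for the quota and for any Retry-After deadline
        limiter.until_ready().await;
        let resp = client
            .get(format!("https://example.com/?p={}", page))
            .send()
            .await?;
        if resp.status().as_u16() == 429 {
            // fallback back-off when no Retry-After header was parsed
            limiter
                .update_sleep_for(std::time::Duration::from_secs(30))
                .await;
        }
    }
    Ok(())
}
```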
I have a case where a webpage is behind Cloudflare, and after a number of requests I get a 429 error. The response contains a Retry-After header whose value is the number of seconds to wait before sending the next request. Currently, is there a way to handle this?
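A minimal sketch of extracting that header from a `reqwest::Response`, handling only the delta-seconds form of Retry-After; the helper name is illustrative, not part of the crawler crate:

```rust
use std::time::Duration;

// Hypothetical helper: return how long a 429 response asks us to wait, if the
// Retry-After header is present and given as a number of seconds.
fn retry_after(resp: &reqwest::Response) -> Option<Duration> {
    if resp.status().as_u16() != 429 {
        return None;
    }
    resp.headers()
        .get("retry-after")
        .and_then(|v| v.to_str().ok())
        .and_then(|s| s.parse::<u64>().ok())
        .map(Duration::from_secs)
}
```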