Handling of rate-limiting websites #6

Open · oblique opened this issue Aug 16, 2021 · 4 comments

oblique commented Aug 16, 2021

I have a case where a webpage is behind Cloudflare, and after a number of requests I get a 429 error. The response contains a Retry-After header telling me how many seconds to wait before sending the next request.

Is there currently a way to handle this?

mattsse (Owner) commented Aug 16, 2021

429 is a rate-limit error, meaning too many requests.
You can configure a delay between requests:

let config = CrawlerConfig::default().allow_domain_with_delay(
    "news.ycombinator.com",
    // add a delay between requests
    RequestDelay::Fixed(std::time::Duration::from_millis(3_000)),
);

oblique (Author) commented Aug 19, 2021

Unfortunately it is not that simple. Through trial and error I found that some websites have multiple levels of rate-limiting. In my case I managed to make ~60 requests with a 2-second delay before getting a 429; after increasing the delay to 5 seconds I managed to make ~400 requests, but then got a 429 again.

What if the crawler had pause/resume functionality, so the user could implement their own rate-limiting? Would that be acceptable?

mattsse (Owner) commented Aug 20, 2021

I see. The Response contains the status code and response headers, which you can check for rate limiting.
One way to handle custom timeouts when you receive a 429 is with the client.crawl function (see "inject async calls"), which takes an async closure where you can enforce a custom delay, for example with futures::Delay (take a look at the current RequestDelay impl for reference). That way you could add an additional Retry state.
However, this might not be very ergonomic. I'll gladly accept PRs that improve this, maybe an additional schedule function for example.
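
Roughly, a sketch of that idea (sketch only: tokio::time::sleep stands in for futures' Delay, State stands in for your crawler's state type, and imports are abbreviated):

async fn fetch_with_backoff(
    client: Arc<Client>,
    url: &str,
    state: State,
) -> Result<(reqwest::Response, Option<State>)> {
    let mut resp = client.get(url).send().await?;

    while resp.status().as_u16() == 429 {
        // too many requests: wait, then retry the same URL
        tokio::time::sleep(Duration::from_secs(30)).await;
        resp = client.get(url).send().await?;
    }

    Ok((resp, Some(state)))
}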

oblique (Author) commented Aug 20, 2021

I ended up using client.crawl as you mentioned.

This is my crawl async function:

async fn crawl_url(
    client: Arc<Client>,
    url: &str,
    state: State,
) -> Result<(reqwest::Response, Option<State>)> {
    // one shared limiter across all crawl tasks
    static LIMITER: Lazy<Limiter> = Lazy::new(|| Limiter::new());

    loop {
        // respect both the request rate and any pending Retry-After pause
        LIMITER.until_ready().await;

        let resp = client.get(url).send().await?;

        if resp.status().as_u16() == 429 {
            // rate limited: honor the Retry-After header (in seconds) and retry
            if let Some(retry_after) = resp
                .headers()
                .get("retry-after")
                .and_then(|v| v.to_str().ok())
                .and_then(|s| s.parse().ok())
            {
                let dur = Duration::from_secs(retry_after);
                LIMITER.update_sleep_for(dur).await;
            }
            continue;
        }

        return Ok((resp, Some(state)));
    }
}

This is how I inject it:

    crawler.crawl(|client| {
        crawl_url(
            client,
            &format!("https://example.com/?p={}", page + 1),
            State::Page(page + 1),
        )
    });

And this is my rate-limiter that handles 429:

use governor::clock::{Clock, MonotonicClock};
use governor::state::direct::NotKeyed;
use governor::state::InMemoryState;
use governor::{Quota, RateLimiter};
use std::time::Duration;
use tokio::sync::RwLock;
use tokio::time::{sleep, sleep_until, Instant};

pub struct Limiter {
    /// Steady-state rate limiter (at most one request per second).
    rate_limiter: RateLimiter<NotKeyed, InMemoryState, MonotonicClock>,
    /// Shared pause point, raised whenever a 429 with Retry-After arrives.
    sleep_until: RwLock<Instant>,
    clock: MonotonicClock,
}

impl Limiter {
    pub fn new() -> Self {
        let clock = MonotonicClock::default();
        let quota = Quota::with_period(Duration::from_secs(1)).unwrap();
        let rate_limiter = RateLimiter::direct_with_clock(quota, &clock);

        Limiter {
            rate_limiter,
            sleep_until: RwLock::new(Instant::now()),
            clock,
        }
    }

    /// Postpone all requests until `tm`, unless an even later pause is already set.
    pub async fn update_sleep_until(&self, tm: Instant) {
        let mut sleep_until = self.sleep_until.write().await;

        if *sleep_until < tm {
            *sleep_until = tm;
        }
    }

    /// Postpone all requests for `dur` from now (e.g. the Retry-After value).
    pub async fn update_sleep_for(&self, dur: Duration) {
        let tm = Instant::now() + dur;
        self.update_sleep_until(tm).await;
    }

    /// Wait until the shared pause has elapsed and the rate limiter allows another request.
    pub async fn until_ready(&self) {
        loop {
            let tm = *self.sleep_until.read().await;

            if Instant::now() < tm {
                sleep_until(tm).await;
            }

            match self.rate_limiter.check() {
                Ok(()) => break,
                Err(not_until) => {
                    let dur = not_until.wait_time_from(self.clock.now());
                    sleep(dur).await;
                }
            }
        }
    }
}
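
As a side note on the design: the governor quota caps the steady rate at one request per second, while sleep_until is a single shared pause that any task can push forward when it sees Retry-After, so all in-flight tasks back off together. A quick sketch of that behavior (assuming the Limiter above is in scope, a tokio runtime, and once_cell for the Lazy static):

use once_cell::sync::Lazy;
use std::time::Duration;

static LIMITER: Lazy<Limiter> = Lazy::new(|| Limiter::new());

#[tokio::main]
async fn main() {
    // Pretend one worker just received `Retry-After: 10`.
    LIMITER.update_sleep_for(Duration::from_secs(10)).await;

    // Every worker now waits out the same pause before its next request,
    // and afterwards is still throttled to one request per second.
    let handles: Vec<_> = (0..3)
        .map(|i| {
            tokio::spawn(async move {
                LIMITER.until_ready().await;
                println!("worker {i} may send a request now");
            })
        })
        .collect();

    for handle in handles {
        handle.await.unwrap();
    }
}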
