GoComics download limiting #90
Comments
Yeah, I'm aware of the issue. It doesn't happen with "small" batches, but after a certain number of requests, everything else fails. We already have an asymptotic back-off if a request fails, but maybe we are still too aggressive for GoComics... (We had a random pause option once; maybe it's time to bring that back.)
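For readers unfamiliar with the idea, retrying with a growing pause can be sketched roughly as follows. This is not the project's actual back-off code, which is not shown in this thread; `fetch_with_backoff`, its parameters, and the doubling schedule with a cap are all assumptions made for illustration.

```python
import time


def fetch_with_backoff(fetch, url, retries=5, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call fetch(url); on a network error, wait and retry.

    Hypothetical helper: the wait doubles after each failure
    (base_delay, 2*base_delay, ...) and never exceeds max_delay.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except OSError:  # ConnectionError and friends are OSError subclasses
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

Note that a back-off of this kind only reacts after a request has already failed; it does not slow down a run of successful requests, which is what seems to trigger the block here.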
I've played around with this a bit and found that a 2-3 second delay between requests is enough to keep GoComics from blocking. I'm not sure, though, where best to put the delay so that it applies after every HTTP request. There is already a time.sleep line in scraper.py, but delays are also needed when checking for an existing comic, not just after downloading.

While I was in the code, I also modified director.py so that the -n parameter can be used together with the -c parameter, by removing the trailing "or self.options.cont" from line 98. This lets me download comics until either an existing one is found or the maximum number given by -n is reached, which prevents a possible runaway massive download. I always thought that was what the -n option did, but it never worked for me because I always used -c at the same time. |
This allows fetching "all" comics (or catch up until the last existing one) while setting an upper bound on how many pages to fetch at the same time.
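The combined stopping rule described above might look something like this. It is only a sketch and not the code in director.py: `pages_to_fetch`, `pages`, `already_downloaded`, and `max_pages` are hypothetical stand-ins for the page iterator, the existing-comic check behind -c, and the value passed to -n.

```python
def pages_to_fetch(pages, already_downloaded, max_pages=None):
    """Yield pages, newest first, until a stop condition is hit.

    Stops at the first page that already exists (the -c behaviour)
    or once max_pages pages have been yielded (the -n bound),
    whichever comes first. max_pages=None means no upper bound.
    """
    for count, page in enumerate(pages):
        if max_pages is not None and count >= max_pages:
            break  # upper bound reached
        if already_downloaded(page):
            break  # caught up with what is already on disk
        yield page
```

For example, with pages `["p5", "p4", "p3", "p2"]` and `p3` already on disk, only `p5` and `p4` are yielded; with nothing on disk and `max_pages=3`, the generator stops after `p3`.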
I hope 66f154f fixes this. If you run into the problem again, try adjusting the parameters of the add_throttle call. The parameters are the minimum and maximum time between requests to a host; the actual time between requests is randomized between those bounds...
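To illustrate what a randomized per-host throttle of this kind does, here is a minimal sketch. It is not the implementation from 66f154f, and it does not reproduce the real add_throttle signature; `HostThrottle`, `wait`, and the injectable `clock` and `sleep` arguments are assumptions for illustration. The 2-3 second range in the usage note comes from the delay reported earlier in this thread.

```python
import random
import time
from urllib.parse import urlparse


class HostThrottle:
    """Hypothetical helper: keep a randomized gap between requests to one host."""

    def __init__(self, min_delay, max_delay,
                 clock=time.monotonic, sleep=time.sleep):
        if min_delay < 0 or max_delay < min_delay:
            raise ValueError("need 0 <= min_delay <= max_delay")
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest start time of next request

    def wait(self, url):
        """Block until a request to url's host is allowed; return seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        pause = max(0.0, self._next_allowed.get(host, now) - now)
        if pause > 0:
            self._sleep(pause)
        # Pick a fresh random gap for the request after this one.
        gap = random.uniform(self.min_delay, self.max_delay)
        self._next_allowed[host] = now + pause + gap
        return pause
```

Calling `throttle.wait(url)` before every HTTP request, including the ones that only check for an existing comic, would apply the delay in one place, for instance with `throttle = HostThrottle(2.0, 3.0)`.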
It appears that gocomics.com has imposed some kind of rate limiting on downloads from their website. My daily download this morning failed, and when I tried again this evening, it started failing midway through. I verified this by running this Bash script on a different machine:
for c in {1..600};do wget "http://www.gocomics.com" -O /dev/null;done
After about 50 successful downloads of the index page, the server refuses all further connections.
I'm not sure what can be done about this, or whether inserting a delay between downloads is enough to work around the rate limit.