GoComics download limiting #90

barcoboy · 2017-07-07T13:23:44Z

It appears that gocomics.com has imposed some kind of rate limiting on downloads from their website. My daily download failed this morning, and when I tried again this evening, it started failing midway through. I verified this by running this Bash script on a different machine:

for c in {1..600}; do wget "http://www.gocomics.com" -O /dev/null; done

After about 50 successful downloads of the index page, the server refuses all further connections.

Not sure what can be done about this, or if inserting a delay between downloads is enough to work around this rate limit.

TobiX · 2017-07-07T21:36:01Z

Yeah, I'm aware of the issue. It doesn't happen with "small" batches, but after a certain number of requests, everything else fails.

We already have an asymptotic back-off if a request fails, but maybe we are still too aggressive for GoComics... (We had a random pause option once; maybe it's time to bring that back.)

barcoboy · 2017-07-08T17:47:16Z

I've played around with this a bit and found that a 2-3 second delay between requests is enough to keep GoComics from blocking. I'm not sure, though, where best to put the delay so that it applies after every HTTP request. There is already a time.sleep line in scraper.py, but delays are also needed when checking for an existing comic, not just after downloading.
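One way to apply the same delay to every request, including the checks for existing comics, is to route all fetches through a single wrapper. The following is only a minimal sketch of that idea, not code from the project: `DelayedFetcher`, its parameters, and the 2.5 second default are hypothetical, and `fetch` stands in for whatever function actually performs the HTTP request.

```python
import time


class DelayedFetcher:
    """Hypothetical wrapper: spaces consecutive calls by at least `delay` seconds."""

    def __init__(self, fetch, delay=2.5, clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch    # callable that performs the real request
        self._delay = delay    # assumed value, based on the 2-3 s observation
        self._clock = clock    # injectable so the logic can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def __call__(self, url):
        if self._last is not None:
            # Sleep only for the part of the delay that has not elapsed yet.
            remaining = self._delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
        return self._fetch(url)
```

Because the wait happens before each request and not after a download, it does not matter whether the request is a page fetch, an image download, or an existence check.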

While I was in the code, I also modified director.py so that the -n parameter can be used together with the -c parameter, by removing the trailing "or self.options.cont" from line 98. This lets me download comics until either an existing one is found or the maximum number given by -n is reached, which prevents a possible runaway massive download. I always thought that was what the -n option did, but it never worked for me because I would always use -c at the same time.
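The combined behavior described here can be sketched as a loop with two independent stop conditions. This is an illustration only and does not reproduce director.py: the function name, its arguments, and the `already_have` callback are all hypothetical.

```python
def pages_to_fetch(pages, already_have, max_pages=None, cont=False):
    """Return the pages that would be fetched, newest first.

    Hypothetical model of the two options:
      max_pages -- upper bound on pages fetched (like -n); None means no bound
      cont      -- stop at the first page already on disk (like -c)
    """
    fetched = []
    for page in pages:
        if max_pages is not None and len(fetched) >= max_pages:
            break  # upper bound reached
        if cont and already_have(page):
            break  # caught up with what was downloaded earlier
        fetched.append(page)
    return fetched
```

For example, with pages `["p5", "p4", "p3", "p2", "p1"]` and `p2` already downloaded, `cont=True` with `max_pages=2` stops after `p5` and `p4`, while `cont=True` with `max_pages=10` stops when it reaches `p2`.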

This allows fetching "all" comics (or catch up until the last existing one) while setting an upper bound on how many pages to fetch at the same time.

TobiX · 2019-12-03T23:33:15Z

I hope 66f154f fixes this. If you run into the problem again, try adjusting the parameters of the add_throttle call. The parameters are the minimum and maximum time between requests to a host; the actual time between requests is randomized between those bounds.
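For readers who want to see what a randomized per-host throttle can look like, here is a minimal sketch. It is not the code from 66f154f and makes no claim about the real add_throttle signature: the `HostThrottle` class, the `limit_host` and `wait` methods, and the example bounds are all assumptions made for illustration.

```python
import random
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Hypothetical per-host throttle with a randomized gap between requests."""

    def __init__(self, clock=time.monotonic, sleep=time.sleep, rng=random.uniform):
        self._clock = clock
        self._sleep = sleep
        self._rng = rng
        self._limits = {}   # host -> (min_delay, max_delay) in seconds
        self._next_ok = {}  # host -> earliest time the next request may start

    def limit_host(self, host, min_delay, max_delay):
        if not 0 <= min_delay <= max_delay:
            raise ValueError("need 0 <= min_delay <= max_delay")
        self._limits[host] = (min_delay, max_delay)

    def wait(self, url):
        """Block until a request to url's host is allowed; return seconds waited."""
        host = urlsplit(url).hostname
        if host not in self._limits:
            return 0.0  # unthrottled host
        now = self._clock()
        waited = max(0.0, self._next_ok.get(host, now) - now)
        if waited:
            self._sleep(waited)
        lo, hi = self._limits[host]
        # Pick a fresh random gap so requests do not arrive at a fixed rhythm.
        self._next_ok[host] = now + waited + self._rng(lo, hi)
        return waited
```

A caller would register the host once, for example `throttle.limit_host("www.gocomics.com", 2.0, 4.0)`, and call `throttle.wait(url)` before each request. The 2 to 4 second range is only a guess based on the delay reported earlier in this thread.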

TobiX added this to the Release 2.16 milestone Dec 19, 2017

TobiX added a commit that referenced this issue Dec 1, 2019

Allow combining -n with -c or -a (related to #90)

f5a5106

This allows fetching "all" comics (or catch up until the last existing one) while setting an upper bound on how many pages to fetch at the same time.

TobiX added the bug label Dec 3, 2019

TobiX closed this as completed in 66f154f Dec 3, 2019
