
GoComics download limiting #90

Closed
barcoboy opened this issue Jul 7, 2017 · 3 comments
barcoboy commented Jul 7, 2017

It appears that gocomics.com has implemented some kind of rate limiting on downloads from their website. My daily download failed this morning, and when I tried again this evening, it started failing midway through. I verified this by running this Bash script on a different machine:

for c in {1..600}; do wget "http://www.gocomics.com" -O /dev/null; done

After about 50 successful downloads of the index page, the server refuses all further connections.

Not sure what can be done about this, or if inserting a delay between downloads is enough to work around this rate limit.
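The delay idea above can be sketched with a small throttle helper. This is a minimal illustration, not dosage's actual code; the `Throttle` class and its names are hypothetical.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive requests to a host.

    Hypothetical helper illustrating the "delay between downloads"
    workaround suggested above; dosage's real code may differ.
    """

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep just long enough so consecutive calls are at least
        # min_interval seconds apart.
        now = time.monotonic()
        remaining = self.min_interval - (now - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `throttle.wait()` immediately before each HTTP request (whether downloading a strip or just checking for one) would space all requests out uniformly.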


TobiX commented Jul 7, 2017

Yeah, I'm aware of the issue. It doesn't happen with "small" batches, but after a certain number of requests, every further request fails.

We already have an exponential back-off if a request fails, but maybe we are still too aggressive for GoComics... (We had a random pause option once; maybe it's time to bring that back.)
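The back-off mentioned above could look something like the following sketch. The function name and parameters are illustrative assumptions, not dosage's actual implementation.

```python
import random


def backoff_delays(base=1.0, factor=2.0, max_delay=60.0, retries=5):
    """Yield exponentially growing retry delays with random jitter.

    Illustrative sketch of a back-off policy; not taken from dosage.
    """
    delay = base
    for _ in range(retries):
        # Jitter spreads retries out so many clients don't retry in lockstep.
        yield min(delay, max_delay) * random.uniform(0.5, 1.5)
        delay *= factor
```

A caller would sleep for each yielded delay between retries, giving the server progressively longer breathing room after failures.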


barcoboy commented Jul 8, 2017

I've played around with this a bit and found that a 2-3 second delay between requests is enough to keep GoComics from blocking. I'm not sure, though, where best to put the delay so that it applies after every HTTP request. I see there is already a time.sleep line in scraper.py, but additional delays are needed when checking for an existing comic, not just after downloading.

While I was in the code, I also modified director.py so that the -n parameter can be used together with the -c parameter, by removing the "or self.options.cont" portion at the end of line 98. This lets me download comics until either an existing one is found or the maximum set by -n is reached, preventing a possible runaway mass download. I had always assumed that was what the -n option did, but it never worked for me because I always used -c at the same time.
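The combined stop condition described above can be sketched as a small predicate. The function and parameter names here are hypothetical, chosen to mirror the -n and -c semantics; they are not dosage's actual identifiers.

```python
def should_stop(downloaded, max_strips, found_existing):
    """Decide whether to stop fetching strips.

    Hypothetical illustration of combining -c (stop at the first
    already-downloaded strip) with -n (hard upper bound on fetches).
    """
    if found_existing:
        return True  # -c: caught up to an existing strip
    if max_strips is not None and downloaded >= max_strips:
        return True  # -n: safety cap reached
    return False
```

With both checks active, a catch-up run ends at the first existing strip, but can never exceed the -n cap even if no existing strip is found.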

@TobiX TobiX added this to the Release 2.16 milestone Dec 19, 2017
TobiX added a commit that referenced this issue on Dec 1, 2019:

    This allows fetching "all" comics (or catch up until the last existing
    one) while setting an upper bound on how many pages to fetch at the
    same time.
@TobiX TobiX added the bug label Dec 3, 2019
@TobiX TobiX closed this as completed in 66f154f Dec 3, 2019

TobiX commented Dec 3, 2019

I hope 66f154f fixes this. If you run into the problem again, try adjusting the parameters of the add_throttle call. The parameters are the minimum and maximum time between requests to a host; the actual delay between requests is randomized between those bounds.
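The min/max randomized throttle described above can be sketched as a wrapper around a request function. This is an assumption-laden illustration of the semantics only; dosage's real add_throttle hooks into its HTTP layer and the names below are hypothetical.

```python
import random
import time


def add_throttle_sketch(request_fn, min_delay=2.0, max_delay=3.0):
    """Wrap a request function so each call is preceded by a random pause.

    Only the min/max-bounded randomized delay is taken from the comment
    above; everything else here is a hypothetical sketch.
    """
    def throttled(*args, **kwargs):
        # Pick a fresh delay per request, uniformly within the bounds.
        time.sleep(random.uniform(min_delay, max_delay))
        return request_fn(*args, **kwargs)
    return throttled
```

Randomizing within a range (rather than a fixed pause) avoids a perfectly regular request rhythm, which some rate limiters are more likely to flag.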
