Crawlee has empty response on some URLs #486
Comments
Hello, by checking the status code of the response, you can see that the server is returning an error page for those URLs. That explains why the selector doesn't match. In such cases, we should probably throw an exception, or at least trigger a warning - we'll look into that. Now, to fix this, you can check the status code yourself in your request handler and raise an error to force a retry.
The JS version of Crawlee marks the request as failed without retries when it encounters a 4xx response. We will make this consistent.
Awesome, thank you. Minor correction though, it seems to be `status_code`.
Fixed, thanks!
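For reference, a minimal way to see the failing responses yourself. This is a sketch, not code from the thread; the import path and API names follow the crawlee Python docs as I understand them and may differ by version:

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler()

@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    # Log what the server actually returned; a 4xx here explains
    # why the parsed soup comes back empty for some URLs.
    context.log.info(f'{context.request.url} -> {context.http_response.status_code}')

asyncio.run(crawler.run([
    'https://www.mtggoldfish.com/deck/6610848',             # reported as failing
    'https://www.mtggoldfish.com/metagame/standard#paper',  # reported as working
]))
```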
A few things as I try and mess with this some more. First, I'm using the suggested status-code check in my request handler.
Secondly, it seems to only partially resolve the issue. It scrapes a few hundred items and then starts to fail with the same issue, and the error, once it starts spitting them out (on the maximum-retry one), is the same, saying that the soup is empty. Rerunning it does get more information, so I'm not sure if it's a temporary throttling thing or not, but looking through the Python docs I don't see a way to have it throttle the requests to make it smarter about this. I can add in a sleep, but that feels like a workaround. I can provide more code blocks if needed, but the links that it's failing on are the same style.
Yeah, trying to open the website in a browser while the scrape is running does give a "Throttled" response.
Weird. I opened #495 to track this.
You have multiple options here.
```python
import asyncio

if context.http_response.status_code >= 400:
    await asyncio.sleep(1)
    raise RuntimeError("Got throttled")  # this kind of exception should trigger a retry

# whatever you normally do here
```
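Another option, sketched here rather than taken from the thread, is to slow the crawler down so the site stops throttling you in the first place. `ConcurrencySettings` and its parameter names are my reading of the crawlee Python docs; verify against your installed version:

```python
from crawlee import ConcurrencySettings
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

# Keep few requests in flight and cap the request rate so the
# target site is less likely to answer with "Throttled" pages.
crawler = BeautifulSoupCrawler(
    concurrency_settings=ConcurrencySettings(
        max_concurrency=2,        # at most two concurrent requests
        max_tasks_per_minute=60,  # roughly one request per second
    ),
)
```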
I ended up just doing this, roughly:
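A sketch of that approach. Only the status-code check, the sleep, and the retry-triggering exception come from the thread; the handler name, retry limit, selector, and `push_data` payload are illustrative:

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler(max_request_retries=5)

@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    # Back off briefly and force a retry when the site throttles us.
    if context.http_response.status_code >= 400:
        await asyncio.sleep(1)
        raise RuntimeError('Got throttled')

    # Normal scraping logic (placeholder selector).
    title = context.soup.select_one('h1')
    await context.push_data({
        'url': context.request.url,
        'title': title.get_text(strip=True) if title else None,
    })

asyncio.run(crawler.run(['https://www.mtggoldfish.com/deck/6610848']))
```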
Going to close this for now. Thanks for the help!
Hi,
I've got some very basic code trying to scrape a website (roughly the sketch after the URL lists below). However, while Crawlee seems to work fine for most of the site, it just fails for some pages.
Fails:
https://www.mtggoldfish.com/deck/6610848
https://www.mtggoldfish.com/deck/6610848#paper
https://www.mtggoldfish.com/deck/6610847
Succeeds:
https://www.mtggoldfish.com/metagame/standard#paper
https://www.mtggoldfish.com/deck/download/6610848
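The code was essentially this (a reconstructed sketch, since the original block was lost; the selector and field names are assumptions). Note there is no status-code handling, which is what leads to the silent empty-soup failures discussed above:

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # Placeholder selector: on the failing URLs the soup is empty,
        # so this returns None without raising any error.
        title = context.soup.select_one('h1.title')
        await context.push_data({
            'url': context.request.url,
            'title': title.get_text(strip=True) if title else None,
        })

    await crawler.run(['https://www.mtggoldfish.com/deck/6610848'])

asyncio.run(main())
```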
I'm really not sure why this is happening; other pages all seem to work, so I'm unsure why it fails on these specific ones.
I'm also not sure how to debug this, since it doesn't throw any errors when it fails.
Thanks.