
Crawlee has empty response on some URLs #486

Closed
MrTyton opened this issue Sep 2, 2024 · 8 comments
Labels: bug (Something isn't working); t-tooling (Issues with this label are in the ownership of the tooling team)

MrTyton commented Sep 2, 2024

Hi,

I've got some very basic code trying to scrape a website. However, while Crawlee seems to work fine for most of the site, it just fails on some pages.

Fails:
https://www.mtggoldfish.com/deck/6610848
https://www.mtggoldfish.com/deck/6610848#paper
https://www.mtggoldfish.com/deck/6610847

Succeeds:
https://www.mtggoldfish.com/metagame/standard#paper
https://www.mtggoldfish.com/deck/download/6610848

I'm really not sure why this is happening.

@router.handler("deck")
async def deck_handler(context: BeautifulSoupCrawlingContext):
    context.log.info(f"Deck handler: {context.request.url}")
    deck_id = context.request.url.split("/")[-1].replace("#paper", "")
    context.log.info(f"Soup: {context.soup}")
    deck_information = context.soup.find(
        "p", class_="deck-container-information"
    )
    context.log.info(f"Deck information: {deck_information}")
Log output:

[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO  Deck handler: https://www.mtggoldfish.com/deck/6610848#paper
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO  Soup: 
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO  Deck information: None

All other pages seem to work, so I'm not sure why it fails on these ones.

Not really sure how to debug this either, since it isn't throwing any errors up to that point.

Thanks.

github-actions bot added the t-tooling (Issues with this label are in the ownership of the tooling team) label Sep 2, 2024
janbuchar (Collaborator) commented Sep 2, 2024

Hello, by checking the value of context.http_response.status_code, I found that the server returns the 406 (Not Acceptable) HTTP status code, along with an empty body. You can verify this by running curl -vvv https://www.mtggoldfish.com/deck/6610848.

That explains why the selector doesn't match. In such cases, we should probably throw an exception, or at least trigger a warning - we'll look into that.
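(A minimal sketch of that check, hedged, written against the handler from the original post rather than an official recipe:)

    @router.handler("deck")
    async def deck_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Log the raw HTTP status before touching the soup; the failing URLs report 406 here.
        context.log.info(f"{context.request.url} -> {context.http_response.status_code}")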

Now, to fix this, you can use the CurlImpersonateHttpClient:

  1. run poetry add crawlee[beautifulsoup,curl-impersonate] to install curl-impersonate
  2. update the code that initializes your crawler to use the alternative HTTP client:

    from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient
    # ...
    crawler = BeautifulSoupCrawler(
        http_client=CurlImpersonateHttpClient(),
        # ...
    )
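(For reference, a minimal end-to-end sketch of the fix above; hedged: the module paths follow the log lines in this issue and the router wiring is an assumption based on the original post, both may differ between crawlee versions.)

    import asyncio

    from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
    from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient
    from crawlee.router import Router

    router = Router[BeautifulSoupCrawlingContext]()

    # Registered as the default handler for brevity; the original post uses a "deck" label instead.
    @router.default_handler
    async def deck_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f"Deck handler: {context.request.url}")
        deck_information = context.soup.find("p", class_="deck-container-information")
        context.log.info(f"Deck information: {deck_information}")

    async def main() -> None:
        crawler = BeautifulSoupCrawler(
            request_handler=router,
            http_client=CurlImpersonateHttpClient(),
        )
        await crawler.run(["https://www.mtggoldfish.com/deck/6610848"])

    if __name__ == "__main__":
        asyncio.run(main())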

janbuchar added the bug (Something isn't working) label Sep 2, 2024
janbuchar (Collaborator) commented

The JS version of Crawlee marks the request as failed without retries when it encounters a 4xx response. We will make this consistent.

MrTyton (Author) commented Sep 2, 2024

Awesome, thank you. Minor correction though, it seems to be from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient

janbuchar (Collaborator) commented

> Awesome, thank you. Minor correction though, it seems to be from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient

Fixed, thanks!

MrTyton (Author) commented Sep 2, 2024

A few things as I mess with this some more.

Using the CurlImpersonateHttpClient adds this warning message on program start, and it doesn't go away even if I add the call that it asks for:

    asyncio.set_event_loop_policy(WindowsSelectorEventLoopPolicy())
    await crawler.run(["https://www.mtggoldfish.com/metagame/modern#paper"])

.venv\Lib\site-packages\curl_cffi\aio.py:137: RuntimeWarning:
    Proactor event loop does not implement add_reader family of methods required.
    Registering an additional selector thread for add_reader support.
    To avoid this warning use:
        asyncio.set_event_loop_policy(WindowsSelectorEventLoopPolicy())
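(A hedged note: asyncio.set_event_loop_policy() only affects event loops created after the call, so calling it inside an already-running coroutine, as in the snippet above, has no effect. A minimal sketch of setting it before asyncio.run():)

    import asyncio
    import sys

    async def main() -> None:
        ...  # build the crawler and await crawler.run([...]) here

    if __name__ == "__main__":
        if sys.platform == "win32":
            # Must run before the event loop is created, i.e. before asyncio.run().
            asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
        asyncio.run(main())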

Secondly, it only partially resolves the issue. It scrapes a few hundred items and then starts to fail in the same way:

┌─────────────────────────────┬────────────────────────────┐
│ requests_finished           │ 388                        │
│ requests_failed             │ 1465                       │
│ retry_histogram             │ [388, 0, 0, 0, 0, 0, 0, 0, │
│                             │ 0, 1465]                   │
│ request_avg_failed_duration │ 0.06604                    │
│ request_avg_finished_durat… │ 0.769383                   │
│ requests_finished_per_minu… │ 154                        │
│ requests_failed_per_minute  │ 583                        │
│ request_total_duration      │ 395.268903                 │
│ requests_total              │ 1853                       │
│ crawler_runtime             │ 150.741988                 │
└─────────────────────────────┴────────────────────────────┘

The error, once it starts spitting them out (on the final retry), is the same: the soup is empty. Rerunning it does get more information, so I'm not sure whether it's temporary throttling or not, but looking through the Python docs I don't see a way to throttle the requests to make it smarter about this. I can add a time.sleep(1), but that will make the scrape take forever, and I feel like there should be a more elegant solution.

I can provide more code blocks if needed, but the links that it's failing on are the same style.

MrTyton (Author) commented Sep 2, 2024

Yeah, opening the website in a browser while the scrape is running does give a "Throttled" response.

janbuchar (Collaborator) commented Sep 3, 2024

> Using the CurlImpersonateHttpClient adds this warning message on program start, and it doesn't go away even if I add the call that it asks for

Weird. I opened #495 to track this.

> Secondly, it only partially resolves the issue. It scrapes a few hundred items and then starts to fail in the same way. The error, once it starts spitting them out (on the final retry), is the same: the soup is empty. Rerunning it does get more information, so I'm not sure whether it's temporary throttling or not, but looking through the Python docs I don't see a way to throttle the requests to make it smarter about this. I can add a time.sleep(1), but that will make the scrape take forever, and I feel like there should be a more elegant solution.

You have multiple options here.

  1. use the concurrency_settings parameter of BeautifulSoupCrawler (doc) and set max_tasks_per_minute to something lower than infinity 🙂 (see the sketch after this list)
  2. in your request handler, do something like this:

    if context.http_response.status_code >= 400:
        await asyncio.sleep(1)
        raise RuntimeError("Got throttled")  # this kind of exception should trigger a retry

    # whatever you normally do here
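(A minimal sketch of option 1; hedged: the ConcurrencySettings import path and parameter name are assumptions that may differ between crawlee versions.)

    from crawlee import ConcurrencySettings
    from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
    from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient

    crawler = BeautifulSoupCrawler(
        http_client=CurlImpersonateHttpClient(),
        # Cap the request rate so the target site is less likely to start throttling.
        concurrency_settings=ConcurrencySettings(max_tasks_per_minute=120),
    )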

MrTyton (Author) commented Sep 4, 2024

context doesn't have a response field, it's http_response. That being said, you did put me on the right track: it seems the site isn't actually returning a 429, it just returns another page that says "Throttled".

I ended up just doing this

import random
from time import sleep


class WaitHandler:
    def __init__(self):
        self.seconds = 5
        self.total_times_throttled = 0
        self.time_spent_waiting = 0

    def handle(self, context):
        # Treat a missing soup or a "Throttled" page as a throttling response.
        if (
            context.soup is None
            or "Throttled" in context.soup.text
        ):
            context.log.warning(
                f"Getting Throttled, waiting for {self.seconds} seconds"
            )
            sleep(self.seconds)  # blocking sleep; pauses the whole event loop
            self.time_spent_waiting += self.seconds
            self.total_times_throttled += 1
            context.log.warning("Resuming")
            # Back off a little more each time we get throttled.
            self.seconds += random.randint(5, 10)
            raise RuntimeError(f"Throttled on {context.request.url}")
        self.seconds = 5
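(A hedged usage sketch with names from this thread: the handler above would be instantiated once and called at the top of the request handler, so the RuntimeError feeds into Crawlee's retry logic.)

    wait_handler = WaitHandler()

    @router.handler("deck")
    async def deck_handler(context: BeautifulSoupCrawlingContext):
        wait_handler.handle(context)  # raises on a throttled page, so the request is retried
        # ... normal parsing continues here ...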

Going to close this for now. Thanks for the help!
