
Crawlee has empty response on some URLs #486

Closed
MrTyton opened this issue Sep 2, 2024 · 8 comments
Labels: bug (Something isn't working); t-tooling (Issues with this label are in the ownership of the tooling team)

MrTyton commented Sep 2, 2024

Hi,

I've got some very basic code trying to scrape a website. However, while Crawlee seems to work fine for most of the site, it just fails on some pages.

Fails:
https://www.mtggoldfish.com/deck/6610848
https://www.mtggoldfish.com/deck/6610848#paper
https://www.mtggoldfish.com/deck/6610847

Succeeds:
https://www.mtggoldfish.com/metagame/standard#paper
https://www.mtggoldfish.com/deck/download/6610848

I'm really not sure why this is happening.

@router.handler("deck")
async def deck_handler(context: BeautifulSoupCrawlingContext):
    context.log.info(f"Deck handler: {context.request.url}")
    deck_id = context.request.url.split("/")[-1].replace("#paper", "")
    context.log.info(f"Soup: {context.soup}")
    deck_information = context.soup.find(
        "p", class_="deck-container-information"
    )
    context.log.info(f"Deck information: {deck_information}")
Log output:

[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO  Deck handler: https://www.mtggoldfish.com/deck/6610848#paper
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO  Soup: 
[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO  Deck information: None

All other pages seem to work, so I'm not sure why it fails on these ones.

Not really sure how to debug this either, since it isn't throwing any errors up to that point.

Thanks.

github-actions bot added the t-tooling (Issues with this label are in the ownership of the tooling team) label Sep 2, 2024
janbuchar (Collaborator) commented Sep 2, 2024

Hello, by checking the value of context.http_response.status_code, I found that the server returns the 406 (Not Acceptable) HTTP status code, along with an empty body. You can verify this by running curl -vvv https://www.mtggoldfish.com/deck/6610848.

That explains why the selector doesn't match. In such cases, we should probably throw an exception, or at least trigger a warning - we'll look into that.
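(A minimal sketch of that check, hedged, written against the handler from the original post rather than an official recipe:)

    @router.handler("deck")
    async def deck_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Log the raw HTTP status before touching the soup; the failing URLs report 406 here.
        context.log.info(f"{context.request.url} -> {context.http_response.status_code}")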

Now, to fix this, you can use the CurlImpersonateHttpClient:

  1. run poetry add crawlee[beautifulsoup,curl-impersonate] to install curl-impersonate
  2. update the code that initializes your crawler to use the alternative HTTP client:

    from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient
    # ...
    crawler = BeautifulSoupCrawler(
        http_client=CurlImpersonateHttpClient(),
        # ...
    )
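(For reference, a minimal end-to-end sketch of the fix above; hedged: the module paths follow the log lines in this issue and the router wiring is an assumption based on the original post, both may differ between crawlee versions.)

    import asyncio

    from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
    from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient
    from crawlee.router import Router

    router = Router[BeautifulSoupCrawlingContext]()

    # Registered as the default handler for brevity; the original post uses a "deck" label instead.
    @router.default_handler
    async def deck_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f"Deck handler: {context.request.url}")
        deck_information = context.soup.find("p", class_="deck-container-information")
        context.log.info(f"Deck information: {deck_information}")

    async def main() -> None:
        crawler = BeautifulSoupCrawler(
            request_handler=router,
            http_client=CurlImpersonateHttpClient(),
        )
        await crawler.run(["https://www.mtggoldfish.com/deck/6610848"])

    if __name__ == "__main__":
        asyncio.run(main())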

janbuchar added the bug (Something isn't working) label Sep 2, 2024
janbuchar (Collaborator) commented

The JS version of Crawlee marks the request as failed without retries when it encounters a 4xx response. We will make this consistent.

MrTyton (Author) commented Sep 2, 2024

Awesome, thank you. Minor correction though, it seems to be from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient

janbuchar (Collaborator) commented

> Awesome, thank you. Minor correction though, it seems to be from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient

Fixed, thanks!

MrTyton (Author) commented Sep 2, 2024

A few things as I mess with this some more.

Using the CurlImpersonateHttpClient adds this warning message on program start, and it doesn't go away even if I add the call that it asks for:

    asyncio.set_event_loop_policy(WindowsSelectorEventLoopPolicy())
    await crawler.run(["https://www.mtggoldfish.com/metagame/modern#paper"])

.venv\Lib\site-packages\curl_cffi\aio.py:137: RuntimeWarning:
    Proactor event loop does not implement add_reader family of methods required.
    Registering an additional selector thread for add_reader support.
    To avoid this warning use:
        asyncio.set_event_loop_policy(WindowsSelectorEventLoopPolicy())
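(A hedged note: asyncio.set_event_loop_policy() only affects event loops created after the call, so calling it inside an already-running coroutine, as in the snippet above, has no effect. A minimal sketch of setting it before asyncio.run():)

    import asyncio
    import sys

    async def main() -> None:
        ...  # build the crawler and await crawler.run([...]) here

    if __name__ == "__main__":
        if sys.platform == "win32":
            # Must run before the event loop is created, i.e. before asyncio.run().
            asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
        asyncio.run(main())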

Secondly, it only partially resolves the issue. It scrapes a few hundred items and then starts to fail in the same way:

┌─────────────────────────────┬────────────────────────────┐
│ requests_finished           │ 388                        │
│ requests_failed             │ 1465                       │
│ retry_histogram             │ [388, 0, 0, 0, 0, 0, 0, 0, │
│                             │ 0, 1465]                   │
│ request_avg_failed_duration │ 0.06604                    │
│ request_avg_finished_durat… │ 0.769383                   │
│ requests_finished_per_minu… │ 154                        │
│ requests_failed_per_minute  │ 583                        │
│ request_total_duration      │ 395.268903                 │
│ requests_total              │ 1853                       │
│ crawler_runtime             │ 150.741988                 │
└─────────────────────────────┴────────────────────────────┘

The error, once it starts spitting them out (on the final retry), is the same: the soup is empty. Rerunning it does get more information, so I'm not sure whether it's temporary throttling or not, but looking through the Python docs I don't see a way to throttle the requests to make it smarter about this. I can add a time.sleep(1), but that will make the scrape take forever, and I feel like there should be a more elegant solution.

I can provide more code blocks if needed, but the links that it's failing on are the same style.

MrTyton (Author) commented Sep 2, 2024

Yeah, opening the website in a browser while the scrape is running does give a "Throttled" response.

janbuchar (Collaborator) commented Sep 3, 2024

> Using the CurlImpersonateHttpClient adds this warning message on program start, and it doesn't go away even if I add the call that it asks for

Weird. I opened #495 to track this.

> Secondly, it only partially resolves the issue. It scrapes a few hundred items and then starts to fail in the same way. The error, once it starts spitting them out (on the final retry), is the same: the soup is empty. Rerunning it does get more information, so I'm not sure whether it's temporary throttling or not, but looking through the Python docs I don't see a way to throttle the requests to make it smarter about this. I can add a time.sleep(1), but that will make the scrape take forever, and I feel like there should be a more elegant solution.

You have multiple options here.

  1. use the concurrency_settings parameter of BeautifulSoupCrawler (doc) and set max_tasks_per_minute to something lower than infinity 🙂 (see the sketch after this list)
  2. in your request handler, do something like this:

    if context.http_response.status_code >= 400:
        await asyncio.sleep(1)
        raise RuntimeError("Got throttled")  # this kind of exception should trigger a retry

    # whatever you normally do here
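(A minimal sketch of option 1; hedged: the ConcurrencySettings import path and parameter name are assumptions that may differ between crawlee versions.)

    from crawlee import ConcurrencySettings
    from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
    from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient

    crawler = BeautifulSoupCrawler(
        http_client=CurlImpersonateHttpClient(),
        # Cap the request rate so the target site is less likely to start throttling.
        concurrency_settings=ConcurrencySettings(max_tasks_per_minute=120),
    )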

MrTyton (Author) commented Sep 4, 2024

context doesn't have a response field, it's http_response. That being said, you did put me on the right track: it seems the site isn't actually returning a 429, it just returns another page that says "Throttled".

I ended up just doing this

import random
from time import sleep


class WaitHandler:
    def __init__(self):
        self.seconds = 5
        self.total_times_throttled = 0
        self.time_spent_waiting = 0

    def handle(self, context):
        # Treat a missing soup or a "Throttled" page as a throttling response.
        if (
            context.soup is None
            or "Throttled" in context.soup.text
        ):
            context.log.warning(
                f"Getting Throttled, waiting for {self.seconds} seconds"
            )
            sleep(self.seconds)  # blocking sleep; pauses the whole event loop
            self.time_spent_waiting += self.seconds
            self.total_times_throttled += 1
            context.log.warning("Resuming")
            # Back off a little more each time we get throttled.
            self.seconds += random.randint(5, 10)
            raise RuntimeError(f"Throttled on {context.request.url}")
        self.seconds = 5
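(A hedged usage sketch with names from this thread: the handler above would be instantiated once and called at the top of the request handler, so the RuntimeError feeds into Crawlee's retry logic.)

    wait_handler = WaitHandler()

    @router.handler("deck")
    async def deck_handler(context: BeautifulSoupCrawlingContext):
        wait_handler.handle(context)  # raises on a throttled page, so the request is retried
        # ... normal parsing continues here ...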

Going to close this for now. Thanks for the help!
