Response body / header mismatch confusion when using PageMethod "click" on a link to a different page #341

Open
ish-101 opened this issue Feb 27, 2025 · 0 comments

What is (possibly) going wrong?

Observing a mismatch between the (scrapy) Response headers and the Response body.

How to reproduce?

  1. Run a spider that yields a single (scrapy) Request configured to be handled by scrapy-playwright, as described in the README.
  2. In the "playwright_page_methods" meta key, include a PageMethod as follows:
    [a] Pass either the string "click" or a Callable that awaits the (playwright) Locator.click method as the method argument.
    [b] In the PageMethod or Locator (as chosen in [a]), pass the selector of a hyperlink to a different page as the selector argument.
    [tldr;] Make Playwright click a hyperlink to a page different from the one it originally landed on (after any redirects).
    [note] Ensure that your code waits for the browser navigation to the linked page to complete (until at least "commit").

What's happening in the background? (afaik)

In a simple example not involving any redirects, I believe the first (original) Request explicitly sent by the spider produces a second, implicit request: when Playwright clicks the hyperlink on the page it received in response to the first request, the browser requests the URL the link points to.

In a more nuanced example, the first request gets (HTTP 302) redirected to a second request, and Playwright creates a third, implicit request when it clicks the link it received in response to the second request.

The first example of this I noticed in the wild involved four requests. Everything happened as described in the three-request example right above this paragraph, but the response to the third request was (yet another) HTTP 302 redirect, causing a fourth and final request to be sent.

Summary so far...

There are 4 requests in the chain, Req. 1 ... Req. 4, and 2 pages: the <Interim_Page>, which contains the link Playwright clicks on, and the <Requested_Page>, which the spider parses.

Req. 1 --[ redirect ]--> Req. 2 --[ ok ]--> <Interim_Page> --[ click ]--> Req. 3 --[ redirect ]--> Req. 4 --[ ok ]--> <Requested_Page>
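The chain above can be modeled with plain Python dicts to pin down exactly which response supplies the body and which supplies the headers. This is a toy model, not scrapy-playwright internals; all URLs, header values, and bodies are invented for illustration.

```python
# Toy model of the 4-request chain described above (all values invented).
chain = [
    {"url": "https://example.com/a", "status": 302,
     "headers": {"Location": "/b"}, "body": b""},                        # Req. 1
    {"url": "https://example.com/b", "status": 200,
     "headers": {"Cache-Control": "no-store"}, "body": b"<interim>"},    # Req. 2
    {"url": "https://example.com/c", "status": 302,
     "headers": {"Location": "/d"}, "body": b""},                        # Req. 3
    {"url": "https://example.com/d", "status": 200,
     "headers": {"Cache-Control": "max-age=3600"}, "body": b"<requested>"},  # Req. 4
]

# Observed: the scrapy Response carries the body of Resp. 4
# but the headers of Resp. 2.
observed = {"url": chain[3]["url"], "body": chain[3]["body"],
            "headers": chain[1]["headers"]}

# Expected: both body and headers come from the final response in the chain.
expected = {"url": chain[3]["url"], "body": chain[3]["body"],
            "headers": chain[3]["headers"]}
```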

What is the result?

I am observing that the resulting (scrapy) Response contains the body of the response to Req. 4 but the headers of the response to Req. 2. This confuses me, and may also confuse some scrapy middlewares, unless my understanding of the situation is incomplete.

What do I like about the result?

In my opinion, the resulting body should continue to correspond to Req. 4 by default, as it does now. Reasoning:

  1. Generally, when a server sets up such a redirection pattern, the data it sends in response to Req. 4 would probably be very similar to what it would have sent in response to Req. 1 had it skipped the redirection (for example, because it knew the client was returning to a page already visited in the same session). This supports the view that the response body should, by default, correspond to the last request in the chain, which was Req. 4 in this case.
  2. scrapy-playwright (or perhaps the RedirectMiddleware, if it controls this behavior) reacts to the HTTP 302 redirect responses in a reasonable and expected manner: the system is told that the response body for Req. 2 is the response body for Req. 1, which makes sense because the browser redirected in response to a 302, and likewise that the response body for Req. 4 is the response body for Req. 3. At least in my scenario, the PageMethod click between Req. 2 and Req. 3 was similarly intended as a step toward the <Requested_Page>, so by default it should be treated like a redirect. That is, the click should forward the responsibility of setting the body, just as a redirect does.
  3. From my observations, the only response body that reaches the spider's parse callback is the one in response to Req. 4, i.e. the <Requested_Page>, which further supports having the Response object refer to that same body.
  4. This behavior works seamlessly with the core of scrapy, as discussed in [3.] above, and also with the HttpCacheMiddleware: the response_url in the cached meta file is that of the response to Req. 4, so the cached response_body file containing the body of that same response keeps the cache data consistent.
  5. Finally, if a spider is instructed to execute a PageMethod that clicks a link to a different page even before the current page is parsed, I am happy with the default assumption that the author intended the click as a redirect unless they explicitly state otherwise.

There may be cases I haven't encountered where one might want the option to not treat such clicks as redirects. However, I would like to keep the scope of this issue limited to clearing my confusion around, or fixing, the mismatch between the body and the headers. Overriding the default behavior for the body can be raised as a separate feature request if folks find the lack of it bothersome enough.

What should we change about the result?

The headers should match the body.

If every cog of this system wants to treat those 4 requests as a single chain, then we should embrace the chain. Why stop forwarding the responsibility of setting headers at Req. 2 when we continue to forward the responsibility of setting the body through Req. 4?

When I inspected the Response object and first noticed a body paired with headers from a different response, I wondered whether it was a bug. I had no clue what was going on until my spider had re-crawled many times and I had read through the logs and cache, monitored the network tab, and formed the hypothesis in the "What's happening in the background? (afaik)" section above.

Let us forward the responsibility of setting headers through the "click" of a hyperlink (to another route) facilitated by scrapy-playwright, so that the headers associated with the (scrapy) Response are those of the final response in the chain and thus match the body taken from that same response.
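The proposed rule can be illustrated with a toy resolver (not scrapy-playwright internals; the chain entries and header values are invented). Each entry records how its response was reached: "start", "redirect", or "click". Today, header responsibility stops at the last response before the click; the proposal forwards it through the click just like a redirect.

```python
# Toy model contrasting current and proposed header handling.
def resolve_headers(chain, forward_through_click):
    headers = None
    for hop, response in chain:
        if hop == "click" and not forward_through_click:
            # Current behavior: header responsibility stops at the
            # response received just before the click.
            break
        headers = response["headers"]
    return headers


chain = [
    ("start",    {"headers": {"X-Resp": "1"}}),                    # Req. 1 -> 302
    ("redirect", {"headers": {"Cache-Control": "no-store"}}),      # Req. 2 -> <Interim_Page>
    ("click",    {"headers": {"X-Resp": "3"}}),                    # Req. 3 -> 302
    ("redirect", {"headers": {"Cache-Control": "max-age=3600"}}),  # Req. 4 -> <Requested_Page>
]
```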

Why does this matter?

  1. Having the Response 2 headers in the same (scrapy) Response object as the Response 4 body just sounds wrong. The HttpCacheMiddleware then also stores the data inconsistently: the cached response_headers file matches neither the cached response_body file nor the response_url in the cached meta file. Layers of scrapy, or custom spider code one might write, could malfunction when the assumption that the headers match the body and the URL is broken.
  2. In particular, I saw this break the RFC2616 cache policy in the HttpCacheMiddleware. Say Response 2 contains cache headers indicating that it should not be cached, since the <Interim_Page> is generated on the fly from the client's session state, while Response 4, the <Requested_Page>, is a relatively static resource whose cache headers indicate it may be cached for some time. With the Response 2 headers incorrectly applied to the Response 4 body, scrapy's caching layer mistakes the static resource for a dynamic one. The cache policy, which depends on accurate cache headers to reduce the spider's request rate against the server, becomes less effective. This risks angering server administrators by repeatedly requesting static resources that their server encourages clients to cache.
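To make the cache-policy impact concrete, here is a deliberately simplified stand-in for the caching decision. This is an illustration only, not Scrapy's actual RFC2616 policy implementation, and the header values are made up.

```python
def should_cache(headers: dict) -> bool:
    """Simplified caching decision based on Cache-Control (illustration only)."""
    cache_control = headers.get("Cache-Control", "")
    if "no-store" in cache_control:
        return False
    return "max-age" in cache_control


interim_headers = {"Cache-Control": "no-store"}        # Resp. 2: dynamic interim page
requested_headers = {"Cache-Control": "max-age=3600"}  # Resp. 4: static requested page
```

With the Resp. 2 headers attached to the Resp. 4 body, a decision like should_cache returns False for a resource the server marked cacheable, so the static page ends up re-fetched on every crawl.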

Version Details

scrapy-playwright==0.0.42
Scrapy==2.12.0
playwright==1.50.0
python==3.13.2