Observing a mismatch between the (scrapy) Response headers and the Response body.
How to reproduce?
Crawl a spider that would yield a single (scrapy) Request configured to be handled by scrapy-playwright as described in the readme.
In the "playwright_page_methods" meta, include a PageMethod as follows:
[a] Pass to the method argument either the string "click" or a Callable that awaits a call to the (playwright) Locator.click method.
[b] In the PageMethod or Locator (as chosen in [a]), pass to the selector argument the selector of a hyperlink to a different page.
[tldr;] Make Playwright click on a hyperlink to a page different from the one it originally landed on (after any redirects).
[note] Ensure that your code waits for browser navigation to the linked page to complete (until at least "commit").
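Under those assumptions, a minimal reproduction spider might look like the sketch below. The start URL and the link selector are placeholders, and the settings follow the scrapy-playwright readme; "commit" is not a valid state for Playwright's wait_for_load_state, so "domcontentloaded" is used to wait out the navigation triggered by the click.

```python
import scrapy
from scrapy_playwright.page import PageMethod


class ClickThroughSpider(scrapy.Spider):
    name = "click_through"

    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        # "https://example.com/start" and "a.next-page" are placeholders.
        yield scrapy.Request(
            "https://example.com/start",
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    # [a]/[b]: click the hyperlink to a different page...
                    PageMethod("click", "a.next-page"),
                    # [note]: ...and wait for the resulting navigation.
                    PageMethod("wait_for_load_state", "domcontentloaded"),
                ],
            },
            callback=self.parse,
        )

    def parse(self, response):
        # Observed: response.body comes from the final request in the
        # chain, while response.headers comes from an earlier one.
        self.logger.info("headers=%r", response.headers)
```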
What's happening in the background? (afaik)
In a simple example not involving any redirects, I believe the first (original) Request explicitly sent by the spider gives rise to a second, implicit request: when Playwright clicks the link in the page it received in response to the first request, the browser requests the URL the hyperlink refers to.
In a more nuanced example, the first request gets (HTTP 302) redirected to a second request, and Playwright creates a third, implicit request when it clicks the link it received in response to the second request.
The first example of this I noticed in the wild involved four requests. Everything happened as described in the three-request example right above this paragraph, but the response to the third request was (yet another) HTTP 302 redirect, causing a fourth and final request to be sent.
Summary so far...
There are 4 requests in the chain, Req. 1 ... Req. 4. There are 2 pages: the <Interim_Page>, which contains the link Playwright clicks on, and the <Requested_Page>, which the spider parses.
Req. 1 --[ redirect ]--> Req. 2 --[ ok ]--> <Interim_Page> --[ click ]--> Req. 3 --[ redirect ]--> Req. 4 --[ ok ]--> <Requested_Page>
What is the result?
I am observing that the resulting (scrapy) Response contains the body of the response to Req. 4, but the headers of the response to Req. 2. Unless my understanding of the situation is incomplete, this confuses me, and maybe also confuses some scrapy middleware.
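One way to reproduce the observed pairing in plain Python. The model below is purely illustrative (not scrapy-playwright internals): the body is forwarded through every hop, redirects and the click alike, while header forwarding stops at the first non-redirect hop, which yields exactly the Req. 2 headers / Req. 4 body combination.

```python
# Toy model of the four-response chain; "via" records how the chain
# left each response. All names and values are illustrative only.
chain = [
    {"via": "redirect", "headers": {"Location": "/interim"}, "body": b""},       # Resp. 1
    {"via": "click",    "headers": {"X-Src": "resp-2"}, "body": b"<interim>"},   # Resp. 2
    {"via": "redirect", "headers": {"Location": "/final"}, "body": b""},         # Resp. 3
    {"via": "end",      "headers": {"X-Src": "resp-4"}, "body": b"<requested>"}, # Resp. 4
]


def assemble(chain):
    """Mimic the observed behavior: take the body of the final response,
    but stop forwarding headers at the first non-redirect hop (i.e. at
    the response that was left via the click)."""
    body = chain[-1]["body"]
    headers = None
    for resp in chain:
        headers = resp["headers"]
        if resp["via"] != "redirect":
            break
    return headers, body


headers, body = assemble(chain)
# headers come from Resp. 2, body from Resp. 4: the reported mismatch.
```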
What do I like about the result?
The final resulting body should continue to correspond to Req. 4 by default as it is right now, in my opinion. Reasoning:
1. Generally, when a server has set up such a redirection pattern, the data it sends in response to Req. 4 is probably highly similar to what it would have sent in response to Req. 1 had it skipped the redirection (for example, because it knew the client was returning to a page already visited in that same session). This supports my view that, by default, the response body should correspond to the last and final request in the chain, which was Req. 4 in this case.
2. It seems that scrapy-playwright (or perhaps the RedirectMiddleware, if it controls this behavior) reacts to the HTTP 302 redirect responses in a reasonable and expected manner. The system is made to believe that the response body for Req. 2 was the response body for Req. 1, which makes sense because the browser redirected in response to a 302. For the same reason, the system is also told that the response body for Req. 4 was the response body for Req. 3, as it should be. At least in my scenario, the PageMethod click between Req. 2 and Req. 3 was similarly intended, i.e., aimed at making progress towards the <Requested_Page>, so by default it should continue to be treated like a redirect. That is, the click should forward the responsibility of setting the body, just as a redirect would.
3. From my observations, the only response body that ends up reaching the spider's parser is the one in response to Req. 4, aka the <Requested_Page>, which further supports my point that the Response object should continue to refer to that same body.
4. This behavior works seamlessly with the core of scrapy, as discussed in [3.] right above, and also with the HttpCacheMiddleware. To elaborate, the response_url in the cached meta file is that of the response to Req. 4, so the fact that the cached response_body file contains the response body of none other than Req. 4 promotes cache data consistency.
5. Finally, if a spider carries instructions to execute a PageMethod that clicks a link to a different page even before the current page is parsed, then I am happy with the default assumption that the author intended the click as a redirect, unless they explicitly state otherwise.
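The cache-consistency point about HttpCacheMiddleware can be checked directly on disk. A minimal sketch, assuming Scrapy's FilesystemCacheStorage layout (one directory per request fingerprint holding pickled_meta, response_body, and response_headers files):

```python
import pickle
from pathlib import Path


def check_cache_entry(entry_dir):
    """Report what one cached entry claims, assuming Scrapy's
    FilesystemCacheStorage layout. Consistency would require all three
    files to describe the same response; in the scenario above,
    response_url and response_body describe the response to Req. 4,
    while response_headers still describes the response to Req. 2."""
    entry = Path(entry_dir)
    meta = pickle.loads((entry / "pickled_meta").read_bytes())
    body = (entry / "response_body").read_bytes()
    headers = (entry / "response_headers").read_bytes()
    return {
        "response_url": meta.get("response_url"),
        "body_bytes": len(body),
        "raw_headers": headers,
    }
```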
There may be cases I haven't encountered where one might want the configurability to not treat such clicks as redirects. However, I would like to keep the scope of this issue limited to clearing up my confusion about, or fixing, the mismatch between the body and the headers. Overriding the default behavior around the body can be raised as a separate feature request if folks find the lack of it bothersome enough.
What should we change about the result?
The headers should match the body.
If every cog of this system wants to treat those 4 requests as a single chain, then we should embrace the chain. Why stop forwarding the responsibility of setting the headers at Req. 2 when we continue forwarding the responsibility of setting the body all the way through Req. 4?
When I inspected the value of the Response object and first noticed one response's body paired with another's headers, I wondered if it was a bug. I had no clue what was going on until my spider had re-crawled many times and I had read through the logs and the cache, monitored the network tab, and formed the hypothesis in the "What's happening in the background? (afaik)" section above.
Let us have the responsibility of setting the headers forwarded through a scrapy-playwright-facilitated "click" on a hyperlink (to another route), so that the headers associated with the final (scrapy) Response are those of the final response in the chain, and thus match the body taken from that same response.
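The proposed rule can be expressed as a small selection function. This is a toy model, not a patch against scrapy-playwright; in a real handler, the main-frame responses could be collected with a page.on("response") listener while the page navigates.

```python
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def final_headers(main_frame_responses):
    """Given the main-frame responses in arrival order as
    (status, headers) pairs, return the headers of the last
    non-redirect response, i.e. the same response the final
    body comes from."""
    for status, headers in reversed(main_frame_responses):
        if status not in REDIRECT_STATUSES:
            return headers
    raise ValueError("chain contains no final non-redirect response")


chain = [
    (302, {"src": "Resp. 1"}),  # original request redirects
    (200, {"src": "Resp. 2"}),  # <Interim_Page>; the click happens here
    (302, {"src": "Resp. 3"}),  # click target redirects
    (200, {"src": "Resp. 4"}),  # <Requested_Page>
]
# final_headers(chain) picks Resp. 4's headers, matching Resp. 4's body.
```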
Why does this matter?
Response 2's headers sitting in the same (scrapy) Response object as Response 4's body just sounds wrong. I know that the HttpCacheMiddleware then also stores the data wrong: the cached response_headers file matches neither the cached response_body file nor the response_url in the cached meta file. There may be layers of scrapy, or custom spider code one might write, that could malfunction when the assumption that the headers match the body and the url is broken.
Particularly, I saw that this messed up the RFC2616 cache policy in the HttpCacheMiddleware. Say Response 2 contained cache headers indicating it should not be cached, since the <Interim_Page> was an on-the-fly response to the client's session state. Response 4, the <Requested_Page>, on the other hand, was a relatively static resource, so it contained cache headers indicating it should be cached for some time. With the Response 2 headers incorrectly respected for the Response 4 body, scrapy's caching layer mistook the static resource for a dynamic one. The cache policy, which depends on accurate cache headers to reduce the spider's request rate against the server, became less effective. This risks angering server administrators by repeatedly requesting static resources that their server encourages clients to cache.
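To illustrate the effect, here is a crude sketch of the freshness decision an RFC 2616-style policy makes from Cache-Control alone (scrapy's actual RFC2616Policy considers much more: Expires, Last-Modified heuristics, validators, and so on; header values here are illustrative):

```python
def is_cacheable(headers):
    """Crude sketch of a Cache-Control freshness check: honor
    no-store/no-cache, otherwise cache when a positive max-age
    is advertised."""
    cache_control = headers.get("Cache-Control", "")
    directives = [d.strip() for d in cache_control.split(",")]
    if "no-store" in directives or "no-cache" in directives:
        return False
    for directive in directives:
        if directive.startswith("max-age="):
            return int(directive.split("=", 1)[1]) > 0
    return False


interim_headers = {"Cache-Control": "no-store"}        # Response 2
requested_headers = {"Cache-Control": "max-age=3600"}  # Response 4

# With the mismatch, Response 2's headers are consulted for Response 4's
# body, so the static <Requested_Page> is never cached:
assert is_cacheable(interim_headers) is False
assert is_cacheable(requested_headers) is True
```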
What is (possibly) going wrong?