-
Notifications
You must be signed in to change notification settings - Fork 321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add support for pre_navigation_hooks
in PlaywrightCrawler
#602
feat: Add support for pre_navigation_hooks
in PlaywrightCrawler
#602
Conversation
@janbuchar Why is this not merged |
We value your contribution, but please, keep your tone civil. |
pre_navigation_hooks
in Playwright
pre_navigation_hooks
in Playwrightpre_navigation_hooks
in PlaywrightCrawler
And let me add a third reason, there are no tests nor documentation for this new feature. Also, this option should be available for all crawler types (except BasicCrawler which is not making any requests on its own), not just playwright. |
@@ -190,15 +201,13 @@ async def _handle_blocked_request( | |||
if self._retry_on_blocked: | |||
status_code = crawling_context.response.status | |||
|
|||
# Check if the session is blocked based on the HTTP status code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't delete random comments.
from ._playwright_crawling_context import PlaywrightCrawlingContext | ||
except ImportError as exc: | ||
raise ImportError( | ||
"To import anything from this subpackage, you need to install the 'playwright' extra." | ||
"For example, if you use pip, run `pip install 'crawlee[playwright]'`.", | ||
) from exc | ||
|
||
__all__ = ['PlaywrightCrawler', 'PlaywrightCrawlingContext'] | ||
__all__ = ['PlaywrightCrawler', 'PlaywrightCrawlingContext', 'PlaywrightHook'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason to export this type?
@@ -48,6 +49,7 @@ def __init__( | |||
browser_pool: BrowserPool | None = None, | |||
browser_type: BrowserType | None = None, | |||
headless: bool | None = None, | |||
pre_navigation_hooks: List[PlaywrightHook] | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a perfectly valid way of passing the hooks, but have you considered adding a decorator instead/as well? We did a similar change for request handlers because Python does not have anonymous functions (that can contain more than one statement).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will work on adding a decorator also mate
response=None, | ||
enqueue_links=None, # We'll set this later |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a no-go. First, it won't pass the type checker. Second, nullable attributes lead to errors, mental overhead and hard to read code.
Instead, we should add one more parent class to PlaywrightCrawlingContext
that only adds the page
attribute. Then, we should split the _make_http_request
into two stages - one that opens the page and executes prenavigation hooks, and second that does the navigation. The context pipeline (see line 78) should be responsible for chaining these.
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes will work on that
Hi Mate I am really sorry for what is said i really did not mean it that way. |
Closing in favor of #631 |
Description
This implementation adds support for preNavigationHooks in the PlaywrightCrawler
Use the new pre_navigation_hooks:
`async def example_hook(context: PlaywrightCrawlingContext, goto_options: dict) -> None:
await context.page.evaluate("window.foo = 'bar';")
goto_options['timeout'] = 60000 # Set a custom timeout for navigation
crawler = PlaywrightCrawler(
# other options...
pre_navigation_hooks=[example_hook]
)`
Issues
#427