-
Notifications
You must be signed in to change notification settings - Fork 539
fix: respect EnqueueLinksKwargs for extract_links function
#1213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes the support for EnqueueLinksKwargs parameters for the extract_links function in multiple crawler implementations. The key changes include:
- Adding new kwargs support (e.g., include, exclude, and limit) for link extraction in both Playwright and Abstract HTTP crawlers.
- Updating related tests in Playwright, Parsel, and BeautifulSoup crawlers to validate the new parameters.
- Updating import statements to include Glob from crawlee.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/crawlers/_playwright/test_playwright_crawler.py | Updated test to use new kwargs parameters and added Glob import. |
| tests/unit/crawlers/_parsel/test_parsel_crawler.py | Updated test to use new kwargs parameters and added Glob import. |
| tests/unit/crawlers/_beautifulsoup/test_beautifulsoup_crawler.py | Updated test to use new kwargs parameters and added Glob import. |
| src/crawlee/crawlers/_playwright/_playwright_crawler.py | Modified extract_links to explicitly handle strategy, include, exclude, and limit kwargs. |
| src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py | Modified extract_links to explicitly handle strategy, include, exclude, and limit kwargs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed confusing, that the extract_links has the EnqueueLinksKwargsand not actually using them. I am not sure if want to duplicate the filtering logic that is already in the _commit_request_handler_result though. Maybe it would be sufficient to remove EnqueueLinksKwargs from extract_links and keep it only on add_requests?
I think that even the issue reporter could have worked around the issue, by just passing the EnqueueLinksKwargs to the add_requests call even if it contained unfiltered urls.
Yeah, I thought of that too and I don't like it too. But it could be useful for users. Like in your example, when the user extracts links by applying filters from So it's a choice between user-friendliness and some redundancy in the code |
You are right, it is more straight forward and more convenient for the users. I can't really argue against my own example :-) |
|
I think this is fine as a fix and we should merge it. As the next step, do you think we could make some shared function for applying |
I totally agree! |
Description
EnqueueLinksKwargsparameters forextract_linksfunctionIssues