refactor: add enqueue links filter iterator #1223
Conversation
Pull Request Overview
Refactors link extraction and enqueue logic by introducing a reusable filter iterator for EnqueueLinksKwargs and centralizing URL normalization.
- Added `_enqueue_links_filter_iterator` and `_convert_url_to_request_iterator` helpers to streamline link filtering and conversion (see the sketch after this list).
- Updated the Playwright and HTTP crawler `extract_links` to use `to_absolute_url_iterator` and the new filter iterator, removing inline normalization logic.
- Removed duplicate enqueue helper functions and legacy URL conversion code across crawlers.
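To make the first bullet concrete, here is a minimal, self-contained sketch of what a filter iterator over extracted links could look like. The function name mirrors the PR, but the parameters (`include`, `exclude`, `same_domain_only`) are simplified stand-ins for `EnqueueLinksKwargs` fields, not the actual crawlee implementation.

```python
import re
from collections.abc import Iterator
from urllib.parse import urlparse


def enqueue_links_filter_iterator(
    urls: Iterator[str],
    origin_url: str,
    *,
    include: list[str] | None = None,   # regex patterns a URL must match
    exclude: list[str] | None = None,   # regex patterns a URL must not match
    same_domain_only: bool = False,     # stand-in for an enqueue strategy check
) -> Iterator[str]:
    """Lazily yield only the URLs that pass the configured filters."""
    origin_netloc = urlparse(origin_url).netloc
    for url in urls:
        if exclude and any(re.search(pattern, url) for pattern in exclude):
            continue
        if include and not any(re.search(pattern, url) for pattern in include):
            continue
        if same_domain_only and urlparse(url).netloc != origin_netloc:
            continue
        yield url
```

Because it is a generator, the filter can be chained after extraction and normalization without building intermediate lists.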
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/crawlee/crawlers/_playwright/_playwright_crawler.py | Replaced inline URL normalization and filtering in extract_links with to_absolute_url_iterator and _enqueue_links_filter_iterator, and removed the old enqueue helper. |
| src/crawlee/crawlers/_basic/_basic_crawler.py | Introduced _enqueue_links_filter_iterator, _convert_url_to_request_iterator, updated enqueue logic, and dropped duplicate code. |
| src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py | Updated HTTP extract_links to leverage to_absolute_url_iterator and the new filter iterator, removing old filtering logic. |
| src/crawlee/_utils/urls.py | Added to_absolute_url_iterator helper for converting streams of URLs to absolute. |
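Based on the description in the last row, the urls.py helper could plausibly take a shape like the sketch below: resolve relative URLs against a base URL lazily and pass absolute ones through unchanged. This is an illustration, not the code merged in the PR.

```python
from collections.abc import Iterator
from urllib.parse import urljoin, urlparse


def is_url_absolute(url: str) -> bool:
    """Treat a URL as absolute when it carries both a scheme and a netloc."""
    parsed = urlparse(url)
    return bool(parsed.scheme) and bool(parsed.netloc)


def to_absolute_url_iterator(base_url: str, urls: Iterator[str]) -> Iterator[str]:
    """Resolve relative URLs against base_url; absolute URLs are yielded unchanged."""
    for url in urls:
        yield url if is_url_absolute(url) else urljoin(base_url, url)
```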
Comments suppressed due to low confidence (2)
src/crawlee/crawlers/_basic/_basic_crawler.py:943
- The function references `urlparse`, but it is not imported in this file. Add `from urllib.parse import urlparse` to prevent a NameError.
`parsed_origin_url = urlparse(origin_url)`
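A fix along the lines Copilot suggests would add the standard-library import at the top of `_basic_crawler.py`; a minimal stand-alone illustration (with a placeholder `origin_url` value):

```python
from urllib.parse import urlparse

origin_url = 'https://example.com/start'  # placeholder value for illustration
parsed_origin_url = urlparse(origin_url)
print(parsed_origin_url.netloc)  # -> example.com
```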
src/crawlee/crawlers/_basic/_basic_crawler.py:1181
- `is_url_absolute` and `convert_to_absolute_url` are used here but not imported. Add `from crawlee._utils.urls import is_url_absolute, convert_to_absolute_url` to the top of the file.
`elif isinstance(url, str) and not is_url_absolute(url):`
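Likewise, this comment would be addressed by importing the two helpers from `crawlee._utils.urls`. The sketch below shows how the branch might then resolve a relative URL; the helper name `_to_absolute` and the argument order of `convert_to_absolute_url` are assumptions, not the merged code.

```python
from typing import Any

from crawlee._utils.urls import convert_to_absolute_url, is_url_absolute


def _to_absolute(url: Any, base_url: str) -> Any:
    """Hypothetical helper mirroring the branch from the review comment."""
    if isinstance(url, str) and not is_url_absolute(url):
        # Argument order of convert_to_absolute_url is assumed here.
        return convert_to_absolute_url(base_url, url)
    return url
```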
vdusek left a comment:
Thanks to this change you were able to consolidate `_create_enqueue_links_function` from the abstract HTTP crawler and the Playwright crawler into a single method in the basic crawler, am I correct?
Not quite. I moved the `EnqueueLinksKwargs` check logic into two places: `extract_links` and `_commit_request_handler_result`.

Description

The main purpose of this PR is related to #1213 (comment): to localize the checks for `EnqueueLinksKwargs` so they can be reused elsewhere in the code. The new function works as an iterator, to better fit its application as a filter.
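As a self-contained illustration of the "filter as iterator" idea from the description (standard library only; names are hypothetical): normalization and filtering are chained lazily, and only the surviving links are materialized at the end.

```python
from urllib.parse import urljoin, urlparse


def extract_links_sketch(raw_hrefs: list[str], base_url: str) -> list[str]:
    """Normalize hrefs to absolute URLs, then lazily filter to the same site."""
    absolute = (urljoin(base_url, href) for href in raw_hrefs)
    base_netloc = urlparse(base_url).netloc
    same_site = (url for url in absolute if urlparse(url).netloc == base_netloc)
    return list(same_site)  # nothing is built until this point


print(extract_links_sketch(['/docs', 'https://other.example/x'], 'https://example.com/'))
# -> ['https://example.com/docs']
```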