Conversation

@Mantisus Mantisus (Collaborator) commented May 30, 2025

Description

  • The goal of this refactoring is to create a new helper function that checks EnqueueLinksKwargs so it can be reused elsewhere in the code. The new function works as an iterator, which makes it a better fit for use as a filter (see the sketch after this list).
  • Also updated the related functions that currently contain EnqueueLinksKwargs check logic: extract_links and _commit_request_handler_result.
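To make the intent concrete, here is a minimal sketch of such a filter iterator. It assumes simplified keyword arguments (limit and exclude) instead of the full EnqueueLinksKwargs handling, and it is only an illustration, not the actual _enqueue_links_filter_iterator implementation:

```python
# Illustrative sketch only: not Crawlee's actual _enqueue_links_filter_iterator.
from __future__ import annotations

import re
from collections.abc import Iterable, Iterator


def enqueue_links_filter_iterator(
    urls: Iterable[str],
    *,
    limit: int | None = None,
    exclude: list[re.Pattern[str]] | None = None,
) -> Iterator[str]:
    """Yield only the URLs that pass the enqueue-links checks, up to `limit`."""
    yielded = 0
    for url in urls:
        if exclude and any(pattern.search(url) for pattern in exclude):
            continue  # Skip URLs matching an exclude pattern.
        yield url
        yielded += 1
        if limit is not None and yielded >= limit:
            break  # Stop lazily once the limit is reached.
```

Because it is a generator, callers can chain it with other iterators (such as link extraction) without materializing intermediate lists, which is what makes it a natural fit as a filter.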

@Mantisus Mantisus requested a review from Copilot May 30, 2025 09:34

Copilot AI (Contributor) left a comment

Pull Request Overview

Refactors link extraction and enqueue logic by introducing a reusable filter iterator for EnqueueLinksKwargs and centralizing URL normalization.

  • Added _enqueue_links_filter_iterator and _convert_url_to_request_iterator helpers to streamline link filtering and conversion.
  • Updated Playwright and HTTP crawler extract_links to use to_absolute_url_iterator and the new filter iterator, removing inline normalization logic (a sketch of the URL-conversion iterator follows after this list).
  • Removed duplicate enqueue helper functions and legacy URL conversion code across crawlers.
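For illustration, a hedged sketch of the URL-conversion step: the name mirrors the to_absolute_url_iterator helper added in src/crawlee/_utils/urls.py, but the signature and behavior shown here are assumptions for the example, not the merged implementation.

```python
# Illustrative sketch: convert a stream of possibly relative URLs to absolute ones.
from __future__ import annotations

from collections.abc import Iterable, Iterator
from urllib.parse import urljoin, urlparse


def to_absolute_url_iterator(base_url: str, urls: Iterable[str]) -> Iterator[str]:
    """Yield each URL as an absolute URL, resolving relative ones against base_url."""
    for url in urls:
        # URLs that already carry a network location are passed through unchanged.
        yield url if urlparse(url).netloc else urljoin(base_url, url)


extracted = ['/docs', 'https://example.com/about']
print(list(to_absolute_url_iterator('https://example.com', extracted)))
# ['https://example.com/docs', 'https://example.com/about']
```

In extract_links, such an iterator can then be fed straight into the filter iterator sketched above, so normalization and filtering both happen lazily in a single pass.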

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File descriptions:

  • src/crawlee/crawlers/_playwright/_playwright_crawler.py: Replaced inline URL normalization and filtering in extract_links with to_absolute_url_iterator and _enqueue_links_filter_iterator, and removed the old enqueue helper.
  • src/crawlee/crawlers/_basic/_basic_crawler.py: Introduced _enqueue_links_filter_iterator and _convert_url_to_request_iterator, updated the enqueue logic, and dropped duplicate code.
  • src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py: Updated the HTTP extract_links to leverage to_absolute_url_iterator and the new filter iterator, removing the old filtering logic.
  • src/crawlee/_utils/urls.py: Added the to_absolute_url_iterator helper for converting streams of URLs to absolute ones.

Comments suppressed due to low confidence (2)

src/crawlee/crawlers/_basic/_basic_crawler.py:943

  • The function references urlparse but it is not imported in this file. Add from urllib.parse import urlparse to prevent a NameError.
parsed_origin_url = urlparse(origin_url)

src/crawlee/crawlers/_basic/_basic_crawler.py:1181

  • is_url_absolute and convert_to_absolute_url are used here but not imported. Add from crawlee._utils.urls import is_url_absolute, convert_to_absolute_url to the top of the file.
elif isinstance(url, str) and not is_url_absolute(url):
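
For context, addressing both comments would mean the top of _basic_crawler.py gains roughly these imports (illustrative arrangement; the merged file may order or group them differently):

```python
from urllib.parse import urlparse

from crawlee._utils.urls import convert_to_absolute_url, is_url_absolute
```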

@Mantisus Mantisus self-assigned this May 30, 2025
@Mantisus Mantisus requested review from Pijukatel and janbuchar May 30, 2025 09:49

@vdusek vdusek (Collaborator) left a comment

Thanks to this change you were able to consolidate _create_enqueue_links_function from the abstract HTTP crawler and the Playwright crawler into a single method in the basic crawler, am I correct?

@Mantisus Mantisus (Collaborator, Author) commented Jun 5, 2025

> Thanks to this change you were able to consolidate _create_enqueue_links_function from the abstract HTTP crawler and the Playwright crawler into a single method in the basic crawler, am I correct?

Not quite. I moved _create_enqueue_links_function to the basic crawler, as this logic had already been consolidated, and moving it avoided code duplication.

The main purpose of this PR is related to #1213 (comment): to localize the checks for EnqueueLinksKwargs in a single function.

@Pijukatel Pijukatel merged commit 4845cdd into apify:master Jun 6, 2025
23 checks passed