refactor: add enqueue links filter iterator #1223
Conversation
Pull Request Overview
Refactors link extraction and enqueue logic by introducing a reusable filter iterator for EnqueueLinksKwargs and centralizing URL normalization.
- Added `_enqueue_links_filter_iterator` and `_convert_url_to_request_iterator` helpers to streamline link filtering and conversion (see the sketch after this list).
- Updated the Playwright and HTTP crawler `extract_links` to use `to_absolute_url_iterator` and the new filter iterator, removing inline normalization logic.
- Removed duplicate enqueue helper functions and legacy URL conversion code across crawlers.
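To make the first bullet concrete, here is a minimal, self-contained sketch of what a filter iterator over extracted links could look like. The function name mirrors the PR, but the parameters (`include`, `exclude`, `same_domain_only`) are simplified stand-ins for `EnqueueLinksKwargs` fields, not the actual crawlee implementation.

```python
import re
from collections.abc import Iterator
from urllib.parse import urlparse


def enqueue_links_filter_iterator(
    urls: Iterator[str],
    origin_url: str,
    *,
    include: list[str] | None = None,   # regex patterns a URL must match
    exclude: list[str] | None = None,   # regex patterns a URL must not match
    same_domain_only: bool = False,     # stand-in for an enqueue strategy check
) -> Iterator[str]:
    """Lazily yield only the URLs that pass the configured filters."""
    origin_netloc = urlparse(origin_url).netloc
    for url in urls:
        if exclude and any(re.search(pattern, url) for pattern in exclude):
            continue
        if include and not any(re.search(pattern, url) for pattern in include):
            continue
        if same_domain_only and urlparse(url).netloc != origin_netloc:
            continue
        yield url
```

Because it is a generator, the filter can be chained after extraction and normalization without building intermediate lists.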
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/crawlee/crawlers/_playwright/_playwright_crawler.py | Replaced inline URL normalization and filtering in extract_links with to_absolute_url_iterator and _enqueue_links_filter_iterator, and removed the old enqueue helper. |
| src/crawlee/crawlers/_basic/_basic_crawler.py | Introduced _enqueue_links_filter_iterator, _convert_url_to_request_iterator, updated enqueue logic, and dropped duplicate code. |
| src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py | Updated HTTP extract_links to leverage to_absolute_url_iterator and the new filter iterator, removing old filtering logic. |
| src/crawlee/_utils/urls.py | Added to_absolute_url_iterator helper for converting streams of URLs to absolute. |
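Based on the description in the last row, the urls.py helper could plausibly take a shape like the sketch below: resolve relative URLs against a base URL lazily and pass absolute ones through unchanged. This is an illustration, not the code merged in the PR.

```python
from collections.abc import Iterator
from urllib.parse import urljoin, urlparse


def is_url_absolute(url: str) -> bool:
    """Treat a URL as absolute when it carries both a scheme and a netloc."""
    parsed = urlparse(url)
    return bool(parsed.scheme) and bool(parsed.netloc)


def to_absolute_url_iterator(base_url: str, urls: Iterator[str]) -> Iterator[str]:
    """Resolve relative URLs against base_url; absolute URLs are yielded unchanged."""
    for url in urls:
        yield url if is_url_absolute(url) else urljoin(base_url, url)
```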
Comments suppressed due to low confidence (2)
src/crawlee/crawlers/_basic/_basic_crawler.py:943
- The function references `urlparse`, but it is not imported in this file. Add `from urllib.parse import urlparse` to prevent a NameError.
`parsed_origin_url = urlparse(origin_url)`
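A fix along the lines Copilot suggests would add the standard-library import at the top of `_basic_crawler.py`; a minimal stand-alone illustration (with a placeholder `origin_url` value):

```python
from urllib.parse import urlparse

origin_url = 'https://example.com/start'  # placeholder value for illustration
parsed_origin_url = urlparse(origin_url)
print(parsed_origin_url.netloc)  # -> example.com
```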
src/crawlee/crawlers/_basic/_basic_crawler.py:1181
- `is_url_absolute` and `convert_to_absolute_url` are used here but not imported. Add `from crawlee._utils.urls import is_url_absolute, convert_to_absolute_url` to the top of the file.
`elif isinstance(url, str) and not is_url_absolute(url):`
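Likewise, this comment would be addressed by importing the two helpers from `crawlee._utils.urls`. The sketch below shows how the branch might then resolve a relative URL; the helper name `_to_absolute` and the argument order of `convert_to_absolute_url` are assumptions, not the merged code.

```python
from typing import Any

from crawlee._utils.urls import convert_to_absolute_url, is_url_absolute


def _to_absolute(url: Any, base_url: str) -> Any:
    """Hypothetical helper mirroring the branch from the review comment."""
    if isinstance(url, str) and not is_url_absolute(url):
        # Argument order of convert_to_absolute_url is assumed here.
        return convert_to_absolute_url(base_url, url)
    return url
```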
vdusek left a comment:
Thanks to this change you were able to consolidate `_create_enqueue_links_function` from the abstract HTTP crawler and the Playwright crawler into a single method in the basic crawler, am I correct?
Not quite. I moved the `EnqueueLinksKwargs` check logic into two places: `extract_links` and `_commit_request_handler_result`.

Description

The main purpose of this PR is related to #1213 (comment): to localize the checks for `EnqueueLinksKwargs` so they can be reused elsewhere in the code. The new function works as an iterator, to better fit its application as a filter.
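As a self-contained illustration of the "filter as iterator" idea from the description (standard library only; names are hypothetical): normalization and filtering are chained lazily, and only the surviving links are materialized at the end.

```python
from urllib.parse import urljoin, urlparse


def extract_links_sketch(raw_hrefs: list[str], base_url: str) -> list[str]:
    """Normalize hrefs to absolute URLs, then lazily filter to the same site."""
    absolute = (urljoin(base_url, href) for href in raw_hrefs)
    base_netloc = urlparse(base_url).netloc
    same_site = (url for url in absolute if urlparse(url).netloc == base_netloc)
    return list(same_site)  # nothing is built until this point


print(extract_links_sketch(['/docs', 'https://other.example/x'], 'https://example.com/'))
# -> ['https://example.com/docs']
```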