Skip to content

Conversation

@Mantisus
Copy link
Collaborator

Description

  • Fix support EnqueueLinksKwargs parameters for extract_links function

Issues

@Mantisus Mantisus requested review from Pijukatel, Copilot and vdusek and removed request for vdusek May 23, 2025 05:09
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes the support for EnqueueLinksKwargs parameters for the extract_links function in multiple crawler implementations. The key changes include:

  • Adding new kwargs support (e.g., include, exclude, and limit) for link extraction in both Playwright and Abstract HTTP crawlers.
  • Updating related tests in Playwright, Parsel, and BeautifulSoup crawlers to validate the new parameters.
  • Updating import statements to include Glob from crawlee.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/unit/crawlers/_playwright/test_playwright_crawler.py Updated test to use new kwargs parameters and added Glob import.
tests/unit/crawlers/_parsel/test_parsel_crawler.py Updated test to use new kwargs parameters and added Glob import.
tests/unit/crawlers/_beautifulsoup/test_beautifulsoup_crawler.py Updated test to use new kwargs parameters and added Glob import.
src/crawlee/crawlers/_playwright/_playwright_crawler.py Modified extract_links to explicitly handle strategy, include, exclude, and limit kwargs.
src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py Modified extract_links to explicitly handle strategy, include, exclude, and limit kwargs.

@Mantisus Mantisus self-assigned this May 23, 2025
Copy link
Collaborator

@Pijukatel Pijukatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is indeed confusing, that the extract_links has the EnqueueLinksKwargsand not actually using them. I am not sure if want to duplicate the filtering logic that is already in the _commit_request_handler_result though. Maybe it would be sufficient to remove EnqueueLinksKwargs from extract_links and keep it only on add_requests?

I think that even the issue reporter could have worked around the issue, by just passing the EnqueueLinksKwargs to the add_requests call even if it contained unfiltered urls.

@Mantisus
Copy link
Collaborator Author

I am not sure if want to duplicate the filtering logic that is already in the _commit_request_handler_result though.

Yeah, I thought of that too and I don't like it too.

But it could be useful for users. Like in your example, when the user extracts links by applying filters from EnqueueLinksKwargs and then does something else with them.

So it's a choice between user-friendliness and some redundancy in the code

@Pijukatel
Copy link
Collaborator

I am not sure if want to duplicate the filtering logic that is already in the _commit_request_handler_result though.

Yeah, I thought of that too and I don't like it too.

But it could be useful for users. Like in your example, when the user extracts links by applying filters from EnqueueLinksKwargs and then does something else with them.

So it's a choice between user-friendliness and some redundancy in the code

You are right, it is more straight forward and more convenient for the users. I can't really argue against my own example :-)

@janbuchar
Copy link
Collaborator

I think this is fine as a fix and we should merge it. As the next step, do you think we could make some shared function for applying EnqueueLinksKwargs to a bunch of requests? I believe it would make the resulting code easier to navigate.

@Mantisus
Copy link
Collaborator Author

As the next step, do you think we could make some shared function for applying EnqueueLinksKwargs to a bunch of requests?

I totally agree!

@Pijukatel Pijukatel merged commit c9907d6 into apify:master May 27, 2025
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PlaywrightCrawler extract_links does not respect strategy

3 participants