Skip to content

Conversation

@Pijukatel
Copy link
Collaborator

@Pijukatel Pijukatel commented Mar 10, 2025

Description

Make browserfore dependency optional.
Add temporary fallback header generator for HttpxHttpClient.

This is temporary solution until browserforge is updated to avoid runtime downloads (daijro/browserforge#29) and until crawlee is split into smaller more dedicated packages.

Issues

Add temporary fallback header generator for Httpx client.
@github-actions github-actions bot added this to the 110th sprint - Tooling team milestone Mar 10, 2025
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Mar 10, 2025
@Pijukatel Pijukatel added the adhoc Ad-hoc unplanned task added during the sprint. label Mar 10, 2025
@Pijukatel Pijukatel marked this pull request as ready for review March 10, 2025 13:16
@Pijukatel Pijukatel requested review from janbuchar and vdusek March 10, 2025 13:16
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we do not need the PW-related stuff, correct?

Regarding the user-agents:

Random 1000 user agents from Apify fingerprint dataset.

Since this is just a temporary fix, perhaps we could include e.g. only the 10 most common UAs from the list instead (now it is a subset of the dataset).

@Pijukatel
Copy link
Collaborator Author

I believe we do not need the PW-related stuff, correct?

Regarding the user-agents:

Random 1000 user agents from Apify fingerprint dataset.

Since this is just a temporary fix, perhaps we could include e.g. only the 10 most common UAs from the list instead (now it is a subset of the dataset).

What is the benefit of that?
Including smaller set would downgrade the HttpxClient user agent header diversity compared to not only browserforge based headers, but also compared to the previous header generator. Maintenance cost is the same for 1000 or 10 samples. The only difference is in the size of the file - which is negligible compared to the size of the repo.

from crawlee._utils.docs import docs_group
from crawlee.errors import ProxyError
from crawlee.fingerprint_suite import HeaderGenerator
from crawlee.fingerprint_suite._fallback_header_generator import HeaderGenerator
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we instead import the real-deal browserforge-based generator when it's actually about to be used?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that having runtime conditional imports is correct approach. As far as I see it the correct approach is fixing the browserforge to not execute code on import time and then split the repo into smaller pieces so that sdk does not need to import BasicCrawler -> HttpxClient -> browserforge, when it is not even using BasicCrawler.

When code is using BasicCrawler, then importing browserforge is legitimate.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct in what sense? If we can't control when browserforge stops fetching stuff on import time, then I believe importing conditionally (temporarily, until browserforge knows better) is slightly better than having a different implementation of HeaderGenerator in the httpx client.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per discussion on Monday I understood there are two separate issues:

  1. browserforge doing stuff on import time
  2. The fact that we need browserforge when pip install crawlee (without extras)

The open PR in browserforge repo is addressing the first point.
This PR is addressing the second point and is needed regardless of the first point (until we split the repo into smaller pieces).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, as long as the difference in behavior for httpx client is temporary, I guess this is OK. But, the lazy import would be easier to revert once browserforge stops running stuff on import.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I am not enthusiastic about this change either and I would rather not have it, but this was agreed on the meeting.

Anyway. Browserforge PR does not have any activity. This is proposal workaround until that PR is merged:
#1073

Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the benefit of that?
Including smaller set would downgrade the HttpxClient user agent header diversity compared to not only browserforge based headers, but also compared to the previous header generator. Maintenance cost is the same for 1000 or 10 samples. The only difference is in the size of the file - which is negligible compared to the size of the repo.

There are 1000 samples, but only ~200 are unique. We could use the X most common ones with even probability, achieving nearly the same result with 90% fewer lines of code.

Also, the Playwright headers aren't used at all.

However, considering it's a tmp fix, I won't discuss it further.

Also, it should be a fix, shouldn't it?

@Pijukatel
Copy link
Collaborator Author

Pijukatel commented Mar 11, 2025

Also, it should be a fix, shouldn't it?

Well, it is fix from apify-sdk-python point of view, but it is just refactor from crawlee-python. From crawlee-python point of view, nothing was broken.

@janbuchar janbuchar changed the title refactor: Make browserfore dependency optional refactor: Make browserforge dependency optional Mar 11, 2025
@Pijukatel
Copy link
Collaborator Author

Closed in favor of: #1075

@Pijukatel Pijukatel closed this Mar 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants