refactor: Make `browserforge` dependency optional #1067

Pijukatel · 2025-03-10T13:12:16Z

Description

Make the browserforge dependency optional.
Add temporary fallback header generator for HttpxHttpClient.

This is a temporary solution until browserforge is updated to avoid runtime downloads (daijro/browserforge#29) and until crawlee is split into smaller, more dedicated packages.
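As an illustration of the idea in the description, the following is a minimal sketch of choosing between a full generator and a static fallback depending on whether an optional module is installed. It is not the code from this PR: the function names, the `"full"`/`"fallback"` labels, and the placeholder user agents are all assumptions made for the example.

```python
import importlib.util
import random

# Placeholder values for illustration only; not the user agents shipped in this PR.
FALLBACK_USER_AGENTS = (
    "ExampleBrowser/1.0 (placeholder A)",
    "ExampleBrowser/2.0 (placeholder B)",
)


def is_available(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def choose_generator_name(optional_module: str) -> str:
    """Report which (hypothetical) generator would be used."""
    return "full" if is_available(optional_module) else "fallback"


def fallback_headers(rng: random.Random) -> dict[str, str]:
    """Build a minimal header set from the static fallback list."""
    return {"User-Agent": rng.choice(FALLBACK_USER_AGENTS)}
```

Because `find_spec` only locates the module, checking availability this way does not run any of the optional package's import-time code.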

Issues

Relates to: Logs from browserforge

vdusek

I believe we do not need the PW-related stuff, correct?

Regarding the user-agents:

Random 1000 user agents from Apify fingerprint dataset.

Since this is just a temporary fix, perhaps we could include e.g. only the 10 most common UAs from the list instead (now it is a subset of the dataset).

Pijukatel · 2025-03-10T14:37:39Z

> I believe we do not need the PW-related stuff, correct?
>
> Regarding the user-agents:
>
> Random 1000 user agents from Apify fingerprint dataset.
>
> Since this is just a temporary fix, perhaps we could include e.g. only the 10 most common UAs from the list instead (now it is a subset of the dataset).

What is the benefit of that?
Including a smaller set would reduce the HttpxClient user agent header diversity, not only compared to browserforge-based headers but also compared to the previous header generator. The maintenance cost is the same for 1000 or 10 samples. The only difference is the size of the file, which is negligible compared to the size of the repo.

janbuchar · 2025-03-10T15:14:07Z

src/crawlee/http_clients/_httpx.py

 from crawlee._utils.docs import docs_group
 from crawlee.errors import ProxyError
-from crawlee.fingerprint_suite import HeaderGenerator
+from crawlee.fingerprint_suite._fallback_header_generator import HeaderGenerator


Can't we instead import the real-deal browserforge-based generator when it's actually about to be used?

I do not think that having runtime conditional imports is the correct approach. As far as I can see, the correct approach is to fix browserforge so that it does not execute code at import time, and then to split the repo into smaller pieces so that the SDK does not need to import BasicCrawler -> HttpxClient -> browserforge when it is not even using BasicCrawler.

When code is using BasicCrawler, importing browserforge is legitimate.

Correct in what sense? If we can't control when browserforge stops fetching stuff on import time, then I believe importing conditionally (temporarily, until browserforge knows better) is slightly better than having a different implementation of HeaderGenerator in the httpx client.

As per the discussion on Monday, I understood there are two separate issues:

1. browserforge doing stuff on import time
2. The fact that we need browserforge when running `pip install crawlee` (without extras)

The open PR in the browserforge repo addresses the first point.
This PR addresses the second point and is needed regardless of the first point (until we split the repo into smaller pieces).

I mean, as long as the difference in behavior for httpx client is temporary, I guess this is OK. But, the lazy import would be easier to revert once browserforge stops running stuff on import.

I agree. I am not enthusiastic about this change either and I would rather not have it, but this was agreed on at the meeting.

Anyway, the browserforge PR does not have any activity. This is a proposed workaround until that PR is merged:
#1073
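The lazy import suggested in this thread can be sketched roughly as follows. This is only an illustration of the general pattern, not the change made in #1073: `LazyBackend` is a hypothetical name, and the stdlib module `colorsys` stands in for the heavy dependency so that the example runs anywhere.

```python
import importlib
from types import ModuleType
from typing import Optional


class LazyBackend:
    """Holds a module name and imports the module only on first use."""

    def __init__(self, module_name: str) -> None:
        self._module_name = module_name
        self._module: Optional[ModuleType] = None

    @property
    def loaded(self) -> bool:
        """True once the import has happened through this object."""
        return self._module is not None

    def get(self) -> ModuleType:
        """Import on the first call; reuse the cached module afterwards."""
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return self._module


# `colorsys` is a lightweight stdlib stand-in for the heavy dependency.
backend = LazyBackend("colorsys")
```

Constructing `LazyBackend` imports nothing; any import-time side effects of the wrapped module are deferred until `backend.get()` is first called, which is what makes this kind of workaround easy to revert later.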

vdusek

> What is the benefit of that?
> Including a smaller set would reduce the HttpxClient user agent header diversity, not only compared to browserforge-based headers but also compared to the previous header generator. The maintenance cost is the same for 1000 or 10 samples. The only difference is the size of the file, which is negligible compared to the size of the repo.

There are 1000 samples, but only ~200 are unique. We could use the X most common ones with even probability, achieving nearly the same result with 90% fewer lines of code.

Also, the Playwright headers aren't used at all.

However, considering it's a tmp fix, I won't discuss it further.

Also, it should be a fix, shouldn't it?
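The suggestion above about keeping only the most common user agents and sampling them with even probability could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, and the sample strings are short placeholders, not entries from the Apify fingerprint dataset.

```python
import random
from collections import Counter


def most_common_user_agents(samples: list[str], top_n: int) -> list[str]:
    """Deduplicate the samples and keep the `top_n` most frequent ones."""
    counts = Counter(samples)
    return [user_agent for user_agent, _ in counts.most_common(top_n)]


def pick_user_agent(pool: list[str], rng: random.Random) -> str:
    """Pick one user agent from the pool with even probability."""
    return rng.choice(pool)


# Placeholder data: three distinct values with different frequencies.
samples = ["ua-a"] * 5 + ["ua-b"] * 3 + ["ua-c"]
pool = most_common_user_agents(samples, top_n=2)
```

With this approach the stored list shrinks to the unique top entries, at the cost of no longer reflecting how often each user agent appeared in the original sample.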

Pijukatel · 2025-03-11T10:00:50Z

> Also, it should be a fix, shouldn't it?

Well, it is a fix from the apify-sdk-python point of view, but it is just a refactor from the crawlee-python point of view. From the crawlee-python point of view, nothing was broken.

Pijukatel · 2025-03-17T07:33:34Z

Closed in favor of: #1075

Make browserfore dependency optional.

694bc36

Add temporary fallback header generator for Httpx client.

github-actions bot assigned Pijukatel Mar 10, 2025

github-actions bot added this to the 110th sprint - Tooling team milestone Mar 10, 2025

github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Mar 10, 2025

Pijukatel added the adhoc Ad-hoc unplanned task added during the sprint. label Mar 10, 2025

Pijukatel marked this pull request as ready for review March 10, 2025 13:16

Pijukatel requested review from janbuchar and vdusek March 10, 2025 13:16

vdusek requested changes Mar 10, 2025

janbuchar reviewed Mar 10, 2025

Move fallabck generator related code out of fingerprint_suite

ae9be3b

vdusek reviewed Mar 11, 2025

janbuchar changed the title ~~refactor: Make browserfore dependency optional~~ refactor: Make browserforge dependency optional Mar 11, 2025

vdusek mentioned this pull request Mar 12, 2025

fix: Temporary workaround for browserforge import time code execution #1073

Merged

Pijukatel closed this Mar 17, 2025
