-
Notifications
You must be signed in to change notification settings - Fork 540
refactor: Make browserforge dependency optional
#1067
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Add temporary fallback header generator for Httpx client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we do not need the PW-related stuff, correct?
Regarding the user-agents:
Random 1000 user agents from Apify fingerprint dataset.
Since this is just a temporary fix, perhaps we could include e.g. only the 10 most common UAs from the list instead (now it is a subset of the dataset).
What is the benefit of that? |
src/crawlee/http_clients/_httpx.py
Outdated
| from crawlee._utils.docs import docs_group | ||
| from crawlee.errors import ProxyError | ||
| from crawlee.fingerprint_suite import HeaderGenerator | ||
| from crawlee.fingerprint_suite._fallback_header_generator import HeaderGenerator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we instead import the real-deal browserforge-based generator when it's actually about to be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think that having runtime conditional imports is correct approach. As far as I see it the correct approach is fixing the browserforge to not execute code on import time and then split the repo into smaller pieces so that sdk does not need to import BasicCrawler -> HttpxClient -> browserforge, when it is not even using BasicCrawler.
When code is using BasicCrawler, then importing browserforge is legitimate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct in what sense? If we can't control when browserforge stops fetching stuff on import time, then I believe importing conditionally (temporarily, until browserforge knows better) is slightly better than having a different implementation of HeaderGenerator in the httpx client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per discussion on Monday I understood there are two separate issues:
- browserforge doing stuff on import time
- The fact that we need browserforge when
pip install crawlee(without extras)
The open PR in browserforge repo is addressing the first point.
This PR is addressing the second point and is needed regardless of the first point (until we split the repo into smaller pieces).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, as long as the difference in behavior for httpx client is temporary, I guess this is OK. But, the lazy import would be easier to revert once browserforge stops running stuff on import.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I am not enthusiastic about this change either and I would rather not have it, but this was agreed on the meeting.
Anyway. Browserforge PR does not have any activity. This is proposal workaround until that PR is merged:
#1073
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the benefit of that?
Including smaller set would downgrade the HttpxClient user agent header diversity compared to not only browserforge based headers, but also compared to the previous header generator. Maintenance cost is the same for 1000 or 10 samples. The only difference is in the size of the file - which is negligible compared to the size of the repo.
There are 1000 samples, but only ~200 are unique. We could use the X most common ones with even probability, achieving nearly the same result with 90% fewer lines of code.
Also, the Playwright headers aren't used at all.
However, considering it's a tmp fix, I won't discuss it further.
Also, it should be a fix, shouldn't it?
Well, it is |
browserfore dependency optionalbrowserforge dependency optional
|
Closed in favor of: #1075 |
Description
Make
browserforedependency optional.Add temporary fallback header generator for
HttpxHttpClient.This is temporary solution until browserforge is updated to avoid runtime downloads (daijro/browserforge#29) and until
crawleeis split into smaller more dedicated packages.Issues