In zimit, at the beginning of scraper execution, a check named check_url is performed (zimit/zimit.py, lines 467 to 508 at a62f31e).
This check seems intended to verify URL validity and clean the URL, including by following redirects.
It is however doing some harm: since the request is made with the Python requests library, anti-bot protections are regularly triggered.
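For context, here is a minimal sketch of what such a redirect-following check typically looks like with requests. This is an illustrative assumption, not the actual zimit implementation, and the function name is hypothetical:

```python
import requests

def check_url_sketch(url: str, timeout: float = 10.0) -> str:
    """Hypothetical sketch: validate a URL and resolve its redirect chain."""
    resp = requests.get(
        url,
        allow_redirects=True,  # follow redirects to obtain the final URL
        stream=True,           # avoid downloading the full response body
        timeout=timeout,
    )
    resp.raise_for_status()    # a 403/429 from an anti-bot layer surfaces here
    return resp.url            # cleaned, post-redirect URL
```

It is this kind of plain requests call that gets flagged, independently of what Browsertrix later does with a real browser.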
See #255 for instance, where removing the check_url (manually on my machine) allows Browsertrix to proceed (even if I'm not sure it will finish; protections might still stop us at some point). The same problem occurs in #232. And we have many cases reported in the weekly routine where the youzim.it task is stopped by a Python error, i.e. something that happened in check_url.
We tried to improve the situation with #229, and while it is way better now, it is still not sufficient. Advanced anti-bot protections are not fooled by the user agent and still identify us as a bot (probably via TLS fingerprinting techniques).
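To illustrate why header tweaks alone fall short (a sketch assuming #229 essentially sends browser-like headers; the header value below is just an example):

```python
import requests

# A browser-like User-Agent only changes the HTTP headers.
# The TLS handshake is still performed by Python's stack (urllib3/OpenSSL),
# whose fingerprint (cipher suites, extension order, ALPN) differs from a
# real browser, so fingerprint-based protections can still flag the request.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
}
resp = requests.get("https://example.com", headers=headers, timeout=10)
print(resp.status_code)
```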
I'm not sure how to move this forward, but clearly there is something to do.
I wonder if we should simply remove this URL check; it seems to me it is doing more harm than good, and it is the user's responsibility to input proper URLs. Do we have any notes / recollection of why exactly this was introduced?
Note that doing the check and simply ignoring any errors returned is not sufficient, since doing the check usually triggers a temporary ban of our scraper IP.
Another option would be to introduce a CLI flag to optionally disable this check (see the sketch below), but I feel like this scraper already has too many flags, and on youzim.it it would be hard for the end user to know they should disable it. And if they first ran the scraper with the check, the IP might be banned and they would have to wait (without really knowing about it) before running the scraper again without the check.
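For the record, such an opt-out could look like the following; the --skip-url-check flag name is purely hypothetical, not an existing zimit option:

```python
import argparse

parser = argparse.ArgumentParser(prog="zimit")
parser.add_argument(
    "--skip-url-check",
    action="store_true",
    help="Hypothetical flag: skip the initial check_url request "
         "(URL validation and redirect resolution) before crawling.",
)
args = parser.parse_args()

if not args.skip_url_check:
    # check_url(...) would run here, as it does today
    pass
```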
This might be solved by the upgrade to 1.0.0-beta5, where we will probably be able to remove the check_url operation: the redirect will be handled by Browsertrix itself and its target considered a seed (and hence not suffer from scope issues).
The only thing we would probably keep is the removal of the default 443 and 80 ports (see the sketch below).
To be confirmed in the PR for upgrading to 1.0.0-beta5
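Stripping the default ports is a small URL-normalization step. A self-contained sketch (not the actual zimit code; the helper name is hypothetical):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_default_port(url: str) -> str:
    """Sketch: drop an explicit :80 (http) or :443 (https) from a URL."""
    parts = urlsplit(url)
    default = {"http": 80, "https": 443}.get(parts.scheme)
    if parts.port is not None and parts.port == default:
        netloc = parts.hostname
        if parts.username:
            creds = parts.username + (f":{parts.password}" if parts.password else "")
            netloc = f"{creds}@{netloc}"
        parts = parts._replace(netloc=netloc)
    return urlunsplit(parts)

assert strip_default_port("https://example.com:443/path") == "https://example.com/path"
assert strip_default_port("http://example.com:8080/") == "http://example.com:8080/"
```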