Bypass bot detectors #166
Comments
None of the three work (403) (known issue OpenTermsArchive#166)
Hi, I hope using a headless browser will fix this. So I suggest waiting for #138 to be implemented and then checking whether this issue still occurs.
Using a headless browser is not enough to fix this. You have to disguise it (with https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth, for instance), and even then you are still identified by your IP address if you are connecting from server infrastructure rather than a residential connection (DataDome, used on Leboncoin, does this, for instance).
As discussed with @LucasVerneyDGE and @TomHouriezDGE, this option will be needed for some sources, even after #138 is fixed. However, it also raises legal questions. @LucasVerneyDGE will investigate which entities might have the legal power to bypass access control systems, and we will design the most appropriate software architecture (opt-in, opt-out, plugin) based on the legal assessment 🙂
Hi all, jumping back on this matter as we encounter it more and more often. One of the common issues we find is hitting a 403 caused by a Web Application Firewall (WAF). We already encountered 3 of them with
@LVerneyPEReN do you have any news?
The legal analysis by PEReN was still pending on 08/03/2022. Answers from Imperva and Cloudflare are still pending too. To help with prioritisation, instead of being listed in this repository, affected issues are now labeled in each affected instance with dedicated tags (403, timeout…).
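As an illustration of the tagging approach described above, here is a minimal sketch of how a failed fetch could be mapped to one of those tags. The function name, its parameters, and the exact tag strings are assumptions made for this example; they are not taken from the project's code.

```python
from typing import Optional

# Hypothetical tag names, mirroring the "403" and "timeout" tags mentioned above.
TAG_FORBIDDEN = "403"
TAG_TIMEOUT = "timeout"


def failure_tag(status_code: Optional[int], timed_out: bool = False) -> Optional[str]:
    """Return a triage tag for a failed fetch, or None if no tag applies.

    status_code is the HTTP status received, or None if no response arrived.
    timed_out is True when the request gave up before getting a response.
    """
    if timed_out:
        return TAG_TIMEOUT
    if status_code == 403:
        return TAG_FORBIDDEN
    return None
```

A tag of `None` means the failure falls outside the two categories named in this thread and would need to be triaged by hand.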
@LVerneyPEReN has PEReN finished its legal analysis? 🙂 On our side, I believe we never got a reply from either Imperva or Cloudflare (please correct me if I'm wrong @martinratinaud).
Indeed, we did not 😔
Cloudflare maintains a list of verified bots. They state “Cloudflare manually approves well-behaved services that benefit the broader Internet and honor robots.txt.” That page has an “add a bot” link, which requires a Cloudflare account.
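Since honoring robots.txt is one of the criteria Cloudflare quotes, a crawler could check a URL against a site's rules before fetching it. The sketch below uses Python's standard `urllib.robotparser` on an in-memory robots.txt; the rules, the user agent name, and the example.com URLs are made up for illustration and do not describe any of the sites discussed here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content and user agent, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
"""
USER_AGENT = "ExampleTermsBot"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/terms", "https://example.com/account/settings"):
        print(url, is_allowed(ROBOTS_TXT, USER_AGENT, url))
```

Passing robots.txt does not by itself get a bot past a WAF or onto a verified list; it only covers the "honor robots.txt" part of the quoted criteria.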
Hi,
Rakuten and Leboncoin have very strong bot detectors, which prevent automatically fetching their CGUs (terms of use), at least from a regular OVH machine. See https://fr.shopping.rakuten.com/newhelp/conditions-generales/ or https://www.leboncoin.fr/dc/cgu. It is possible that #138 and having JS enabled will help here, but I don't think that will be enough.
Best,
EDIT: The same goes for RueDuCommerce (see https://www.rueducommerce.fr/info/mentions-legales/cgv) and FNAC (https://www.fnac.com/Help/cgv-fnac#bl=footer); they all use the same system, powered by DataDome.