Does blocking facebookexternalhit also break sharing to social media? #40
Comments
It will, per Meta's docs. I'm OK with blocking Meta's products generally, but I'd weigh how important that functionality is against the traffic you're seeing from that crawler.
Although I don't use Facebook, I'd be surprised if sharing a link on Facebook required the crawler to run. Crawlers typically run on their own schedule rather than synchronously with a user action such as creating a post. Have you experimented by monitoring your site's access logs for that crawler, then posting on Facebook to see whether a fetch occurs before the post becomes available?
I've noticed that blocking …
facebookexternalhit isn't in our list of AI crawlers, so isn't this discussion a bit moot?
Is this not it? robots.txt line 36 at b1491d2: https://github.com/ai-robots-txt/ai.robots.txt/blob/b1491d269460ca57581c2df7cf14b3f3fc4749f3/robots.txt#L36
(if not, what's that list for?)
Ah yes, you're right. I made the mistake of thinking the list was in alphabetical order. Apologies.
It's probably worth alphabetising them; as the list grows, duplicates become more likely. Could be a GitHub pre-commit hook that sorts and de-duplicates the list (see the sketch below)? 🤷🏻‍♂️ EDIT: I realise this is massively off topic, though. Back to the thread point: isn't recommending blocking this bot a risky move, since it could cause websites to lose rich social media embedding (e.g. image and other Open Graph data)?
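A minimal sketch of that pre-commit idea, assuming robots.txt is one contiguous block of User-agent lines followed by shared directives (the script name and layout assumptions are illustrative, not part of this repo):

```python
#!/usr/bin/env python3
"""Hypothetical pre-commit check: sort and de-duplicate the
User-agent lines in robots.txt. Assumes one contiguous block of
"User-agent: ..." lines followed by shared directives, which may
not match the real file's layout."""

from pathlib import Path

PREFIX = "User-agent:"

def normalise(path: str = "robots.txt") -> None:
    lines = Path(path).read_text().splitlines()
    agents = [line for line in lines if line.startswith(PREFIX)]
    rest = [line for line in lines if not line.startswith(PREFIX)]
    # Case-insensitive sort; set() drops exact duplicates.
    deduped = sorted(set(agents), key=str.lower)
    Path(path).write_text("\n".join(deduped + rest) + "\n")

if __name__ == "__main__":
    normalise()
```

Run from a pre-commit hook or CI job, a non-empty diff after this script would flag unsorted or duplicated entries before merge.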
I would say the priority of this project is to block AI crawlers. It is not clear that facebookexternalhit gathers data for AI training, but we don't know that it doesn't, either. I would personally vote to keep it in the list; the possible downsides seem negligible from my perspective. Any website that really needs to allow that user agent simply doesn't have to use the entry.
This particular one is tricky: it is a very aggressive crawler, but unlike some of the others, a lot of our customers would likely notice if their website suddenly stopped displaying article image cards. Maybe the solution here is a comment above the entry describing what it does and what the risks are (sketched below)? Some of these bots will have a much lower risk profile. On the flip side, entries like this one (and potentially the Google ones, too) might have unintended impacts on SEO and social exposure for anyone simply copy-pasting the list into their site to stop these bots from taking down the server.
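For illustration, such an annotated entry might look like the following (the comment wording is invented here, and the real file groups many agents above a shared Disallow):

```
# facebookexternalhit: very aggressive crawler, but it also fetches
# Open Graph data for link previews. Blocking it can stop article
# image cards from rendering when pages are shared on Facebook.
# Weigh crawl load against social embedding before keeping this entry.
User-agent: facebookexternalhit
Disallow: /
```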
I like the sound of this. Would it be possible to demonstrate that these risks are real and not imaginary? (I submitted a PR to extend the FAQ to take your "taking down the server" point into account. Thanks!)
ai.robots.txt/robots.txt, line 33 in 6b8d7f5:
If we block this in robots.txt, will we affect the functionality when URLs are shared to Facebook from the site and Facebook sends that bot to fetch the Open Graph data (title, image, and so on) for the post?
Ideally I'd like to block the bot from crawling / DoS'ing the site but still allow on-demand/cached page requests for OG data when a post is shared. Facebook does not need to crawl an entire site! :)
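robots.txt alone can't express that split, because bulk crawling and on-demand Open Graph fetches arrive under the same user agent. One possible compromise, sketched here with the caveats that Crawl-delay is non-standard, support varies by crawler, and share-triggered fetches may ignore robots.txt entirely, is to leave the agent allowed but ask it to crawl more slowly:

```
# Sketch only: keep Open Graph fetches working while requesting a
# slower crawl. Crawl-delay is non-standard and may be ignored.
User-agent: facebookexternalhit
Crawl-delay: 10
Disallow:
```

An empty Disallow permits everything; the trade-off is that nothing here actually prevents aggressive crawling if the agent ignores the hint.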