Does blocking facebookexternalhit also break sharing to social media? #40

Open
njt1982 opened this issue Sep 23, 2024 · 10 comments

@njt1982

njt1982 commented Sep 23, 2024

User-agent: facebookexternalhit

If we block this in robots.txt, will it break the case where a URL from the site is shared to Facebook and Facebook sends that bot to get the Open Graph data for things like title and image for the post?

Ideally I'd like to block the bot from crawling / DoS'ing the site but still allow on-demand/cached page requests for OG data when a post is shared. Facebook does not need to crawl an entire site! :)
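For reference, here is a minimal sketch of how a robots.txt rule for this user agent is evaluated, using Python's standard `urllib.robotparser`. The robots.txt content and the URLs are made up for illustration. Because rules are keyed only on user agent and path, a rule like this cannot tell a share-triggered fetch from a broader crawl, and it only has an effect if the bot chooses to honour robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block only facebookexternalhit, allow everyone else.
ROBOTS_TXT = """\
User-agent: facebookexternalhit
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/some-post"  # placeholder URL
    for agent in ("facebookexternalhit", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

Under these assumptions, the same answer applies to every path on the site for that user agent, whatever triggered the request.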

@cdransf
Member

cdransf commented Sep 28, 2024

It will, per Meta's docs:

> The primary purpose of FacebookExternalHit is to crawl the content of an app or website that was shared on one of Meta’s family of apps, such as Facebook, Instagram, or Messenger.

I'm OK with blocking Meta's products generally, but I'd weigh how important that functionality is against the traffic that you're seeing from that crawler.

@glyn
Contributor

glyn commented Sep 29, 2024

> Facebook sends that bot to get the Open Graph data for things like title and image for the post

Although I don't use Facebook, I'd be surprised if sharing a link on Facebook required the crawler to run. Crawlers typically run on their own schedule rather than synchronously with a user action such as creating a post. Have you experimented by monitoring access to your website by such a crawler, then posting on Facebook and seeing whether a "crawl" occurs before the post is available?
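One way to run that experiment is to pull the crawler's requests out of the web server's access log and compare their timestamps with the time of the post. The sketch below assumes the common "combined" log format; the sample log lines, IP addresses, and the `crawler_hits` helper are made up for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Captures the timestamp, request line, and user agent of a combined-format log line.
LOG_LINE = re.compile(
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def crawler_hits(
    lines: Iterable[str], needle: str = "facebookexternalhit"
) -> List[Tuple[str, str]]:
    """Return (timestamp, request) pairs whose user agent contains needle."""
    hits = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match and needle.lower() in match.group("agent").lower():
            hits.append((match.group("time"), match.group("request")))
    return hits


if __name__ == "__main__":
    # Invented sample lines; in practice, iterate over the real access log file.
    sample = [
        '203.0.113.5 - - [29/Sep/2024:10:15:02 +0000] "GET /blog/some-post HTTP/1.1" '
        '200 5120 "-" "facebookexternalhit/1.1"',
        '198.51.100.7 - - [29/Sep/2024:10:15:09 +0000] "GET /about HTTP/1.1" '
        '200 2048 "-" "Mozilla/5.0"',
    ]
    for when, request in crawler_hits(sample):
        print(when, request)
```

If a request for the shared URL shows up within seconds of posting, that would suggest the fetch is triggered by the share rather than by a scheduled crawl.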

@paulrudy

I've noticed that blocking facebookexternalhit prevents rich links (cards) from displaying in Apple Messages and Apple Mail (iOS and macOS). The change takes effect immediately: blocking facebookexternalhit immediately prevents a rich link from being added, and allowing it immediately permits one again.

@glyn
Contributor

glyn commented Oct 16, 2024

facebookexternalhit isn't in our list of AI crawlers, so isn't this discussion a bit moot?

@njt1982
Author

njt1982 commented Oct 16, 2024

> facebookexternalhit isn't in our list of AI crawlers, so isn't this discussion a bit moot?

Is this not it?

User-agent: facebookexternalhit

(if not, what's that list for?)

@glyn
Contributor

glyn commented Oct 16, 2024 via email

@njt1982
Author

njt1982 commented Oct 16, 2024

Ah yes, you're right. I made the mistake of thinking the list was in alphabetical order. Apologies.

It's probably worth alphabetising them; as the list grows, duplicates are more likely.

Could there be a GitHub pre-commit command that sorts / uniques the list? 🤷🏻‍♂️
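As a rough sketch of what such a sort / unique step could look like, assuming the list is available as plain user-agent names (the `sort_unique` and `is_sorted_unique` helpers and the sample names are hypothetical, not the project's actual tooling or list):

```python
from typing import Iterable, List


def sort_unique(names: Iterable[str]) -> List[str]:
    """Return names sorted case-insensitively, dropping blanks and case-insensitive duplicates."""
    seen = set()
    unique = []
    for name in names:
        cleaned = name.strip()
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)  # the first spelling seen wins
    return sorted(unique, key=str.casefold)


def is_sorted_unique(names: List[str]) -> bool:
    """True if the list is already clean; usable as a pre-commit or CI check."""
    return list(names) == sort_unique(names)


if __name__ == "__main__":
    # Sample names for illustration only.
    print(sort_unique(["GPTBot", "facebookexternalhit", "CCBot", "gptbot"]))
```

A hook could either rewrite the file with `sort_unique` or fail the commit when `is_sorted_unique` returns False.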

EDIT: I realise this is massively off topic, though.

Back to the thread point... isn't recommending blocking this bot a risky move, as it could cause websites to lose rich social media embedding (e.g. image and other Open Graph data)?

@glyn
Contributor

glyn commented Oct 17, 2024

I would say the priority of this project is to block AI crawlers. It is not clear that facebookexternalhit gathers data for AI training, but we don't know that it doesn't, either. I would personally vote to keep this in the list. The possible downsides seem negligible from my perspective. Any website that really needs to allow that user agent doesn't have to block it.

@njt1982
Author

njt1982 commented Oct 17, 2024

This one is tricky: it is a very aggressive crawler, but, unlike with some of the others, a lot of our customers would likely notice if their website suddenly stopped displaying article image cards.

Maybe the solution here is a comment above it describing what it does and what the risks are?

Some of these will have a much lower risk profile. On the flip side, items like this (and potentially the Google ones, too) might have an unintended impact on SEO and social exposure for those who simply copy-paste the list into their site to try to stop these bots from taking down the server.
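To make the "comment above it" idea concrete, here is a small sketch that renders robots.txt groups with an optional risk note as `#` comment lines above each entry. The `render_robots_txt` helper, the note wording, and `ExampleAIBot` are assumptions for illustration; this is not how the project generates its file.

```python
from typing import Dict, Optional


def render_robots_txt(entries: Dict[str, Optional[str]]) -> str:
    """Render one blocking group per user agent, with any note as '#' comments above it."""
    blocks = []
    for agent, note in entries.items():
        lines = []
        if note:
            lines.extend(f"# {part}" for part in note.splitlines())
        lines.append(f"User-agent: {agent}")
        lines.append("Disallow: /")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(render_robots_txt({
        "facebookexternalhit": (
            "Also fetches Open Graph data when a link is shared.\n"
            "Blocking it may remove rich link previews; delete this group to keep them."
        ),
        "ExampleAIBot": None,  # hypothetical crawler name with no note
    }))
```

Someone copy-pasting the output would then see the trade-off next to the entry and could remove that group before deploying.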

glyn added a commit to glyn/ai.robots.txt that referenced this issue Oct 17, 2024
@glyn
Contributor

glyn commented Oct 17, 2024

> Maybe the solution here is a comment above it describing what it does and what the risks are?

I like the sound of this. Would it be possible to demonstrate that these risks are real and not imaginary?

(I submitted a PR to extend the FAQ to take into account your "taking down the server" point - thanks!)
