Does blocking facebookexternalhit also break sharing to social media? #40

njt1982 · 2024-09-23T18:29:32Z

ai.robots.txt/robots.txt

Line 33 in 6b8d7f5

User-agent: facebookexternalhit

If we block this in robots.txt, will it affect what happens when URLs from the site are shared to Facebook and Facebook sends that bot to fetch the Open Graph data (such as the title and image) for the post?

Ideally I'd like to block the bot from crawling / DoS'ing the site but still allow on-demand/cached page requests for OG data when a post is shared. Facebook does not need to crawl an entire site! :)

cdransf · 2024-09-28T21:01:46Z

It will, per Meta's docs:

> The primary purpose of FacebookExternalHit is to crawl the content of an app or website that was shared on one of Meta’s family of apps, such as Facebook, Instagram, or Messenger.

I'm OK with blocking Meta's products generally, but I'd weigh how important that functionality is against the traffic that you're seeing from that crawler.
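For anyone who wants to check what a given robots.txt says about this user agent, here is a minimal sketch using Python's standard `urllib.robotparser`. The sample rules and the `example.com` URL are assumptions made up for illustration. The check only reports what the rules say; it does not show whether Meta's crawler honours them or how share previews actually behave.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: blocks facebookexternalhit, allows everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: facebookexternalhit
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and the path are placeholders, not a real site from this thread.
blocked = is_allowed(SAMPLE_ROBOTS_TXT, "facebookexternalhit", "https://example.com/post/1")
others = is_allowed(SAMPLE_ROBOTS_TXT, "SomeOtherBot", "https://example.com/post/1")
print("facebookexternalhit allowed:", blocked)
print("SomeOtherBot allowed:", others)
```

Pasting a site's real robots.txt into `SAMPLE_ROBOTS_TXT` gives a quick way to see whether a copied block list covers this user agent.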

glyn · 2024-09-29T08:51:38Z

> Facebook sends that bot to get the Open Graph data for things like title and image for the post

Although I don't use Facebook, I'd be surprised if sharing a link on Facebook required the crawler to run. Crawlers typically run on their own schedule rather than synchronously with a user action such as creating a post. Have you tried monitoring your website's access logs for this crawler, then posting on Facebook and seeing whether a "crawl" occurs before the post is available?
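The experiment suggested above could be sketched roughly as follows: filter the web server's access log for requests whose User-Agent contains `facebookexternalhit`, then compare the timestamps with the moment a link was shared. This assumes the common "combined" log format. The function name `crawler_hits` and the sample log lines are made up for illustration.

```python
import re

# Assumes the "combined" access-log format; adjust the pattern for other formats.
LOG_PATTERN = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def crawler_hits(lines, agent_substring="facebookexternalhit"):
    """Return (timestamp, path) for each request whose User-Agent contains the substring."""
    hits = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and agent_substring.lower() in match.group("agent").lower():
            hits.append((match.group("time"), match.group("path")))
    return hits


# Made-up sample lines, for illustration only.
sample = [
    '203.0.113.5 - - [14/Oct/2024:02:30:00 +0000] "GET /post/1 HTTP/1.1" 200 512 '
    '"-" "facebookexternalhit/1.1"',
    '203.0.113.9 - - [14/Oct/2024:02:31:00 +0000] "GET /about HTTP/1.1" 200 256 '
    '"-" "Mozilla/5.0"',
]
hits = crawler_hits(sample)
for when, path in hits:
    print(when, path)
```

If a hit for the shared URL shows up within seconds of posting, the fetch is on demand rather than part of a scheduled crawl.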

paulrudy · 2024-10-14T02:38:50Z

I've noticed that blocking facebookexternalhit prevents rich links (cards) from displaying in Apple Messages and Apple Mail (iOS and macOS). The change takes effect immediately: blocking facebookexternalhit immediately prevents a rich link from being added, and allowing it immediately permits one again.

glyn · 2024-10-16T10:30:32Z

facebookexternalhit isn't in our list of AI crawlers, so isn't this discussion a bit moot?

njt1982 · 2024-10-16T10:35:15Z

> facebookexternalhit isn't in our list of AI crawlers, so isn't this discussion a bit moot?

Is this not it?

ai.robots.txt/robots.txt

Line 36 in b1491d2

User-agent: facebookexternalhit

(if not, what's that list for?)

glyn · 2024-10-16T10:54:39Z

Ah yes, you're right. I made the mistake of thinking the list was in alphabetical order. Apologies.


njt1982 · 2024-10-16T16:33:45Z

> Ah yes, you're right. I made the mistake of thinking the list was in alphabetical order. Apologies.

It's probably worth alphabetising them; as the list grows, duplicates are more likely.

Could there be a pre-commit hook that sorts and de-duplicates the list? 🤷🏻‍♂️

EDIT: I realise this is massively off topic, though.

Back to the thread point... isn't recommending blocking this bot a risky move, as it could cause websites to lose rich social media embedding (e.g. image and other Open Graph data)?
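On the sorting aside in the comment above, a minimal sketch of what a sort-and-de-duplicate step might look like. The function name `sort_and_dedupe` and the input names other than `facebookexternalhit` are hypothetical. Treating names that differ only by case as duplicates is an assumption, not something the project has confirmed.

```python
def sort_and_dedupe(names):
    """Sort names case-insensitively, dropping case-insensitive duplicates.

    The first spelling encountered for each name is the one that is kept.
    """
    seen = set()
    unique = []
    for name in names:
        cleaned = name.strip()
        key = cleaned.casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return sorted(unique, key=str.casefold)


# Hypothetical input; only facebookexternalhit is taken from this thread.
agents = ["facebookexternalhit", "ExampleBot", "anotherbot", "examplebot", "FacebookExternalHit"]
result = sort_and_dedupe(agents)
print(result)
```

A step like this could run in a pre-commit hook or in CI, either rewriting the list or failing when the list is not already sorted.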

glyn · 2024-10-17T08:47:13Z

I would say the priority of this project is to block AI crawlers. It is not clear that facebookexternalhit gathers data for AI training, but we don't know that it doesn't either. I would personally vote to keep this in the list. The possible downsides seem negligible from my perspective. Any website that really needs to allow that user agent doesn't have to block it.

njt1982 · 2024-10-17T09:52:26Z

This particular one is a tricky one as it is a very aggressive crawler... but, unlike some of them, a lot of our customers would likely notice if their website suddenly stopped displaying article image cards.

Maybe the solution here is a comment above it describing what it does and what the risks are?

Some of these will have a much lower risk profile... But on the flip side, items like this (and potentially the Google ones, too) might have an unintended impact on site SEO and social exposure for those simply copy-pasting a list into their site to try to stop these bots from taking down the server.


glyn · 2024-10-17T11:32:11Z

> Maybe the solution here is a comment above it describing what it does and what the risks are?

I like the sound of this. Would it be possible to demonstrate that these risks are real and not imaginary?

(I submitted a PR to extend the FAQ to take into account your "taking down the server" point - thanks!)

glyn mentioned this issue Oct 17, 2024

Order entries alphanumerically and case-insensitively #44

Closed

glyn added a commit to glyn/ai.robots.txt that referenced this issue Oct 17, 2024

Augment the "why" FAQ

e6bb7ca

Ref: ai-robots-txt#40 (comment)
