Add option to allow all URLs to be crawlable via robots.txt #2107

Merged
1 commit merged into shlinkio:develop from feature/robots-allow-all on Apr 22, 2024

Conversation

acelaya (Member) commented Apr 21, 2024

Closes #2108

As discussed in #2067 (reply in thread), this PR adds an option to make Shlink return a robots.txt that allows all URLs to be crawlable, except REST API ones.

The option is disabled by default, for backwards compatibility.
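
For illustration, a rough sketch of the intended effect on the served robots.txt when the option is enabled, assuming the REST API is the only path left disallowed (the exact directives below are an assumption, not taken from this PR's diff):

:~$ curl https://example.com/robots.txt # hypothetical output with the new option enabled
User-agent: *
Disallow: /rest/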

codecov bot commented Apr 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.11%. Comparing base (986f116) to head (163244f).
Report is 4 commits behind head on develop.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #2107   +/-   ##
==========================================
  Coverage      96.10%   96.11%           
- Complexity      1423     1424    +1     
==========================================
  Files            263      263           
  Lines           5113     5116    +3     
==========================================
+ Hits            4914     4917    +3     
  Misses           199      199           

☔ View full report in Codecov by Sentry.

acelaya marked this pull request as ready for review April 22, 2024 07:17
acelaya merged commit 59fa088 into shlinkio:develop Apr 22, 2024
23 checks passed
acelaya deleted the feature/robots-allow-all branch April 22, 2024 07:23
dhow commented May 24, 2024

@acelaya -- I was reading your change and tried adding the env variables ROBOTS_ALLOW_ALL=TRUE and ROBOTS_ALLOW_ALL_SHORT_URLS=TRUE, but I didn't see a difference in the generated robots.txt. I'm running the 4.1.1 docker container. I might have missed something, but what is the expected behavior (in robots.txt)?

:~$ curl https://xxx/robots.txt # ROBOTS_ALLOW_ALL_SHORT_URLS=TRUE
User-agent: *
Disallow: /
:~$ curl https://xxx/robots.txt # ROBOTS_ALLOW_ALL_SHORT_URLS=FALSE
User-agent: *
Disallow: /
:~$ curl https://xxx/robots.txt # ROBOTS_ALLOW_ALL=TRUE
User-agent: *
Disallow: /
:~$ curl https://xxx/robots.txt # ROBOTS_ALLOW_ALL=FALSE
User-agent: *
Disallow: /
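
For reference, a minimal sketch of how such a variable would be passed to the container (assuming the shlinkio/shlink image and the variable name mentioned above; port and flags are illustrative, and whether the variable takes effect depends on the Shlink version, as clarified below):

:~$ docker run -d -p 8080:8080 \
      -e ROBOTS_ALLOW_ALL_SHORT_URLS=true \
      shlinkio/shlink:4.1.1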

FYI, for Facebook link validation (unfortunately you need a Facebook account to try... 😄 ) I'm currently using the following patch on /etc/shlink/module/Core/src/Action/RobotsAction.php :

--- RobotsAction.php    2024-04-14 16:13:41.000000000 +0900
+++ RobotsAction.php-modified   2024-04-19 21:11:37.891032753 +0900
@@ -33,6 +33,9 @@
         # For more information about the robots.txt standard, see:
         # https://www.robotstxt.org/orig.html
 
+        User-agent: facebookexternalhit
+        Disallow: 
+
         User-agent: *
 
         ROBOTS;
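
With that patch applied, the served robots.txt should look roughly like this (a sketch based on the diff above and the default output shown earlier; any header comments omitted):

:~$ curl https://xxx/robots.txt # after applying the patch above
User-agent: facebookexternalhit
Disallow:

User-agent: *
Disallow: /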

acelaya (Member, Author) commented May 24, 2024

This feature is not yet released. It will ship with v4.2.0.

dhow commented May 24, 2024

Roger @acelaya !!
