
Commit 9b90e7a: "Serve llms.txt in proxito"
1 parent: 4e3c82a

File tree: 12 files changed, +252 −0 lines changed

docs/user/index.rst

Lines changed: 1 addition & 0 deletions

@@ -51,6 +51,7 @@ Read the Docs: documentation simplified
    /reference/cdn
    /reference/sitemaps
    /reference/404-not-found
+   /reference/llms
    /reference/robots

 .. toctree::

docs/user/reference/features.rst

Lines changed: 5 additions & 0 deletions

@@ -59,6 +59,11 @@ Feature reference
     We provide a default 404 page,
     but you can also customize it.

+⏩️ :doc:`/reference/llms`
+    ``llms.txt`` files communicate expectations to LLM-focused crawlers.
+    We provide a default file,
+    but you can also customize it.
+
 ⏩️ :doc:`/reference/robots`
     `robots.txt` files allow you to customize how your documentation is indexed in search engines.
     We provide a default robots.txt file,

docs/user/reference/llms.rst (new file)

Lines changed: 54 additions & 0 deletions

``llms.txt`` support
====================

``llms.txt`` files describe how large language model crawlers can use your documentation.
They're useful for:

* Signaling which parts of your site AI-focused crawlers should avoid.
* Documenting how models can attribute your content.
* Sharing links (like a sitemap) that help LLM-powered crawlers discover content responsibly.

Read the Docs automatically generates one for you with a configuration that works for most projects.
By default, the automatically created ``llms.txt``:

* Hides versions which are set to :ref:`Hidden <versions:Version states>` from being indexed by LLM crawlers.
* Allows crawling of all other versions.

.. warning::

   ``llms.txt`` files are a signal to cooperating crawlers,
   but they aren't a guarantee that your pages will not be ingested.
   If you require *private* documentation, please see :doc:`/commercial/sharing`.

How it works
------------

You can customize this file to add more rules to it.
The ``llms.txt`` file is served from the **default version** of your project.
Because ``llms.txt`` is served at the top level of your domain,
we must choose a single version to look for the file in,
and the **default version** is the natural choice.

Tool integration
----------------

Documentation tools have different ways of generating an ``llms.txt`` file.
We have examples for some of the most popular tools below.

.. tabs::

   .. tab:: Sphinx

      Sphinx uses the `html_extra_path`_ configuration value to add static files to its final HTML output.
      You need to create an ``llms.txt`` file and put it under the path defined in ``html_extra_path``.

   .. tab:: MkDocs

      MkDocs needs the ``llms.txt`` file to be in the directory defined by the `docs_dir`_ configuration value.

.. _html_extra_path: https://www.sphinx-doc.org/en/master/usage/configuration.html#confval-html_extra_path
.. _docs_dir: https://www.mkdocs.org/user-guide/configuration/#docs_dir

.. seealso::

   :doc:`/reference/robots`
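For the Sphinx case described above, a minimal sketch of what the ``conf.py`` side might look like; the ``_extra`` directory name here is an arbitrary choice, not something this commit prescribes:

```python
# conf.py -- a minimal sketch, assuming an "_extra" directory next to conf.py.
# Files under html_extra_path are copied verbatim into the root of the HTML
# output, so _extra/llms.txt ends up served at /llms.txt of the built docs.
html_extra_path = ["_extra"]
```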

readthedocs/proxito/README.rst

Lines changed: 2 additions & 0 deletions

@@ -144,6 +144,8 @@ What can/can't be cached?

 - ServeRobotsTXT: can be cached, we don't serve a custom robots.txt
   to any user if the default version is private.
+- ServeLLMSTXT: can be cached, we don't serve a custom llms.txt
+  to any user if the default version is private.
 - ServeSitemapXML: can be cached. It displays only public versions, for everyone.
 - ServeStaticFiles: can be cached, all files are the same for all projects and users.
 - Embed API: can be cached for public versions.

readthedocs/proxito/tests/base.py

Lines changed: 9 additions & 0 deletions

@@ -6,15 +6,24 @@
 from django.contrib.auth.models import User
 from readthedocs.storage import get_storage_class
 from django.test import TestCase
+from django.test.utils import override_settings

 from readthedocs.builds.constants import LATEST
 from readthedocs.projects.constants import PUBLIC, SSL_STATUS_VALID
 from readthedocs.projects.models import Domain, Project
 from readthedocs.proxito.views import serve


+proxito_middleware = list(settings.MIDDLEWARE) + [
+    "readthedocs.proxito.middleware.ProxitoMiddleware",
+]
+
+
 @pytest.mark.proxito
+@override_settings(ROOT_URLCONF="readthedocs.proxito.urls", MIDDLEWARE=proxito_middleware)
 class BaseDocServing(TestCase):
+    urls = "readthedocs.proxito.urls"
+
     def setUp(self):
         # Re-initialize storage
         # Various tests override either this setting or various aspects of the storage engine
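The ``proxito_middleware`` list above extends the configured middleware without mutating the original setting. A standalone sketch of the same pattern, with a plain tuple standing in for ``settings.MIDDLEWARE``:

```python
# BASE_MIDDLEWARE is a stand-in for settings.MIDDLEWARE in this sketch.
BASE_MIDDLEWARE = (
    "django.middleware.security.SecurityMiddleware",
    "django.middleware.common.CommonMiddleware",
)

# list(...) copies the sequence, so appending leaves the original untouched;
# the combined list can then be passed to @override_settings(MIDDLEWARE=...).
proxito_middleware = list(BASE_MIDDLEWARE) + [
    "readthedocs.proxito.middleware.ProxitoMiddleware",
]

print(proxito_middleware[-1])
```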

readthedocs/proxito/tests/test_full.py

Lines changed: 38 additions & 0 deletions

@@ -952,6 +952,44 @@ def test_custom_robots_txt_private_version(self):
         )
         self.assertEqual(response.status_code, 404)

+    @mock.patch.object(BuildMediaFileSystemStorageTest, "exists")
+    def test_default_llms_txt(self, storage_exists):
+        storage_exists.return_value = False
+        self.project.versions.update(active=True, built=True)
+        response = self.client.get(
+            reverse("llms_txt"), headers={"host": "project.readthedocs.io"}
+        )
+        self.assertEqual(response.status_code, 200)
+        expected = dedent(
+            """
+            User-agent: *
+
+            Disallow: # Allow everything
+
+            Sitemap: https://project.readthedocs.io/sitemap.xml
+            """
+        ).lstrip()
+        self.assertContains(response, expected)
+
+    def test_custom_llms_txt(self):
+        self.project.versions.update(active=True, built=True)
+        response = self.client.get(
+            reverse("llms_txt"), headers={"host": "project.readthedocs.io"}
+        )
+        self.assertEqual(
+            response["x-accel-redirect"],
+            "/proxito/media/html/project/latest/llms.txt",
+        )
+
+    def test_custom_llms_txt_private_version(self):
+        self.project.versions.update(
+            active=True, built=True, privacy_level=constants.PRIVATE
+        )
+        response = self.client.get(
+            reverse("llms_txt"), headers={"host": "project.readthedocs.io"}
+        )
+        self.assertEqual(response.status_code, 404)
+
     def test_directory_indexes(self):
         self.project.versions.update(active=True, built=True)
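The expected body in ``test_default_llms_txt`` is built with ``textwrap.dedent`` plus ``lstrip``. A standalone sketch of why both calls are needed when the literal is written inside an indented block:

```python
from textwrap import dedent

# A triple-quoted string inside an indented block carries a leading newline
# and the block's indentation on every line; dedent() removes the common
# indentation, and lstrip() drops the leading newline, leaving the exact
# file body the view is expected to produce.
expected = dedent(
    """
    User-agent: *

    Disallow: # Allow everything

    Sitemap: https://project.readthedocs.io/sitemap.xml
    """
).lstrip()

print(expected.splitlines()[0])  # User-agent: *
```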

readthedocs/proxito/tests/test_headers.py

Lines changed: 18 additions & 0 deletions

@@ -346,6 +346,15 @@ def test_cache_headers_robots_txt_with_private_projects_not_allowed(self):
         self.assertEqual(r["CDN-Cache-Control"], "public")
         self.assertEqual(r["Cache-Tag"], "project,project:robots.txt")

+    @override_settings(ALLOW_PRIVATE_REPOS=False)
+    def test_cache_headers_llms_txt_with_private_projects_not_allowed(self):
+        r = self.client.get(
+            "/llms.txt", secure=True, headers={"host": "project.dev.readthedocs.io"}
+        )
+        self.assertEqual(r.status_code, 200)
+        self.assertEqual(r["CDN-Cache-Control"], "public")
+        self.assertEqual(r["Cache-Tag"], "project,project:llms.txt")
+
     @override_settings(ALLOW_PRIVATE_REPOS=True)
     def test_cache_headers_robots_txt_with_private_projects_allowed(self):
         r = self.client.get(

@@ -355,6 +364,15 @@ def test_cache_headers_robots_txt_with_private_projects_allowed(self):
         self.assertEqual(r["CDN-Cache-Control"], "public")
         self.assertEqual(r["Cache-Tag"], "project,project:robots.txt")

+    @override_settings(ALLOW_PRIVATE_REPOS=True)
+    def test_cache_headers_llms_txt_with_private_projects_allowed(self):
+        r = self.client.get(
+            "/llms.txt", secure=True, headers={"host": "project.dev.readthedocs.io"}
+        )
+        self.assertEqual(r.status_code, 200)
+        self.assertEqual(r["CDN-Cache-Control"], "public")
+        self.assertEqual(r["Cache-Tag"], "project,project:llms.txt")
+
     @override_settings(ALLOW_PRIVATE_REPOS=False)
     def test_cache_headers_robots_txt_with_private_projects_not_allowed(self):
         r = self.client.get(
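The ``Cache-Tag`` values asserted above follow a ``<project>,<project>:<resource>`` shape: one coarse tag covering the whole project, plus one scoped to a single resource. A standalone sketch of that convention (the helper name is hypothetical, not the Read the Docs implementation):

```python
def build_cache_tags(project_slug: str, resource: str) -> str:
    # Hypothetical helper illustrating the tag shape seen in the tests:
    # a project-wide tag plus a resource-scoped tag, comma-separated so a
    # CDN purge can target either the whole project or one resource.
    return f"{project_slug},{project_slug}:{resource}"

print(build_cache_tags("project", "llms.txt"))
```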

readthedocs/proxito/urls.py

Lines changed: 2 additions & 0 deletions

@@ -47,6 +47,7 @@
 from readthedocs.proxito.views.hosting import ReadTheDocsConfigJson
 from readthedocs.proxito.views.serve import ServeDocs
 from readthedocs.proxito.views.serve import ServeError404
+from readthedocs.proxito.views.serve import ServeLLMSTXT
 from readthedocs.proxito.views.serve import ServePageRedirect
 from readthedocs.proxito.views.serve import ServeRobotsTXT
 from readthedocs.proxito.views.serve import ServeSitemapXML

@@ -133,6 +134,7 @@
         name="proxito_404_handler",
     ),
     re_path(r"robots\.txt$", ServeRobotsTXT.as_view(), name="robots_txt"),
+    re_path(r"llms\.txt$", ServeLLMSTXT.as_view(), name="llms_txt"),
     re_path(r"sitemap\.xml$", ServeSitemapXML.as_view(), name="sitemap_xml"),
 ]
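The ``re_path`` pattern added above anchors only the end of the path and escapes the dot. A quick pure-``re`` sketch of what that regex accepts and rejects:

```python
import re

# Same pattern as the re_path() above: the backslash makes the dot literal,
# and the trailing $ anchors the match to the end of the path, while the
# unanchored start leaves any URL prefix acceptable.
pattern = re.compile(r"llms\.txt$")

print(bool(pattern.search("llms.txt")))  # True
```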

readthedocs/proxito/views/serve.py

Lines changed: 101 additions & 0 deletions

@@ -736,6 +736,107 @@ class ServeRobotsTXT(SettingsOverrideObject):
     _default_class = ServeRobotsTXTBase


+class ServeLLMSTXTBase(CDNCacheControlMixin, CDNCacheTagsMixin, ServeDocsMixin, View):
+    """Serve llms.txt from the domain's root."""
+
+    cache_response = True
+    project_cache_tag = "llms.txt"
+
+    def get(self, request):
+        """
+        Serve the user's custom ``/llms.txt``.
+
+        If the project is delisted or is a spam project, we force a special llms.txt.
+
+        If the user added an ``llms.txt`` in the "default version" of the
+        project, we serve it directly.
+        """
+        project = request.unresolved_domain.project
+
+        if project.delisted:
+            return render(
+                request,
+                "llms.delisted.txt",
+                content_type="text/plain",
+            )
+
+        if "readthedocsext.spamfighting" in settings.INSTALLED_APPS:
+            from readthedocsext.spamfighting.utils import is_robotstxt_denied  # noqa
+
+            if is_robotstxt_denied(project):
+                return render(
+                    request,
+                    "llms.spam.txt",
+                    content_type="text/plain",
+                )
+
+        version_slug = project.get_default_version()
+        version = project.versions.get(slug=version_slug)
+
+        no_serve_llms_txt = any(
+            [
+                version.privacy_level == PRIVATE,
+                not version.active,
+                not version.built,
+            ]
+        )
+        if no_serve_llms_txt:
+            raise Http404()
+
+        structlog.contextvars.bind_contextvars(
+            project_slug=project.slug,
+            version_slug=version.slug,
+        )
+
+        try:
+            response = self._serve_docs(
+                request=request,
+                project=project,
+                version=version,
+                filename="llms.txt",
+                check_if_exists=True,
+            )
+            log.info("Serving custom llms.txt file.")
+            return response
+        except StorageFileNotFound:
+            pass
+
+        sitemap_url = "{scheme}://{domain}/sitemap.xml".format(
+            scheme="https",
+            domain=project.subdomain(),
+        )
+        context = {
+            "sitemap_url": sitemap_url,
+            "hidden_paths": self._get_hidden_paths(project),
+        }
+        return render(
+            request,
+            "llms.txt",
+            context,
+            content_type="text/plain",
+        )
+
+    def _get_hidden_paths(self, project):
+        hidden_versions = project.versions(manager=INTERNAL).public().filter(hidden=True)
+        resolver = Resolver()
+        hidden_paths = [
+            resolver.resolve_path(project, version_slug=version.slug)
+            for version in hidden_versions
+        ]
+        return hidden_paths
+
+    def _get_project(self):
+        return self.request.unresolved_domain.project
+
+    def _get_version(self):
+        return None
+
+
+class ServeLLMSTXT(SettingsOverrideObject):
+    _default_class = ServeLLMSTXTBase
+
+
 class ServeSitemapXMLBase(CDNCacheControlMixin, CDNCacheTagsMixin, View):
     """Serve sitemap.xml from the domain's root."""
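The ``no_serve_llms_txt`` guard in the view collapses three conditions into one ``any()``. A standalone sketch of the same gate, with a dataclass standing in for the real ``Version`` model:

```python
from dataclasses import dataclass

PRIVATE = "private"


@dataclass
class Version:
    # Stand-in for the real Version model: only the fields the guard reads.
    privacy_level: str
    active: bool
    built: bool


def should_404_llms_txt(version: Version) -> bool:
    # Mirror of the view's guard: refuse to serve llms.txt when the default
    # version is private, inactive, or has never been built.
    return any(
        [
            version.privacy_level == PRIVATE,
            not version.active,
            not version.built,
        ]
    )


print(should_404_llms_txt(Version("public", active=True, built=True)))  # False
```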

(new template file; filename not captured in this view, but the contents match the ``llms.delisted.txt`` template rendered by ``ServeLLMSTXTBase`` above)

Lines changed: 4 additions & 0 deletions

# Delisted project, blocking large language model crawlers
# See: https://docs.readthedocs.io/en/stable/unofficial-projects.html
User-agent: *
Disallow: /
