Refactor generation of `llms.txt` #4

Viicos · 2025-03-17T14:24:13Z

Fixes #1.

Missing doc/tests updates, but opening to get early feedback on this one.

aristideubertas · 2025-03-18T00:08:20Z

Thank you for this PR, it will be very useful for many projects.

"We furthermore propose that pages on websites that have information that might be useful for LLMs to read provide a clean markdown version of those pages at the same URL as the original page, but with .md appended. (URLs without file names should append index.html.md instead.)"

Will something like this also likely be a feature at some point? I understand it might not be completely relevant to this PR.
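
To make the quoted rule concrete, here is a minimal Python sketch of the URL mapping it describes. The function name `markdown_url` is made up for illustration. Treating any path that is empty or ends in `/` as a "URL without a file name" is an assumption, since the quoted excerpt does not say how that case is detected.

```python
from urllib.parse import urlsplit, urlunsplit


def markdown_url(page_url: str) -> str:
    """Return the URL where a Markdown version of `page_url` would live."""
    parts = urlsplit(page_url)
    path = parts.path
    if path == "" or path.endswith("/"):
        # Assumption: no file name in the URL, so append index.html.md.
        path = f"{path or '/'}index.html.md"
    else:
        # A file name is present: append .md to it.
        path = f"{path}.md"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(markdown_url("https://example.com/docs/page.html"))
print(markdown_url("https://example.com/docs/"))
```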

pawamoy · 2025-03-18T12:45:14Z

@Viicos thanks a lot for the PR! Reviewing now. @aristideubertas I believe that's what this PR does: it creates a .md page for each selected HTML page, and links to them in the root llms.txt file.

pawamoy

Looks super good thanks! I only have a few chore/nit/suggestion comments 🙂

src/mkdocs_llmstxt/_internal/debug.py

src/mkdocs_llmstxt/_internal/plugin.py

src/mkdocs_llmstxt/_internal/preprocess.py

src/mkdocs_llmstxt/_internal/plugin.py

README.md

Viicos · 2025-03-20T15:53:57Z

One thing that would also be great to have is a way to disable the plugin in development (i.e. when running mkdocs serve), as the HTML parsing and Markdown conversion take quite some time. Unfortunately, MkDocs doesn't provide any mechanism to differentiate between dev and build. Do you have any insights on how we could achieve this?

pawamoy · 2025-03-20T22:03:10Z

You could use an env var:

https://www.mkdocs.org/user-guide/configuration/#enabled-option

Or we could implement the enabled config option ourselves, and allow values like true (always), false (never), serve (only when serving), build (only when building).
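
A rough sketch of how the second idea could be evaluated, assuming the plugin can somehow learn which command is running. Everything here is hypothetical: `plugin_is_active`, the `Enabled` alias, and the `command` argument are illustrative names, not part of the plugin or of MkDocs.

```python
from typing import Union

# Hypothetical values for an `enabled` option, as floated above:
# True (always), False (never), "serve", or "build".
Enabled = Union[bool, str]


def plugin_is_active(enabled: Enabled, command: str) -> bool:
    """Decide whether the plugin should run for the given command.

    `command` is assumed to be the name of the running MkDocs command,
    for example "serve" or "build"; obtaining it is out of scope here.
    """
    if isinstance(enabled, bool):
        return enabled
    if enabled not in ("serve", "build"):
        raise ValueError(f"unsupported value for enabled: {enabled!r}")
    return enabled == command


print(plugin_is_active("build", "serve"))
```

With `enabled` set to "build", the expensive conversion would be skipped while serving and run only for real builds.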

Viicos · 2025-03-21T15:13:53Z

> You could use an env var:
>
> https://www.mkdocs.org/user-guide/configuration/#enabled-option
>
> Or we could implement the enabled config option ourselves, and allow values like true (always), false (never), serve (only when serving), build (only when building).

Ah great, wasn't aware of this new MkDocs feature, it perfectly fulfills the use case.

Viicos · 2025-03-21T15:49:22Z

One thing that we also discussed internally at Pydantic:

As we mentioned in #1 (comment), we were asking ourselves about llms-full.txt.

The spec mentions (and was recently updated about this) the FastHTML project as an example, and states:

> This proposal does not include any particular recommendation for how to process the llms.txt file, since it will depend on the application. For example, the FastHTML project opted to automatically expand the llms.txt to two markdown files with the contents of the linked URLs, using an XML-based structure suitable for use in LLMs such as Claude. The two files are: llms-ctx.txt, which does not include the optional URLs, and llms-ctx-full.txt, which does include them.

I'm not sure if this XML-based structure is standardized. They mention their llms_txt2ctx CLI tool, which seems to create an "XML LLM context". I have no idea if LLMs are able to understand this file.

It seems like a lot of confusion arose from this example. llms.txt hubs such as https://llmstxthub.com mention both llms.txt and llms-full.txt files as if they were something standard, but each website seems to implement the "full" output as it sees fit (this one, for instance, just expands the MD content, which seems to be the most common pattern).

With @samuelcolvin, we think that having an llms-full.txt file generated (by expanding/concatenating all the linked MD files in llms.txt) is worthwhile ¹. This isn't specified anywhere, however, so what I can propose is that we introduce a hook in this plugin:

A callable that would take two arguments: the llms.txt text content and a dict[str, list[PageInfo]] (a mapping of section names to lists of page infos; PageInfo could be a stripped-down version of the current _MDPageInfo structure in this PR, and would only include the title and content). The callable would return the content of the llms-full.txt file (name can either be hardcoded, or configurable).
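
For illustration, the proposed hook signature could look roughly like this. It is only a sketch of the idea described above: `PageInfo`, `FullOutputHook`, and `concatenate_pages` are hypothetical names, and the way pages are joined is one arbitrary choice among many.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PageInfo:
    """Hypothetical stripped-down page record: title and Markdown content only."""

    title: str
    content: str


# Signature of the proposed hook: (llms.txt text, sections) -> llms-full.txt text.
FullOutputHook = Callable[[str, Dict[str, List[PageInfo]]], str]


def concatenate_pages(llms_txt: str, sections: Dict[str, List[PageInfo]]) -> str:
    """Example hook: keep the llms.txt text, then append every page by section."""
    parts = [llms_txt.strip()]
    for section_name, pages in sections.items():
        for page in pages:
            parts.append(f"# {section_name}: {page.title}")
            parts.append(page.content.strip())
    return "\n\n".join(parts) + "\n"


hook: FullOutputHook = concatenate_pages
full_text = hook(
    "# Demo\n\n> Short summary.\n",
    {"Usage": [PageInfo(title="Overview", content="Hello.\n")]},
)
print(full_text)
```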

What do you think?

¹ Although I have no idea if such a file is taken into account by any LLM.

pawamoy · 2025-03-21T16:04:38Z

> llms.txt hubs such as https://llmstxthub.com/ mention both llms.txt and llms-full.txt files as if they were something standard

Right, that must be why I got confused too. I'll drop them a message on their Discord server (fastai's server, there's a llmstxt channel).

> With @samuelcolvin, we think that having an llms-full.txt file generated (by expanding/concatenating all the linked MD files in llms.txt) is worthwhile.

I share that sentiment. Context windows are likely to grow in the future so even a big file could be loaded by future models.

> The callable would return the content of the llms-full.txt file (name can either be hardcoded, or configurable).

...and? 😄 How would the return value be used? Do you mean the plugin would execute this hook, get the return value, and... generate the /llms-full.txt file at the site root? Isn't it easier to do this automatically from the configuration of the llms.txt file (the sections with all links), where we concatenate internal pages (like the plugin does currently)? Then we'd just have a toggle to enable/disable llms-full.txt generation?

pawamoy · 2025-03-21T16:10:10Z

OK my confusion likely comes from https://llmstxt.site/ and https://directory.llmstxt.cloud/ rather than https://llmstxthub.com/.

Viicos · 2025-03-21T16:10:28Z

> Isn't it easier to do this automatically from the configuration of the llms.txt file (the sections with all links), where we concatenate internal pages (like the plugin does currently)? Then we'd just have a toggle to enable/disable llms-full.txt generation?

We can also go this way, but it will make the plugin "opinionated" in some way as this llms-full.txt file isn't part of the spec. But I'm happy to do so as well. Anyway, the spec is highly subject to change and users of the plugin should expect (breaking) changes.

pawamoy · 2025-03-21T16:11:58Z

> We can also go this way, but it will make the plugin "opinionated" in some way as this llms-full.txt file isn't part of the spec. But I'm happy to do so as well. Anyway, the spec is highly subject to change and users of the plugin should expect (breaking) changes.

Understood. Well, I'd probably prefer being opinionated anyway (as would users, I'm sure, who generally prefer declarative stuff rather than having to hook into Python with custom scripts), at least until we get a clear answer. I posted this message on fastai's Discord server: https://discord.com/channels/689892369998676007/1279960087221239808/1352675464870625423.

pawamoy · 2025-03-24T14:20:10Z

Answer from Jeremy Howard (@jph00, feel free to unsubscribe, and thanks!):

> Yes, informally it's a thing. Since no clients yet understand llms.txt natively, we have to expand the links and make that expanded version available. But hopefully at some point this won't be needed any more since model clients should just follow the links themselves.

Viicos · 2025-03-26T17:16:09Z

Got it 👍 I'll push a commit enabling support for the full output (with a boolean configuration to enable/disable it).

pawamoy · 2025-04-02T13:11:05Z

Nice! PR starts to look complete to me 🙂

Viicos · 2025-04-02T13:27:39Z

The current llms.txt output for the plugin looks wrong, I need to investigate

pawamoy · 2025-04-02T14:50:33Z

    # mkdocs-llmstxt

    > MkDocs plugin to generate an /llms.txt file.

    This plugin automatically generates llms.txt files.

    ## Usage documentation

    - [Overview](http://127.0.0.1:8000/mkdocs-llmstxt/index.md)
    - [<code class="doc-symbol doc-symbol-nav doc-symbol-module"></code> mkdocs_llmstxt](http://127.0.0.1:8000/mkdocs-llmstxt/reference/mkdocs_llmstxt/index.md)

I suppose you mean the <code> thing? It comes from the title of the API reference page, which is auto-generated. It will be fixed if I update the project from the template. I'll do that in main and you can merge/rebase.

pawamoy · 2025-04-02T14:54:30Z

Done!

Viicos · 2025-04-07T15:44:56Z

> Done!

Ah great, thanks. I just rebased and can confirm it works as expected. I think this should be ready now.

pawamoy · 2025-04-08T11:18:08Z

Thank you so much for your work on this @Viicos! I've pushed a few changes (nitpicks). Locally mypy is bothering me with this:

    src/mkdocs_llmstxt/_internal/plugin.py:184: error: f-string expression part cannot include a backslash  [syntax]

...yet the code seems to work fine, so not sure why we get this warning.

EDIT: yep code runs fine, just not on Python below 3.12 😅
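
For context, Python versions before 3.12 reject a backslash inside the expression part of an f-string, which matches the error above. The snippet below is a generic illustration of the usual workaround (compute the value first, then interpolate it). It is not the actual code from plugin.py, and the variable names are made up.

```python
lines = ["first", "second"]

# Before Python 3.12 the following line is a SyntaxError, because the
# expression part of the f-string contains a backslash:
#     text = f"Pages:\n{'\n'.join(lines)}"

# Portable alternative: build the joined string first, then interpolate it.
# A backslash in the literal part of the f-string is fine on all versions.
joined = "\n".join(lines)
text = f"Pages:\n{joined}"

print(text)
```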

pawamoy · 2025-04-08T11:20:48Z

I noticed something at the end of the full file too:

{element.find('code').get_text()}

```

", "html.parser"))  # type: ignore[union-attr]

```

pawamoy · 2025-04-08T11:21:52Z

Ah right, it's because the source code contains HTML, which is not escaped. Not sure how to fix this 🤔 Could be a bug in Markdownify.

Viicos · 2025-04-08T12:30:40Z

Yeah, it probably is a bug. It's a bit weird, though, as I can't manage to produce an MRE.

pawamoy · 2025-04-08T12:33:17Z

Could be mdformat too 🤔 But seems less likely.

pawamoy · 2025-04-08T12:35:56Z

Or maybe our own code:

    # Remove line numbers from code blocks.
    for element in soup.find_all("table", attrs={"class": "highlighttable"}):
        element.replace_with(Soup(f"<pre>{element.find('code').get_text()}</pre>", "html.parser"))  # type: ignore[union-attr]

(would be funny that the code that triggered the issue is also the code that made us aware of it)

But this is operating on the soup, so I don't see why this would cause escaping issues.

pawamoy · 2025-04-08T12:39:18Z

Ah, wait, yes, this might be it. We replace the table with a pre, but element.find('code').get_text() in this case contains HTML too (this very same code 💫 😂). I think we should escape the result of element.find('code').get_text() 🙂
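
The effect can be seen with the standard library alone. This is a simplified illustration with a made-up `code_text` value, not the plugin's code: once text that looks like HTML is pasted into new markup without escaping, a parser sees extra tags.

```python
import html

# Text of a code block that itself contains HTML-looking content.
code_text = 'Soup(f"<pre>{text}</pre>", "html.parser")'

# Unescaped: the inner <pre> and </pre> become real tags once parsed.
unescaped = f"<pre>{code_text}</pre>"

# Escaped: the inner angle brackets become entities and stay plain text.
escaped = f"<pre>{html.escape(code_text)}</pre>"

print(unescaped)
print(escaped)
```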

Viicos · 2025-04-08T12:44:12Z

> But this is operating on the soup, so I don't see why this would cause escaping issues.

This is actually it, I'm looking into it.

Viicos · 2025-04-08T12:52:51Z

Fixed and confirmed that it works with:

    import html
    from itertools import chain

    import mdformat
    from bs4 import BeautifulSoup as Soup, Tag
    from markdownify import ATX, MarkdownConverter


    def _language_callback(tag: Tag) -> str:
        # Look for a "language-*" CSS class on the tag or on its parent.
        for css_class in chain(tag.get("class") or (), (tag.parent.get("class") or ()) if tag.parent else ()):
            if css_class.startswith("language-"):
                return css_class[9:]
        return ""


    _converter = MarkdownConverter(
        bullets="-",
        code_language_callback=_language_callback,
        escape_underscores=False,
        heading_style=ATX,
    )

    # Rendered HTML of a code block whose source itself contains HTML.
    h = '''
    <div class="language-python highlight"><table class="highlighttable"><tbody><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre id="__code_12"><span></span><button class="md-clipboard md-icon" title="Copy to clipboard" data-clipboard-target="#__code_12 > code"></button><code tabindex="0"><span class="n">element</span><span class="o">.</span><span class="n">replace_with</span><span class="p">(</span><span class="n">Soup</span><span class="p">(</span><span class="sa">f</span><span class="s2">"&lt;pre&gt;</span><span class="si">{</span><span class="n">element</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s1">'code'</span><span class="p">)</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span><span class="si">}</span><span class="s2">&lt;/pre&gt;"</span><span class="p">,</span> <span class="s2">"html.parser"</span><span class="p">))</span>  <span class="c1"># type: ignore[union-attr]</span>
    </code></pre></div></td></tr></tbody></table></div>
    '''

    soup = Soup(h, "html.parser")
    # Replace each line-numbered code table with a plain <pre>, escaping the
    # code text so that HTML inside it is not parsed as markup.
    for element in soup.find_all("table", attrs={"class": "highlighttable"}):
        element.replace_with(Soup(f"<pre>{html.escape(element.find('code').get_text())}</pre>", "html.parser"))  # type: ignore[union-attr]

    print(mdformat.text(_converter.convert_soup(soup), options={"wrap": "no"}))

pawamoy · 2025-04-08T12:57:14Z

Fantastic, thanks!

pawamoy · 2025-04-08T12:57:43Z

Cool case of literate bug 😄

Viicos · 2025-04-08T13:04:55Z

> Cool case of literate bug 😄

That was quite meta indeed, thanks for merging and the reviews

dreamorosi · 2025-04-08T14:04:27Z

Thank you both for working on this! When will it be released?

Viicos · 2025-04-08T14:13:44Z

@dreamorosi see https://github.com/pawamoy/mkdocs-llmstxt/releases/tag/0.2.0.

Refactor generation of llms.txt

429203f

pawamoy reviewed Mar 18, 2025

Feedback

4413782

Viicos commented Mar 19, 2025

src/mkdocs_llmstxt/_internal/plugin.py

Update readme, some tweaks

a06aad4

Viicos marked this pull request as ready for review March 19, 2025 13:31

Viicos commented Mar 19, 2025

README.md

pawamoy reviewed Mar 19, 2025

README.md

readme on one line

0d93bd1

Viicos added 2 commits March 26, 2025 18:40

Add support for full output

1b89ff9

Preserve sections order

255b8ba

Fix own docs

a8e2b2b

A few more nits

6bee2dc

Don't use backslash in f-string

7e2b8ab

Escape html

838d3b9

pawamoy merged commit 1f0e417 into pawamoy:main Apr 8, 2025
25 checks passed

Viicos deleted the issue-1 branch August 7, 2025 15:23
