@Viicos
Contributor

@Viicos Viicos commented Mar 17, 2025

Fixes #1.

Missing doc/tests updates, but opening to get early feedback on this one.

@aristideubertas

Thank you for this PR, it will be very useful for many projects.

Quoting https://llmstxt.org:

"We furthermore propose that pages on websites that have information that might be useful for LLMs to read provide a clean markdown version of those pages at the same URL as the original page, but with .md appended. (URLs without file names should append index.html.md instead.)"

Is something like this likely to become a feature at some point? I understand it might not be completely relevant to this PR.
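
The URL rule quoted above can be sketched as a small helper. This is only an illustration of the spec's wording, not code from this PR: the function name `markdown_url` is hypothetical, and treating any path that ends in `/` (or is empty) as a "URL without a file name" is an assumption.

```python
from urllib.parse import urlsplit


def markdown_url(url: str) -> str:
    """Return the URL at which a Markdown version of `url` would be expected."""
    parts = urlsplit(url)
    path = parts.path
    if path == "" or path.endswith("/"):
        # Assumption: no file name in the URL, so append "index.html.md".
        path = (path or "/") + "index.html.md"
    else:
        # A file name is present, so append ".md" to it.
        path += ".md"
    # Rebuild the URL, keeping the scheme, host, query, and fragment unchanged.
    return parts._replace(path=path).geturl()
```

For example, `markdown_url("https://example.com/docs/")` gives `https://example.com/docs/index.html.md`.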

@pawamoy
Owner

pawamoy commented Mar 18, 2025

@Viicos thanks a lot for the PR! Reviewing now. @aristideubertas I believe that's what this PR does: it creates a .md page for each selected HTML page, and links to them in the root llms.txt file.
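
As a rough illustration of the root file described here, the sketch below renders an llms.txt document from a title, a summary, and sections of links to the generated .md pages. It is not the plugin's implementation: `render_llms_txt` and its parameters are hypothetical names, and the layout simply follows the general shape of an llms.txt file (H1 title, blockquote summary, H2 sections with link lists).

```python
def render_llms_txt(title: str, summary: str, sections: dict[str, list[tuple[str, str]]]) -> str:
    """Render an llms.txt document; `sections` maps section names to (link title, URL) pairs."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for name, links in sections.items():
        lines += [f"## {name}", ""]
        # Each selected page is listed as a Markdown link to its .md version.
        lines += [f"- [{link_title}]({url})" for link_title, url in links]
        lines.append("")
    return "\n".join(lines)
```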

Owner

@pawamoy pawamoy left a comment

Looks super good thanks! I only have a few chore/nit/suggestion comments 🙂

@Viicos Viicos marked this pull request as ready for review March 19, 2025 13:31
@Viicos
Contributor Author

Viicos commented Mar 20, 2025

One thing that would be great to have as well is a way to disable the plugin in development (i.e. when running mkdocs serve), as the HTML parsing and Markdown conversion take quite some time. Unfortunately, MkDocs doesn't provide any mechanism to differentiate between dev and build. Do you have any insights on how we could achieve this?

@pawamoy
Owner

pawamoy commented Mar 20, 2025

You could use an env var:

https://www.mkdocs.org/user-guide/configuration/#enabled-option

Or we could implement the enabled config option ourselves, and allow values like true (always), false (never), serve (only when serving), build (only when building).
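
The second option could look roughly like the sketch below. It is a hypothetical helper, not something the plugin provides: the name `plugin_is_active` is made up, it assumes the running command name (`build`, `serve`, or `gh-deploy`) is available to the plugin, for instance from MkDocs' `on_startup` event, and treating `gh-deploy` like `build` is an assumption.

```python
from typing import Union


def plugin_is_active(enabled: Union[bool, str], command: str) -> bool:
    """Decide whether the plugin should run for the given MkDocs command."""
    if isinstance(enabled, bool):
        # true (always) or false (never).
        return enabled
    if enabled == "serve":
        return command == "serve"
    if enabled == "build":
        # Assumption: "gh-deploy" also builds the site, so it counts as a build.
        return command in {"build", "gh-deploy"}
    raise ValueError(f"unsupported value for 'enabled': {enabled!r}")
```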

@Viicos
Contributor Author

Viicos commented Mar 21, 2025

> You could use an env var:
>
> https://www.mkdocs.org/user-guide/configuration/#enabled-option
>
> Or we could implement the enabled config option ourselves, and allow values like true (always), false (never), serve (only when serving), build (only when building).

Ah great, wasn't aware of this new MkDocs feature, it perfectly fulfills the use case.

@Viicos
Contributor Author

Viicos commented Mar 21, 2025

One thing that we also discussed internally at Pydantic:

As we mentioned in #1 (comment), we were asking ourselves about llms-full.txt.

The spec mentions (and was recently updated about this) the FastHTML project as an example, and states:

> This proposal does not include any particular recommendation for how to process the llms.txt file, since it will depend on the application. For example, the FastHTML project opted to automatically expand the llms.txt to two markdown files with the contents of the linked URLs, using an XML-based structure suitable for use in LLMs such as Claude. The two files are: llms-ctx.txt, which does not include the optional URLs, and llms-ctx-full.txt, which does include them.

I'm not sure if this XML-based structure is standardized. They mention their llms_txt2ctx CLI tool, which seems to create an "XML LLM context". I have no idea if LLMs are able to understand this file.

It seems like a lot of confusion arose from this example. llms.txt hubs such as https://llmstxthub.com mention both llms.txt and llms-full.txt files as if it was something standard, but each website seems to implement the "full" output as it sees fit (this one, for instance, just expands the MD content, which seems to be the most common pattern).

With @samuelcolvin, we think that having a llms-full.txt file generated (by expanding/concatenating all the linked MD files in llms.txt) is worthwhile [1]. This isn't specified anywhere, however, so what I can propose is that we introduce a hook in this plugin:

A callable that would take two arguments: the llms.txt text content and a dict[str, list[PageInfo]] (mapping of the section names to the list of page infos — PageInfo could be a stripped down version of the current _MDPageInfo structure in this PR, and would only include the title and content). The callable would return the content of the llms-full.txt file (name can either be hardcoded, or configurable).
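
A minimal sketch of what such a hook could look like, under the assumptions stated in the proposal. `PageInfo` here is a hypothetical stand-in for the stripped-down structure (title and content only), `build_full_output` is a made-up name, and the heading layout of the output is an arbitrary choice since nothing specifies it.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    """Hypothetical stripped-down page info: only the title and the Markdown content."""

    title: str
    content: str


def build_full_output(llms_txt: str, sections: dict[str, list[PageInfo]]) -> str:
    """Example hook: return the content of llms-full.txt by concatenating all pages."""
    parts = [llms_txt.rstrip()]
    for section, pages in sections.items():
        parts.append(f"# {section}")
        for page in pages:
            # Each page is expanded in place under its own heading.
            parts.append(f"## {page.title}\n\n{page.content.strip()}")
    return "\n\n".join(parts) + "\n"
```

The plugin would call the hook once after converting all pages and write the returned string to the output file.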

What do you think?

Footnotes

  1. Although I have no idea if such a file is taken into account by any LLM.

@pawamoy
Owner

pawamoy commented Mar 21, 2025

> llms.txt hubs such as https://llmstxthub.com/ mention both llms.txt and llms-full.txt files as if it was something standard

Right, that must be why I got confused too. I'll drop them a message on their Discord server (fastai's server, there's a llmstxt channel).

> With @samuelcolvin, we think that having a llms-full.txt file generated (by expanding/concatenating all the linked MD files in llms.txt) is worthwhile.

I share that sentiment. Context windows are likely to grow in the future so even a big file could be loaded by future models.

> The callable would return the content of the llms-full.txt file (name can either be hardcoded, or configurable).

...and? 😄 How would the return value be used? Do you mean the plugin would execute this hook, get the return value, and... generate the /llms-full.txt file at the site root? Isn't it easier to do this automatically from the configuration of the llms.txt file (the sections with all links), where we concatenate internal pages (like the plugin does currently)? Then we'd just have a toggle to enable/disable llms-full.txt generation?

@pawamoy
Owner

pawamoy commented Mar 21, 2025

OK my confusion likely comes from https://llmstxt.site/ and https://directory.llmstxt.cloud/ rather than https://llmstxthub.com/.

@Viicos
Contributor Author

Viicos commented Mar 21, 2025

> Isn't it easier to do this automatically from the configuration of the llms.txt file (the sections with all links), where we concatenate internal pages (like the plugin does currently)? Then we'd just have a toggle to enable/disable llms-full.txt generation?

We can also go this way, but it will make the plugin "opinionated" in some way as this llms-full.txt file isn't part of the spec. But I'm happy to do so as well. Anyway, the spec is highly subject to change and users of the plugin should expect (breaking) changes.

@pawamoy
Owner

pawamoy commented Mar 21, 2025

> We can also go this way, but it will make the plugin "opinionated" in some way as this llms-full.txt file isn't part of the spec. But I'm happy to do so as well. Anyway, the spec is highly subject to change and users of the plugin should expect (breaking) changes.

Understood. Well, I'd probably prefer being opinionated anyway (as well as users I'm sure, who generally prefer declarative stuff rather than having to hook into Python with custom scripts), at least until we get a clear answer. I posted this message on fastai's discord server: https://discord.com/channels/689892369998676007/1279960087221239808/1352675464870625423.

@pawamoy
Owner

pawamoy commented Mar 24, 2025

Answer from Jeremy Howard (@jph00, feel free to unsubscribe, and thanks!):

> Yes, informally it's a thing. Since no clients yet understand llms.txt natively, we have to expand the links and make that expanded version available. But hopefully at some point this won't be needed any more since model clients should just follow the links themselves.

@Viicos
Contributor Author

Viicos commented Mar 26, 2025

Got it 👍 I'll push a commit enabling support for the full output (with a boolean configuration to enable/disable it).

@pawamoy
Owner

pawamoy commented Apr 2, 2025

Nice! PR starts to look complete to me 🙂

@Viicos
Contributor Author

Viicos commented Apr 2, 2025

The current llms.txt output for the plugin looks wrong; I need to investigate.

@pawamoy
Owner

pawamoy commented Apr 2, 2025

    # mkdocs-llmstxt

    > MkDocs plugin to generate an /llms.txt file.

    This plugin automatically generates llms.txt files.

    ## Usage documentation

    - [Overview](http://127.0.0.1:8000/mkdocs-llmstxt/index.md)
    - [<code class="doc-symbol doc-symbol-nav doc-symbol-module"></code> mkdocs_llmstxt](http://127.0.0.1:8000/mkdocs-llmstxt/reference/mkdocs_llmstxt/index.md)

I suppose you mean the <code> thing? It comes from the titles of the API reference pages, which are auto-generated. It will be fixed if I update the project from the template. I'll do that in main and you can merge/rebase.

@pawamoy
Owner

pawamoy commented Apr 2, 2025

Done!

@Viicos
Contributor Author

Viicos commented Apr 7, 2025

> Done!

Ah great thanks, I just rebased and can confirm it works as expected. I think this should be ready now.

@pawamoy
Owner

pawamoy commented Apr 8, 2025

Thank you so much for your work on this @Viicos! I've pushed a few changes (nitpicks). Locally mypy is bothering me with this:

src/mkdocs_llmstxt/_internal/plugin.py:184: error: f-string expression part cannot include a backslash  [syntax]

...yet the code seems to work fine, so not sure why we get this warning.

EDIT: yep code runs fine, just not on Python below 3.12 😅
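
For context, a backslash inside the expression part of an f-string is a syntax error before Python 3.12 and only became legal in 3.12. A common workaround, sketched below with made-up variable names (this is not the actual line 184 of plugin.py), is to move the escape sequence into a variable outside the f-string.

```python
titles = ["Overview", "Reference"]

# Valid on Python 3.12+, but rejected at parse time by older interpreters:
#     text = f"{'\n'.join(titles)}"

# Portable alternative: keep the backslash out of the f-string expression.
newline = "\n"
text = f"{newline.join(titles)}"
```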

@pawamoy
Owner

pawamoy commented Apr 8, 2025

I noticed something at the end of the full file too:

{element.find('code').get_text()}

```

", "html.parser"))  # type: ignore[union-attr]

```

@pawamoy
Owner

pawamoy commented Apr 8, 2025

Ah right, it's because the source code contains HTML, which is not escaped. Not sure how to fix this 🤔 Could be a bug in Markdownify.

@Viicos
Contributor Author

Viicos commented Apr 8, 2025

Yeah, it probably looks like a bug. It's a bit weird, as I can't manage to produce an MRE.

@pawamoy
Owner

pawamoy commented Apr 8, 2025

Could be mdformat too 🤔 But seems less likely.

@pawamoy
Owner

pawamoy commented Apr 8, 2025

Or maybe our own code:

    # Remove line numbers from code blocks.
    for element in soup.find_all("table", attrs={"class": "highlighttable"}):
        element.replace_with(Soup(f"<pre>{element.find('code').get_text()}</pre>", "html.parser"))  # type: ignore[union-attr]

(would be funny that the code that triggered the issue is also the code that made us aware of it)

But this is operating on the soup, so I don't see why this would cause escaping issues.

@pawamoy
Owner

pawamoy commented Apr 8, 2025

Ah, wait, yes, this might be it. We replace the table with a pre, but element.find('code').get_text() in this case contains HTML too (this very same code 💫 😂). I think we should escape the result of element.find('code').get_text() 🙂
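
The effect described here can be reproduced without any third-party dependency. The sketch below uses the standard library's `html.parser` instead of BeautifulSoup (a deliberate substitution so that it runs anywhere), and the `TagAndTextCollector` class and `parse` helper are illustrative names. It shows that unescaped code text containing `<pre>` is re-parsed as markup, while escaped text survives intact.

```python
import html
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Record every start tag and all text data seen while parsing."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.chunks.append(data)


def parse(markup: str) -> tuple[list[str], str]:
    collector = TagAndTextCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags, "".join(collector.chunks)


# Text as a get_text() call would return it: plain, already unescaped.
code_text = 'Soup(f"<pre>{x}</pre>", "html.parser")'

# Without escaping, the inner <pre> becomes a real tag and the code is mangled.
raw_tags, raw_text = parse(f"<pre>{code_text}</pre>")

# With escaping, only the outer <pre> is a tag and the code text is preserved.
safe_tags, safe_text = parse(f"<pre>{html.escape(code_text)}</pre>")
```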

@Viicos
Contributor Author

Viicos commented Apr 8, 2025

> But this is operating on the soup, so I don't see why this would cause escaping issues.

This is actually it, I'm looking into it.

@Viicos
Contributor Author

Viicos commented Apr 8, 2025

Fixed and confirmed that it works with:

import html
from itertools import chain

import mdformat
from bs4 import BeautifulSoup as Soup, Tag
from markdownify import ATX, MarkdownConverter

def _language_callback(tag: Tag) -> str:
    # Look for a "language-*" CSS class on the tag or its parent and return the language name.
    for css_class in chain(tag.get("class") or (), (tag.parent.get("class") or ()) if tag.parent else ()):
        if css_class.startswith("language-"):
            return css_class[9:]
    return ""


_converter = MarkdownConverter(
    bullets="-",
    code_language_callback=_language_callback,
    escape_underscores=False,
    heading_style=ATX,
)

h = '''
<div class="language-python highlight"><table class="highlighttable"><tbody><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre id="__code_12"><span></span><button class="md-clipboard md-icon" title="Copy to clipboard" data-clipboard-target="#__code_12 > code"></button><code tabindex="0"><span class="n">element</span><span class="o">.</span><span class="n">replace_with</span><span class="p">(</span><span class="n">Soup</span><span class="p">(</span><span class="sa">f</span><span class="s2">"&lt;pre&gt;</span><span class="si">{</span><span class="n">element</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s1">'code'</span><span class="p">)</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span><span class="si">}</span><span class="s2">&lt;/pre&gt;"</span><span class="p">,</span> <span class="s2">"html.parser"</span><span class="p">))</span>  <span class="c1"># type: ignore[union-attr]</span>
</code></pre></div></td></tr></tbody></table></div>
'''

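soup = Soup(h, "html.parser")
# Replace each line-numbered table with a plain <pre> block. The text returned by
# get_text() is unescaped, so escape it before it is parsed again as HTML.
for element in soup.find_all("table", attrs={"class": "highlighttable"}):
    element.replace_with(Soup(f"<pre>{html.escape(element.find('code').get_text())}</pre>", "html.parser"))  # type: ignore[union-attr]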

print(mdformat.text(_converter.convert_soup(soup), options={"wrap": "no"}))

@pawamoy
Owner

pawamoy commented Apr 8, 2025

Fantastic, thanks!

@pawamoy
Owner

pawamoy commented Apr 8, 2025

Cool case of literate bug 😄

@pawamoy pawamoy merged commit 1f0e417 into pawamoy:main Apr 8, 2025
25 checks passed
@Viicos
Contributor Author

Viicos commented Apr 8, 2025

> Cool case of literate bug 😄

That was quite meta indeed, thanks for merging and the reviews

@dreamorosi

Thank you both for working on this! When will it be released?

@Viicos
Contributor Author

Viicos commented Apr 8, 2025

@dreamorosi see https://github.com/pawamoy/mkdocs-llmstxt/releases/tag/0.2.0.

@Viicos Viicos deleted the issue-1 branch August 7, 2025 15:23
Development

Successfully merging this pull request may close these issues.

feature: Auto-generate llms.txt

4 participants