-
-
Notifications
You must be signed in to change notification settings - Fork 9
Refactor generation of llms.txt
#4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thank you for this PR, it will be very useful for many projects. Quoting https://llmstxt.org: "We furthermore propose that pages on websites that have information that might be useful for LLMs to read provide a clean markdown version of those pages at the same URL as the original page, but with .md appended. (URLs without file names should append index.html.md instead.)" Will something like this also likely be a feature at some point? I understand it might not be completely relevant to this PR. |
|
@Viicos thanks a lot for the PR! Reviewing now. @aristideubertas I believe that's what this PR does: it creates a |
pawamoy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks super good thanks! I only have a few chore/nit/suggestion comments 🙂
|
One thing that would be great to have as well is a way to disable the plugin in development (i.e. when running |
|
You could use an env var: https://www.mkdocs.org/user-guide/configuration/#enabled-option Or we could implement the |
Ah great, wasn't aware of this new MkDocs feature, it perfectly fulfills the use case. |
|
One thing that we also discussed internally at Pydantic: As we mentioned in #1 (comment), we were asking ourselves about The spec mentions (and was recently updated about this) the FastHTML project as an example, and states:
I'm not sure if this XML-based structure is standardized. They mention their It seems like a lot of confusion arose from this given example. llms.txt hubs such as https://llmstxthub.com mention both With @samuelcolvin, we think that having a A callable that would take two arguments: the What do you think? Footnotes
|
Right, that must be why I got confused too. I'll drop them a message on their Discord server (fastai's server, there's a llmstxt channel).
I share that sentiment. Context windows are likely to grow in the future so even a big file could be loaded by future models.
...and? 😄 How would the return value be used? Do you mean the plugin would execute this hook, get the return value, and... generate the |
|
OK my confusion likely comes from https://llmstxt.site/ and https://directory.llmstxt.cloud/ rather than https://llmstxthub.com/. |
We can also go this way, but it will make the plugin "opinionated" in some way as this |
Understood. Well, I'd probably prefer being opinionated anyway (as well as users I'm sure, who generally prefer declarative stuff rather than having to hook into Python with custom scripts), at least until we get a clear answer. I posted this message on fastai's discord server: https://discord.com/channels/689892369998676007/1279960087221239808/1352675464870625423. |
|
Answer from Jeremy Howard (@jph00, feel free to unsubscribe, and thanks!):
|
|
Got it 👍 I'll push a commit enabling support for the full output (with a boolean configuration to enable/disable it). |
|
Nice! PR starts to look complete to me 🙂 |
|
The current llms.txt output for the plugin looks wrong, I need to investigate |
I suppose you mean the |
|
Done! |
Ah great thanks, I just rebased an can confirm it works as expected. I think this should be ready now. |
|
Thank you so much for your work on this @Viicos! I've pushed a few changes (nitpicks). Locally mypy is bothering me with this: ...yet the code seems to work fine, so not sure why we get this warning. EDIT: yep code runs fine, just not on Python below 3.12 😅 |
|
I noticed something at the end of the full file too: |
|
Ah right, it's because the source code contains HTML, which is not escaped. Not sure how to fix this 🤔 Could be a bug in Markdownify. |
|
Yeah probably looks like a bug, it's a bit weird as I can't manage to produce a MRE. |
|
Could be mdformat too 🤔 But seems less likely. |
|
Or maybe our own code: # Remove line numbers from code blocks.
for element in soup.find_all("table", attrs={"class": "highlighttable"}):
element.replace_with(Soup(f"<pre>{element.find('code').get_text()}</pre>", "html.parser")) # type: ignore[union-attr](would be funny that the code that triggered the issue is also the code that made us aware of it) But this is operating on the soup, so I don't see why this would cause escaping issues. |
|
Ah, wait, yes, this might be it. We replace the table with a |
This is actually it, I'm looking into it. |
|
Fixed and confirmed that it works with: from itertools import chain
import html
from bs4 import BeautifulSoup as Soup, Tag
from markdownify import ATX, MarkdownConverter
import mdformat
def _language_callback(tag: Tag) -> str:
for css_class in chain(tag.get("class") or (), (tag.parent.get("class") or ()) if tag.parent else ()):
if css_class.startswith("language-"):
return css_class[9:]
return ""
_converter = MarkdownConverter(
bullets="-",
code_language_callback=_language_callback,
escape_underscores=False,
heading_style=ATX,
)
h = '''
<div class="language-python highlight"><table class="highlighttable"><tbody><tr><td class="linenos"><div class="linenodiv"><pre><span></span><span class="normal">1</span></pre></div></td><td class="code"><div><pre id="__code_12"><span></span><button class="md-clipboard md-icon" title="Copy to clipboard" data-clipboard-target="#__code_12 > code"></button><code tabindex="0"><span class="n">element</span><span class="o">.</span><span class="n">replace_with</span><span class="p">(</span><span class="n">Soup</span><span class="p">(</span><span class="sa">f</span><span class="s2">"<pre></span><span class="si">{</span><span class="n">element</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s1">'code'</span><span class="p">)</span><span class="o">.</span><span class="n">get_text</span><span class="p">()</span><span class="si">}</span><span class="s2"></pre>"</span><span class="p">,</span> <span class="s2">"html.parser"</span><span class="p">))</span> <span class="c1"># type: ignore[union-attr]</span>
</code></pre></div></td></tr></tbody></table></div>
'''
soup = Soup(h, "html.parser")
for element in soup.find_all("table", attrs={"class": "highlighttable"}):
element.replace_with(Soup(f"<pre>{html.escape(element.find('code').get_text())}</pre>", "html.parser")) # type: ignore[union-attr]
print(mdformat.text(_converter.convert_soup(soup), options={"wrap": "no"})) |
|
Fantastic, thanks! |
|
Cool case of literate bug 😄 |
That was quite meta indeed, thanks for merging and the reviews |
|
Thank you both for working on this! When will it be released? |
Fixes #1.
Missing doc/tests updates, but opening to get early feedback on this one.