feat: integrate llmstxt.org standard into FetchWebPageTool #2075
base: main
|
@microsoft-github-policy-service agree |
Pull Request Overview
This PR enhances the FetchWebPageTool to automatically detect and fetch llms.txt and llms-full.txt files from the root of provided URLs, following the llmstxt.org standard for LLM-optimized content. The implementation fetches these candidate URLs in parallel with the original URLs and processes them with special handling for empty content.
Key Changes
- Added automatic detection of llms.txt standard files at domain roots
- Implemented parallel fetching of original and candidate URLs for better performance
- Refactored content processing into a reusable function with separate handling for candidates vs. original URLs
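The flow described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: `llmsCandidates`, `fetchWithCandidates`, `FetchText`, and `FetchResult` are hypothetical names, and the injected `fetchText` function stands in for whatever fetching service the tool really uses. It assumes that candidates which fail or return empty content are dropped, while original URLs are always reported.

```typescript
// Hypothetical types for this sketch; none of these names come from the PR.
type FetchText = (url: string) => Promise<string>;

interface FetchResult {
  url: string;
  isCandidate: boolean;
  content: string;
}

// Derive the llms.txt / llms-full.txt candidates at the root of a URL's origin.
function llmsCandidates(url: string): string[] {
  const origin = new URL(url).origin;
  return [`${origin}/llms.txt`, `${origin}/llms-full.txt`];
}

// Fetch the original URLs and their candidates in parallel.
async function fetchWithCandidates(
  urls: string[],
  fetchText: FetchText
): Promise<FetchResult[]> {
  // Collect unique candidates, skipping any the caller already asked for.
  const seen = new Set<string>(urls);
  const candidates: string[] = [];
  for (const url of urls) {
    for (const candidate of llmsCandidates(url)) {
      if (!seen.has(candidate)) {
        seen.add(candidate);
        candidates.push(candidate);
      }
    }
  }

  const targets = [
    ...urls.map((url) => ({ url, isCandidate: false })),
    ...candidates.map((url) => ({ url, isCandidate: true })),
  ];

  // A failed fetch is treated as empty content so one bad URL cannot reject the batch.
  const contents = await Promise.all(
    targets.map((t) => fetchText(t.url).then((c) => c.trim(), () => ""))
  );

  const results: FetchResult[] = [];
  targets.forEach((target, i) => {
    // Assumption: empty or missing candidates are dropped; originals are always kept.
    if (target.isCandidate && contents[i] === "") {
      return;
    }
    results.push({ url: target.url, isCandidate: target.isCandidate, content: contents[i] });
  });
  return results;
}
```

Injecting `fetchText` keeps the sketch testable without network access. Whether candidate content should be added to the page content or replace it is the open question in the discussion below.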
|
@copilot open a new pull request to apply changes based on the comments in this thread |
|
I'd rather something like this be handled in the Core repo, but I think we need to be careful because we don't want to load content from the web that the user did not explicitly allow. We should require confirmation to load llms.txt/llms-full.txt instead of the actual content of the page (plus a way to "always do this"). |
|
I think if the user permits us to fetch from a webpage, it would be safe to request |
|
Fair, we could fetch both from a trust perspective, but I think the user should still decide which content they want included, considering the content may be different. If they have bad luck with |
|
Yeah, agreed, I'm not sure how to reconcile those. I found this random directory of sites that use llms.txt, https://directory.llmstxt.cloud/, which includes Cloudflare and X. Maybe we try a few of those and see how it feels with various approaches. |
This pull request enhances the FetchWebPageTool by adding logic to automatically detect and fetch special llms.txt and llms-full.txt files from the root of provided URLs, in addition to the original URLs. It also improves error handling and logging for candidate URL processing.