@dmux dmux commented Nov 19, 2025

This pull request enhances the FetchWebPageTool by adding logic to automatically detect and fetch special llms.txt and llms-full.txt files from the root of provided URLs, in addition to the original URLs. It also improves error handling and logging for candidate URL processing.

Copilot AI review requested due to automatic review settings November 19, 2025 02:00
Copilot finished reviewing on behalf of dmux November 19, 2025 02:02
@dmux
Author

dmux commented Nov 19, 2025

@microsoft-github-policy-service agree

Contributor

Copilot AI left a comment

Pull Request Overview

This PR enhances the FetchWebPageTool to automatically detect and fetch llms.txt and llms-full.txt files from the root of provided URLs, following the llmstxt.org standard for LLM-optimized content. The implementation fetches these candidate URLs in parallel with the original URLs and processes them with special handling for empty content.

Key Changes

  • Added automatic detection of llms.txt standard files at domain roots
  • Implemented parallel fetching of original and candidate URLs for better performance
  • Refactored content processing into a reusable function with separate handling for candidates vs. original URLs
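The changes listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the names `candidateUrls`, `fetchWithCandidates`, `Fetcher`, and `FetchResult` are hypothetical, the fetcher is injected so the sketch runs without network access, and silently dropping failed or empty candidates is an assumption about what "special handling" might look like.

```typescript
// Hypothetical sketch; none of these names come from the PR itself.
const LLMS_FILES = ["llms.txt", "llms-full.txt"];

type Fetcher = (url: string) => Promise<string>;

interface FetchResult {
  url: string;
  isCandidate: boolean;
  content?: string;
  error?: string;
}

type Outcome =
  | { status: "ok"; content: string }
  | { status: "failed"; error: string };

// Derive <origin>/llms.txt and <origin>/llms-full.txt for each http(s) input,
// skipping duplicates and anything the caller already asked for.
function candidateUrls(inputs: string[]): string[] {
  const seen = new Set<string>(inputs);
  const out: string[] = [];
  for (const input of inputs) {
    let parsed: URL;
    try {
      parsed = new URL(input);
    } catch {
      continue; // not an absolute URL, nothing to derive
    }
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
      continue;
    }
    for (const name of LLMS_FILES) {
      const candidate = `${parsed.origin}/${name}`;
      if (!seen.has(candidate)) {
        seen.add(candidate);
        out.push(candidate);
      }
    }
  }
  return out;
}

// Fetch originals and candidates in parallel. Each promise is converted to an
// Outcome so that one failure cannot reject the whole batch.
async function fetchWithCandidates(
  inputs: string[],
  fetcher: Fetcher
): Promise<FetchResult[]> {
  const targets = [
    ...inputs.map((url) => ({ url, isCandidate: false })),
    ...candidateUrls(inputs).map((url) => ({ url, isCandidate: true })),
  ];
  const outcomes = await Promise.all(
    targets.map((t) =>
      fetcher(t.url).then(
        (content): Outcome => ({ status: "ok", content }),
        (err): Outcome => ({ status: "failed", error: String(err) })
      )
    )
  );
  const results: FetchResult[] = [];
  outcomes.forEach((outcome, i) => {
    const target = targets[i];
    if (outcome.status === "ok") {
      // Assumption: an empty candidate file is treated as "not present".
      if (target.isCandidate && outcome.content.trim() === "") {
        return;
      }
      results.push({ ...target, content: outcome.content });
    } else if (!target.isCandidate) {
      // Only failures for URLs the user asked for are surfaced.
      results.push({ ...target, error: outcome.error });
    }
  });
  return results;
}
```

In this sketch a missing llms.txt never turns into a user-visible error, since the user did not ask for it; whether the real tool should log or report such failures is a separate design choice.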

@dmux
Author

dmux commented Nov 19, 2025

@copilot open a new pull request to apply changes based on the comments in this thread

@TylerLeonhardt
Member

I'd rather something like this be handled in the Core repo:
https://github.com/microsoft/vscode/blob/8834c469c06b2e589979e815512cdf4c59918182/src/vs/workbench/contrib/chat/electron-browser/tools/fetchPageTool.ts

but I think we need to be careful because we don't want to load content from the web that the user did not explicitly allow. We should require confirmation to load llms.txt/llms-full.txt instead of the actual content of the page (+ a way to "always do this").

@connor4312
Member

I think if the user permits us to fetch from a webpage, it would be safe to request <site>/llms.txt eagerly (it is specced to only appear on the root path). I can't think of a case where a user would trust a subpath of a domain but not the root path--subpaths can be controlled by different entities, like GitHub repos, where the root path is always going to be the primary 'most trusted' entity. But I am also not sure how that should be represented in the tool calls.
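To make the trust argument concrete, here is a rough sketch of a check that a candidate is a root-level llms file on the same origin as a URL the user already approved. The function name `isRootLlmsFileFor` is hypothetical, and comparing by origin (scheme, host, and port) is an assumption about what "same site" should mean here.

```typescript
// Hypothetical helper: true only when candidateUrl is /llms.txt or
// /llms-full.txt at the root of the approved URL's origin.
function isRootLlmsFileFor(approvedUrl: string, candidateUrl: string): boolean {
  let approved: URL;
  let candidate: URL;
  try {
    approved = new URL(approvedUrl);
    candidate = new URL(candidateUrl);
  } catch {
    return false; // either string is not an absolute URL
  }
  const rootFiles = ["/llms.txt", "/llms-full.txt"];
  return (
    approved.origin === candidate.origin &&
    rootFiles.indexOf(candidate.pathname) !== -1
  );
}
```

Under this check, approving `https://github.com/microsoft/vscode` would cover `https://github.com/llms.txt` but not `https://github.com/microsoft/llms.txt`, which matches the point that subpaths may belong to a different entity than the root.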

@TylerLeonhardt
Member

TylerLeonhardt commented Nov 21, 2025

Fair, we could fetch both from a trust perspective, but I think the user should still decide which content goes in, since the two may differ.

If they have bad luck with llms.txt, and want to be specific and grab a route, doing so should be possible.
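One way to picture the options raised so far (confirmation, an "always do this" choice, and falling back to the specific page) is a small preference resolver. This is only a sketch of the idea: the `LlmsPreference` values and the `chooseSource` function are hypothetical and do not correspond to any existing setting.

```typescript
// Hypothetical preference values; no such setting exists in the tool today.
type LlmsPreference = "ask" | "prefer-llms" | "page-only";

interface SourceDecision {
  source: "page" | "llms";
  needsConfirmation: boolean;
}

// Decide which content to hand to the model for one requested URL.
function chooseSource(
  preference: LlmsPreference,
  llmsAvailable: boolean
): SourceDecision {
  if (!llmsAvailable || preference === "page-only") {
    // Nothing to choose between, or the user wants the exact route.
    return { source: "page", needsConfirmation: false };
  }
  if (preference === "prefer-llms") {
    // The "always do this" case: use llms.txt without prompting.
    return { source: "llms", needsConfirmation: false };
  }
  // "ask": default to the page the user named, but prompt about llms.txt.
  return { source: "page", needsConfirmation: true };
}
```

With `"page-only"`, a user who has had bad luck with a site's llms.txt still gets exactly the route they named.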

@connor4312
Member

Yeah, agreed, I'm not sure how to reconcile those. I found this random directory of sites that use llms.txt, https://directory.llmstxt.cloud/, which includes Cloudflare and X. Maybe we try a few of those and see how it feels with various approaches.


4 participants