feat: integrate llmstxt.org standard into FetchWebPageTool #2075

dmux · 2025-11-19T02:00:06Z

This pull request enhances the FetchWebPageTool by adding logic to automatically detect and fetch special llms.txt and llms-full.txt files from the root of provided URLs, in addition to the original URLs. It also improves error handling and logging for candidate URL processing.
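
As a rough illustration of the detection step described above, the sketch below derives the root-level `llms.txt` and `llms-full.txt` URLs from a list of input URLs. It is not the code from this PR: the helper name `llmsCandidateUrls`, the deduplication, and the decision to skip non-HTTP(S) inputs are all assumptions made for the example.

```typescript
// Hypothetical helper, not taken from the PR: builds candidate URLs at the
// root of each input URL's origin.
const LLMS_FILES = ["llms.txt", "llms-full.txt"] as const;

function llmsCandidateUrls(inputUrls: readonly string[]): string[] {
  const originals = new Set<string>(inputUrls);
  const candidates = new Set<string>();

  for (const raw of inputUrls) {
    let parsed: URL;
    try {
      parsed = new URL(raw);
    } catch {
      continue; // not an absolute URL, so there is no root to probe
    }
    // Assumption: only web URLs are worth probing for llms.txt files.
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
      continue;
    }
    for (const file of LLMS_FILES) {
      // `origin` is scheme + host + port, so the file always lands at the root.
      const candidate = `${parsed.origin}/${file}`;
      // Do not list a file the caller already asked for explicitly.
      if (!originals.has(candidate)) {
        candidates.add(candidate);
      }
    }
  }
  return Array.from(candidates);
}
```

Two pages on the same origin yield a single pair of candidates, so the extra requests grow with the number of distinct origins rather than with the number of URLs.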

dmux · 2025-11-19T02:03:33Z

@microsoft-github-policy-service agree

Copilot

Pull Request Overview

This PR enhances the FetchWebPageTool to automatically detect and fetch llms.txt and llms-full.txt files from the root of provided URLs, following the llmstxt.org standard for LLM-optimized content. The implementation fetches these candidate URLs in parallel with the original URLs and processes them with special handling for empty content.

Key Changes

Added automatic detection of llms.txt standard files at domain roots
Implemented parallel fetching of original and candidate URLs for better performance
Refactored content processing into a reusable function with separate handling for candidates vs. original URLs
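
The changes listed above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `Fetcher`, `FetchedPage`, and `fetchAllInParallel` are hypothetical names, the fetch function is injected so the sketch stays self-contained, and dropping failed or empty candidates without reporting them is one possible policy among several.

```typescript
// Hypothetical types and function; none of these names come from the PR.
type Fetcher = (url: string) => Promise<string>;

interface FetchedPage {
  url: string;
  isCandidate: boolean;
  content?: string;
  error?: string;
}

async function fetchAllInParallel(
  originalUrls: readonly string[],
  candidateUrls: readonly string[],
  fetchText: Fetcher,
): Promise<FetchedPage[]> {
  const jobs = [
    ...originalUrls.map((url) => ({ url, isCandidate: false })),
    ...candidateUrls.map((url) => ({ url, isCandidate: true })),
  ];

  // allSettled starts every request at once and never rejects, so a missing
  // llms.txt cannot fail the fetch of the page the user actually asked for.
  const settled = await Promise.allSettled(jobs.map((job) => fetchText(job.url)));

  const pages: FetchedPage[] = [];
  settled.forEach((result, i) => {
    const job = jobs[i];
    if (result.status === "fulfilled") {
      const content = result.value.trim();
      // Assumption: an empty candidate carries no information, so skip it.
      if (job.isCandidate && content.length === 0) {
        return;
      }
      pages.push({ url: job.url, isCandidate: job.isCandidate, content });
    } else if (!job.isCandidate) {
      // Failures on original URLs are reported back to the caller.
      pages.push({ url: job.url, isCandidate: false, error: String(result.reason) });
    }
    // Failed candidates are dropped here; a real tool might log them instead.
  });
  return pages;
}
```

Keeping the `isCandidate` flag on each result leaves room for the caller to present llms.txt content separately from the page content.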

src/extension/tools/vscode-node/fetchWebPageTool.tsx

dmux · 2025-11-19T09:31:24Z

@copilot open a new pull request to apply changes based on the comments in this thread

TylerLeonhardt · 2025-11-21T05:28:18Z

I'd rather something like this be handled in the Core repo:
https://github.com/microsoft/vscode/blob/8834c469c06b2e589979e815512cdf4c59918182/src/vs/workbench/contrib/chat/electron-browser/tools/fetchPageTool.ts

but I think we need to be careful because we don't want to load content from the web that the user did not explicitly allow. We should require confirmation to load llms.txt/llms-full.txt instead of the actual content of the page (+a way to "always do this")

connor4312 · 2025-11-21T19:45:20Z

I think if the user permits us to fetch from a webpage, it would be safe to request <site>/llms.txt eagerly (it is specced to only appear on the root path). I can't think of a case where a user would trust a subpath of a domain but not the root path--subpaths can be controlled by different entities, like GitHub repos, where the root path is always going to be the primary 'most trusted' entity. But I also am not sure how that should get represented in the tool calls.

TylerLeonhardt · 2025-11-21T19:48:45Z

Fair, we could fetch both from a trust perspective, but I think the user should still decide which one they want to use, considering the content may be different.

If they have bad luck with llms.txt, and want to be specific and grab a route, doing so should be possible.

connor4312 · 2025-11-21T23:17:48Z

Yea, agreed, I'm not sure how to reconcile those. I found this random directory of sites that use llms.txt, https://directory.llmstxt.cloud/, which includes Cloudflare and X. Maybe we try a few of those and see how it feels with various approaches.

feat: integrate llmstxt.org standard into FetchWebPageTool

4cf8726

Copilot AI review requested due to automatic review settings November 19, 2025 02:00

Copilot started reviewing on behalf of dmux November 19, 2025 02:00

vs-code-engineering bot assigned dbaeumer Nov 19, 2025

vs-code-engineering bot added the triage-needed label Nov 19, 2025

Copilot finished reviewing on behalf of dmux November 19, 2025 02:02

Copilot AI reviewed Nov 19, 2025


dmux added 2 commits November 19, 2025 08:08

fix: update processContent parameter types for improved type safety

0954f1a

feat: enhance candidate URL generation logic for llms.txt support

4c9da1a

dbaeumer assigned connor4312 and unassigned dbaeumer Nov 19, 2025

dbaeumer removed the triage-needed label Nov 19, 2025

connor4312 assigned TylerLeonhardt and unassigned connor4312 Nov 19, 2025
