[WEB-4447] Add MDX to Markdown transpilation with content negotiation #3000
de0ba4d to
f227aa4
Compare
m-hulbert
left a comment
This LGTM (under the assumption that links contain ably-dev because it's a review app). There are a few really minor changes we could address, but I think those are more source-based than related to the .md generation. 🚀
@m-hulbert thanks! let me get the code ready for team review then :)
Pull request overview
This PR implements automatic MDX-to-Markdown transpilation during the Gatsby build process and adds nginx content negotiation to serve markdown documentation to LLMs and text-based tools while maintaining HTML for browsers.
Key Changes:
- New transpilation module that converts MDX files to markdown format with content transformations (frontmatter removal, import/export stripping, template variable replacement, etc.)
- Nginx content negotiation using Accept header mapping to serve .md or .html files based on client preferences
- Comprehensive test suite with 16 test cases for content negotiation behavior
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| data/onPostBuild/transpileMdxToMarkdown.ts | Core transpilation logic with transformation functions for MDX-to-Markdown conversion |
| data/onPostBuild/index.ts | Integrates transpilation into build pipeline |
| config/nginx.conf.erb | Implements content negotiation via map blocks and updates try_files directives |
| config/mime.types | Adds text/markdown MIME type support |
| bin/assert-content-negotiation.sh | Test suite for verifying content negotiation behavior |
| .circleci/config.yml | Adds content negotiation tests to CI pipeline |
Going through the code and preview app.
Give me some time to review this and test against LLM.
Meanwhile you can check @copilot review comments.
f227aa4 to
f251b2b
Compare
@sacOO7 thanks, passed the comments back to Claudius for the appropriate treatment, and I also added some unit tests.
Pull request overview
Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.
sacOO7
left a comment
The PR assumes that LLM crawlers and web-fetch tools request content with one of the following Accept header values: text/markdown, application/markdown, or text/plain. If a crawler's default configuration sends a different Accept header, it will receive HTML instead.
4a9f682 to
d56f83e
Compare
    // Report summary
    if (failureCount > 0) {
      reporter.warn(
Suggested change: reporter.warn( → reporter.panicOnBuild(
If any part of the Transpilation to Markdown step fails, it should throw an error and the build should fail, right?
Otherwise, we risk introducing inconsistencies in the Markdown files that will be served to the LLMs.
At the moment, while we solidify this, I think having graceful degradation and falling back to an HTML response is better.
Sounds good 👍
You can create a separate issue to track/handle failing markdowns so they can be addressed in the near future.
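To make the trade-off in this thread concrete, here is a minimal sketch of the two reporting modes. The `Reporter` interface only mirrors the two Gatsby reporter methods named above, and `reportSummary` and its `strict` flag are hypothetical names for illustration, not code from this PR.

```typescript
// Minimal stand-in for the two Gatsby reporter methods discussed in this thread.
interface Reporter {
  warn(message: string): void;
  panicOnBuild(message: string): void;
}

// Hypothetical helper: `strict` selects between failing the build and degrading gracefully.
function reportSummary(reporter: Reporter, failureCount: number, strict: boolean): void {
  if (failureCount === 0) return;
  const message = `${failureCount} file(s) failed MDX to Markdown transpilation`;
  if (strict) {
    // Fail the build so inconsistent markdown is never published.
    reporter.panicOnBuild(message);
  } else {
    // Keep the build green; pages without a .md file fall back to HTML.
    reporter.warn(message);
  }
}
```

With `strict` set to false this matches the graceful-degradation behavior agreed on here; flipping it later would be one way to act on the follow-up issue.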
Did some research using Opus 4.5.
Based on Cloudflare's July 2025 data, the AI crawler market share breaks down as follows:
| Crawler | Share (July 2025) | In Your Config? |
|---|---|---|
| GPTBot | 11.7% | ✅ |
| ClaudeBot | ~10% | ✅ |
| Meta-ExternalAgent | 7.5% | ❌ Missing |
| Amazonbot | 5.9% | ❌ |
| Bytespider | 2.4% | ❌ |
| PerplexityBot | ~3% | ✅ |
| Google-Extended | ~5% | ✅ |
The current config covers roughly 30% of AI crawler traffic.
Adding Meta-ExternalAgent, Amazonbot, and Bytespider should raise coverage to ~45% of AI crawler traffic.
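The coverage figures can be checked with a few lines of arithmetic. This sketch treats the approximate shares in the table (~10%, ~3%, ~5%) as exact values, which is an assumption; the names `shares`, `covered`, and `proposed` are illustrative only.

```typescript
// Shares copied from the table above; approximate entries are treated as exact.
const shares: Record<string, number> = {
  GPTBot: 11.7,
  ClaudeBot: 10,
  "Meta-ExternalAgent": 7.5,
  Amazonbot: 5.9,
  Bytespider: 2.4,
  PerplexityBot: 3,
  "Google-Extended": 5,
};

// Sum the share of every crawler in the given list.
function coverage(names: string[]): number {
  return names.reduce((total, name) => total + (shares[name] ?? 0), 0);
}

const covered = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"];
const proposed = [...covered, "Meta-ExternalAgent", "Amazonbot", "Bytespider"];

console.log(coverage(covered).toFixed(1)); // current config
console.log(coverage(proposed).toFixed(1)); // with the three suggested additions
```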
sacOO7
left a comment
Apart from above comments, looks good 👍
d56f83e to
7515c07
Compare
7515c07 to
17d0867
Compare
I spent a bit of time trimming some fat and statistics from the commit messages as well |
sacOO7
left a comment
LGTM
Implements automatic transpilation of MDX documentation files to
Markdown format during the Gatsby build process. The transpiled Markdown
files are generated in the public/docs/ directory alongside HTML output,
making documentation accessible to LLMs and other text-based tools.
Key features:
- Transpiles all MDX files under src/pages/docs/ to .md format
- Removes frontmatter except title (converted to # heading)
- Removes import/export statements and script tags
- Replaces template variables ({{API_KEY}}, {{RANDOM_CHANNEL_NAME}})
- Preserves JSX components and code blocks as-is
- Smart path mapping: index.mdx → parent.md, file.mdx → file.md
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
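As a rough illustration of the features listed in this commit message, the sketch below chains a few of the transformations and the path mapping. All function names are hypothetical and the regexes are simplified assumptions (for example, only single-line import/export statements are handled); the real logic lives in data/onPostBuild/transpileMdxToMarkdown.ts and may differ.

```typescript
const FENCE = "`".repeat(3); // code fence marker, built this way to keep the example readable

// Replace the frontmatter block with a "# Title" heading; other fields are dropped.
function convertFrontmatter(source: string): string {
  const match = source.match(/^---\r?\n([\s\S]*?)\r?\n---\r?\n?/);
  if (!match) return source;
  const body = source.slice(match[0].length).replace(/^\n+/, "");
  const titleLine = match[1].split(/\r?\n/).find((line) => line.startsWith("title:"));
  if (!titleLine) return body;
  const title = titleLine.slice("title:".length).trim().replace(/^["']|["']$/g, "");
  return `# ${title}\n\n${body}`;
}

// Drop single-line import/export statements, leaving fenced code blocks untouched.
function stripImportsAndExports(source: string): string {
  let inFence = false;
  const kept: string[] = [];
  for (const line of source.split("\n")) {
    if (line.trimStart().startsWith(FENCE)) inFence = !inFence;
    const isModuleLine = !inFence && /^(import|export)\s/.test(line);
    if (!isModuleLine) kept.push(line);
  }
  return kept.join("\n");
}

// Placeholder values taken from the PR description; unknown variables are left as-is.
const TEMPLATE_VALUES: Record<string, string> = {
  API_KEY: "your-api-key",
  RANDOM_CHANNEL_NAME: "your-channel-name",
};

function replaceTemplateVariables(source: string): string {
  return source.replace(/\{\{(\w+)\}\}/g, (whole, name: string) => TEMPLATE_VALUES[name] ?? whole);
}

// index.mdx maps to parent.md, file.mdx maps to file.md.
function toMarkdownPath(relativeMdxPath: string): string {
  const withoutExt = relativeMdxPath.replace(/\.mdx$/, "");
  return withoutExt.endsWith("/index")
    ? `${withoutExt.slice(0, -"/index".length)}.md`
    : `${withoutExt}.md`;
}

// Run the stages in order, then collapse the blank lines left behind by removed statements.
function transpile(source: string): string {
  const stages = [convertFrontmatter, stripImportsAndExports, replaceTemplateVariables];
  const result = stages.reduce((text, stage) => stage(text), source);
  return result.replace(/\n{3,}/g, "\n\n");
}
```

Keeping each stage as a pure string-to-string function makes it straightforward to unit test them individually and to add further stages later.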
Implements HTTP content negotiation in nginx to serve markdown versions
of documentation pages based on the Accept header. This allows clients
(like LLMs and text-based tools) to request markdown format while
browsers continue to receive HTML by default.
Key features:
- Serves markdown for Accept: text/markdown, application/markdown, or
text/plain
- Maintains backward compatibility (HTML is default)
- Works with existing authentication system
- Supports both index and non-index file paths
- No performance impact (uses nginx map blocks)
Content negotiation behavior:
- /docs/channels with Accept: text/markdown → serves docs/channels.md
- /docs/channels with Accept: text/html → serves docs/channels/index.html
- /docs/channels (browser default) → serves docs/channels/index.html
- /docs/channels.md (direct access) → serves docs/channels.md
Implementation:
- Added text/markdown MIME type to config/mime.types
- Added text/markdown to gzip_types for compression
- Created map blocks to detect Accept header preferences
- Updated location blocks to use content-negotiated file paths
- Fallback to HTML when markdown doesn't exist
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
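The path behavior described in this commit message can be modeled as an ordered candidate list, similar in spirit to try_files. This is a sketch only: `candidateFiles` and `resolveFile` are hypothetical names, and the real resolution is done by nginx, not by application code.

```typescript
// Build the ordered list of files to try for a request path.
function candidateFiles(requestPath: string, wantsMarkdown: boolean): string[] {
  const trimmed = requestPath.replace(/^\/+|\/+$/g, ""); // "/docs/channels/" becomes "docs/channels"
  if (/\.(md|html)$/.test(trimmed)) return [trimmed]; // direct file access
  const html = `${trimmed}/index.html`;
  // Markdown first when requested, with HTML as the fallback.
  return wantsMarkdown ? [`${trimmed}.md`, html] : [html];
}

// Pick the first candidate that exists; `exists` is injected so the sketch needs no filesystem.
function resolveFile(candidates: string[], exists: (file: string) => boolean): string | undefined {
  return candidates.find(exists);
}
```

For example, `candidateFiles("/docs/channels", true)` lists docs/channels.md ahead of docs/channels/index.html, so a page whose markdown was not generated still resolves to HTML.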
Adds comprehensive CI test suite to verify content negotiation works
correctly for all Accept header scenarios. The test suite validates that
nginx serves markdown or HTML based on the Accept header, with proper
fallback behavior.
Test coverage:
- Basic content negotiation: text/markdown, application/markdown,
text/plain, text/html, */*
- Browser behavior: Complex Accept headers, HTML priority when listed
first
- Direct access: .md and .html file access
- Path variations: Index paths, non-index paths, nested paths
- Edge cases: 404 handling, fallback behavior, non-docs paths
Also fixes nginx map priority order to ensure anchored patterns
(^text/html, ^text/plain) are evaluated before wildcard patterns. This
ensures "text/html, text/markdown" correctly serves HTML instead of
markdown.
Changes:
- Created bin/assert-content-negotiation.sh with run_test() helper
function
- Integrated test into CircleCI test-nginx job
- Reordered nginx map patterns for correct priority matching
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
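The priority fix described in this commit message relies on the first matching pattern winning. The sketch below models that with an ordered rule list; the exact patterns in config/nginx.conf.erb are not shown in this conversation, so the rules here are assumptions based on the message.

```typescript
type AcceptRule = { pattern: RegExp; wantsMarkdown: boolean };

// Order matters: anchored rules come before the unanchored markdown rules.
const ACCEPT_RULES: AcceptRule[] = [
  { pattern: /^text\/html/, wantsMarkdown: false },
  { pattern: /^text\/plain/, wantsMarkdown: true },
  { pattern: /text\/markdown/, wantsMarkdown: true },
  { pattern: /application\/markdown/, wantsMarkdown: true },
];

// First matching rule wins; no match means the HTML default.
function wantsMarkdownViaAccept(accept: string | undefined): boolean {
  const rule = ACCEPT_RULES.find(({ pattern }) => pattern.test(accept ?? ""));
  return rule ? rule.wantsMarkdown : false;
}
```

If the unanchored text/markdown rule were listed first, an Accept header of "text/html, text/markdown" would match it and resolve to markdown, which appears to be the ordering problem this commit addresses.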
Enhances content negotiation to serve markdown to LLM bots based on
User-Agent strings, in addition to Accept header detection. This ensures
bots like Claude, ChatGPT, Perplexity, and Google AI get markdown even
if they don't send proper Accept headers.
Bot detection:
- Detects several LLM bot User-Agents (Claude, ChatGPT, Perplexity,
Google AI)
- Conservative list - no generic HTTP libraries to avoid false positives
- Combines with existing Accept header logic using nginx map variables
- Serves markdown if EITHER bot detected OR Accept header requests
markdown
Implementation:
- Added $is_llm_bot map for User-Agent pattern matching
- Updated $docs_file_extension map to combine bot + Accept header
detection
- Uses map variable concatenation:
"${is_llm_bot}${wants_markdown_via_accept}"
- Works seamlessly with existing try_files logic
Testing:
- Added new tests for bot User-Agent detection
- Tests bot override behavior (bot gets markdown even with Accept:
text/html)
- Verified browsers still get HTML by default
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
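The concatenation approach in this commit message can be sketched as two flags joined into a lookup key. The User-Agent pattern below uses bot names that appear elsewhere in this conversation, but the actual pattern list and the values of $docs_file_extension are assumptions.

```typescript
type Flag = "0" | "1";

// Assumed pattern list; the real $is_llm_bot map may match different strings.
const LLM_BOT_PATTERN = /(ClaudeBot|Claude-User|GPTBot|PerplexityBot|Google-Extended)/i;

function isLlmBot(userAgent: string | undefined): Flag {
  return LLM_BOT_PATTERN.test(userAgent ?? "") ? "1" : "0";
}

// Mirrors "${is_llm_bot}${wants_markdown_via_accept}": only "00" keeps the HTML default.
function docsFileExtension(bot: Flag, viaAccept: Flag): ".md" | ".html" {
  const key = `${bot}${viaAccept}`;
  return key === "00" ? ".html" : ".md";
}
```

Because either flag alone is enough, a detected bot gets markdown even when it sends Accept: text/html, which is the override behavior the tests in this commit check.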
17d0867 to
40b46d7
Compare
matt423
left a comment
Admittedly tested and reviewed more on a surface level, but it works as expected in the browser and with a direct .md request, and Copilot got markdown by default too.
Summary
Implements automatic MDX-to-Markdown transpilation during the Gatsby build process and adds nginx content negotiation to serve markdown documentation to LLMs and text-based tools while maintaining HTML for browsers.
Related to ticket: WEB-4447
What's Changed
1. MDX to Markdown Transpilation (9-stage pipeline)
- Transpiles MDX files from src/pages/docs/ to .md format during build (~5 seconds)
- Runs in the onPostBuild hook after Gatsby page generation
- Converts the frontmatter title to a # Title heading
- Removes <a id="..."/> tags from headings
- Removes {/* ... */} comments (preserves them in code blocks)
- Replaces template variables: {{API_KEY}} → your-api-key, {{RANDOM_CHANNEL_NAME}} → your-channel-name
Transpilation improvements:
- Uses matchAll() instead of exec() to avoid regex state issues
2. Nginx Content Negotiation with Bot Detection
- Serves markdown for Accept: text/markdown, application/markdown, or text/plain
- Adds the text/markdown MIME type with gzip compression support
Supported bots:
3. Comprehensive Test Coverage
- bin/assert-content-negotiation.sh
- yarn test
Content Negotiation Behavior
- /docs/channels (browser default)
- /docs/channels + Accept: text/markdown
- /docs/channels + Accept: application/markdown
- /docs/channels + Accept: text/plain
- /docs/channels + User-Agent: Claude-User
- /docs/channels + User-Agent: GPTBot
- /docs/channels + browser Accept header
- /docs/channels.md (direct)
Files Changed
Created:
- data/onPostBuild/transpileMdxToMarkdown.ts (~400 lines) - Core transpilation logic
- data/onPostBuild/__fixtures__/input.mdx - Comprehensive test fixture
- data/onPostBuild/transpileMdxToMarkdown.test.ts (~250 lines) - 34 Jest unit tests
- bin/assert-content-negotiation.sh (~160 lines) - 23 integration tests
Modified:
- data/onPostBuild/index.ts - Integrate transpilation into build pipeline
- config/mime.types - Add text/markdown MIME type
- config/nginx.conf.erb - Bot detection, content negotiation maps, location blocks
- .circleci/config.yml - Add content negotiation tests to CI
Test Plan
- Markdown files are generated in public/docs/ with correct paths
- Direct .md access works
Deployment Notes
This PR is tagged with review-app for deployment to a review environment.
All 211 documentation pages are now accessible in both HTML and Markdown formats with intelligent bot detection and content negotiation.
🤖 Generated with Claude Code