Member

@kennethkalmer kennethkalmer commented Dec 3, 2025

Summary

Implements automatic MDX-to-Markdown transpilation during the Gatsby build process and adds nginx content negotiation to serve markdown documentation to LLMs and text-based tools while maintaining HTML for browsers.

Related to ticket: WEB-4447

What's Changed

1. MDX to Markdown Transpilation (9-stage pipeline)

  • Transpiles all 211 MDX files under src/pages/docs/ to .md format during build (~5 seconds)
  • Runs in onPostBuild hook after Gatsby page generation
  • Stage 1: Parse frontmatter - keeps only title, converts to # Title heading
  • Stage 2: Remove import/export statements - handles both single and multi-line
  • Stage 3: Remove script tags (except in code blocks) - strips JSON-LD schemas
  • Stage 4: Remove anchor tags - strips <a id="..."/> tags from headings
  • Stage 5: Remove JSX comments - strips {/* ... */} comments (preserves in code blocks)
  • Stage 6: Convert image paths to GitHub URLs - handles relative, absolute, and direct paths
  • Stage 7: Convert relative URLs to absolute - uses siteUrl from GraphQL
  • Stage 8: Replace template variables - {{API_KEY}} → your-api-key, {{RANDOM_CHANNEL_NAME}} → your-channel-name
  • Stage 9: Prepend title as markdown heading
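For readers who want a feel for what two of these stages involve, here is a minimal sketch of Stage 5 (JSX comment removal that leaves code blocks alone) and Stage 8 (template variable replacement). The helper names are hypothetical and not taken from transpileMdxToMarkdown.ts, and applying Stage 8 inside code blocks as well is an assumption.

```typescript
// Hypothetical helper names; the real transpileMdxToMarkdown.ts may be organised differently.

// Three backticks, written as escapes so this sample can itself sit inside a fenced block.
const FENCE = '\u0060\u0060\u0060';
const FENCED_BLOCK = new RegExp('(' + FENCE + '[\\s\\S]*?' + FENCE + ')', 'g');

// Apply a transform to everything except fenced code blocks.
function transformOutsideCodeBlocks(content: string, transform: (text: string) => string): string {
  // split() with a capturing group keeps the fenced blocks at the odd indexes.
  return content
    .split(FENCED_BLOCK)
    .map((part, index) => (index % 2 === 1 ? part : transform(part)))
    .join('');
}

// Stage 5 (sketch): strip JSX comments such as {/* ... */}, but not inside code blocks.
function removeJsxComments(content: string): string {
  return transformOutsideCodeBlocks(content, (text) => text.replace(/\{\/\*[\s\S]*?\*\/\}/g, ''));
}

// Stage 8 (sketch): swap template variables for readable placeholders.
// Assumption: this applies everywhere, including inside code examples.
function replaceTemplateVariables(content: string): string {
  return content
    .replace(/\{\{API_KEY\}\}/g, 'your-api-key')
    .replace(/\{\{RANDOM_CHANNEL_NAME\}\}/g, 'your-channel-name');
}
```

Chaining the helpers, for example `replaceTemplateVariables(removeJsxComments(source))`, mirrors the idea of running the stages in sequence.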

Transpilation improvements:

  • Uses matchAll() instead of exec() to avoid regex state issues
  • Handles edge cases (top-level docs/index.mdx, multi-line imports/exports)
  • Image paths validated with file extensions (.png, .jpg, .svg, etc.)
  • Queries siteUrl from Gatsby GraphQL for resilience
  • Comprehensive error handling with detailed logging
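To illustrate the matchAll() point: a global regex used with exec() or test() carries its position in lastIndex between calls, which can cause surprises when the pattern is shared. A rough sketch follows, assuming a hypothetical image pattern that is not the one used in the PR.

```typescript
// Hypothetical pattern for markdown image references with a validated extension.
const IMAGE_PATTERN = /!\[[^\]]*\]\(([^)\s]+\.(?:png|jpg|svg))\)/g;

// matchAll() iterates over an internal copy of the regex, so the shared
// IMAGE_PATTERN.lastIndex is left untouched between calls.
function findImagePaths(content: string): string[] {
  return Array.from(content.matchAll(IMAGE_PATTERN), (match) => match[1]);
}
```

Because no state leaks through the shared pattern, calling `findImagePaths` repeatedly on the same content returns the same list each time.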

2. Nginx Content Negotiation with Bot Detection

  • Accept header detection: Serves markdown for text/markdown, application/markdown, or text/plain
  • Bot detection: Identifies LLM bots by User-Agent (Claude, ChatGPT, Perplexity, Google AI)
  • Combined logic: Serves markdown if EITHER bot detected OR markdown requested via Accept
  • Conservative bot list: Only explicit bot names (11 patterns), no generic HTTP libraries
  • Added text/markdown MIME type with gzip compression support
  • Uses nginx map blocks for optimal performance (no if statements)
  • Maintains backward compatibility (HTML is default for browsers)
  • Works with existing authentication system

Supported bots:

  • Anthropic: Claude-User, ClaudeBot, anthropic-ai
  • OpenAI: ChatGPT-User, GPTBot
  • Perplexity: PerplexityBot, Perplexity-User
  • Google AI: Google-Extended, GoogleOther, Gemini
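As a rough model of the User-Agent check, the sketch below does a substring match against the ten names listed above (the PR mentions 11 patterns, so the real list has at least one entry not shown here). The function name is hypothetical, and case-insensitive matching is an assumption; the actual logic lives in nginx map blocks.

```typescript
// User-Agent substrings copied from the "Supported bots" list above.
const LLM_BOT_PATTERNS: readonly string[] = [
  'Claude-User', 'ClaudeBot', 'anthropic-ai',
  'ChatGPT-User', 'GPTBot',
  'PerplexityBot', 'Perplexity-User',
  'Google-Extended', 'GoogleOther', 'Gemini',
];

// Hypothetical helper: true when the User-Agent contains one of the bot names.
// Assumption: matching is case-insensitive.
function isLlmBot(userAgent: string | undefined): boolean {
  if (!userAgent) return false;
  const ua = userAgent.toLowerCase();
  return LLM_BOT_PATTERNS.some((pattern) => ua.includes(pattern.toLowerCase()));
}
```

Generic HTTP clients such as curl are deliberately absent from the list, so they only receive markdown when they ask for it via the Accept header.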

3. Comprehensive Test Coverage

  • 34 Jest unit tests with fixtures and snapshot testing
    • Full transformation fixture test
    • Individual tests for all 9 transformation stages
    • Edge case tests (missing title, top-level index, code block preservation)
  • 23 integration tests via bin/assert-content-negotiation.sh
    • Accept header negotiation (6 tests)
    • Browser behavior (2 tests)
    • Direct access (2 tests)
    • Path variations (3 tests)
    • Edge cases (3 tests)
    • Bot detection via User-Agent (6 tests)
    • Combined bot + Accept header (2 tests)
  • All tests passing locally and integrated with yarn test

Content Negotiation Behavior

| Request | Response |
| --- | --- |
| /docs/channels (browser default) | HTML |
| /docs/channels + Accept: text/markdown | Markdown |
| /docs/channels + Accept: application/markdown | Markdown |
| /docs/channels + Accept: text/plain | Markdown |
| /docs/channels + User-Agent: Claude-User | Markdown (even without Accept header) |
| /docs/channels + User-Agent: GPTBot | Markdown |
| /docs/channels + browser Accept header | HTML |
| /docs/channels.md (direct) | Markdown |

Files Changed

Created:

  • data/onPostBuild/transpileMdxToMarkdown.ts (~400 lines) - Core transpilation logic
  • data/onPostBuild/__fixtures__/input.mdx - Comprehensive test fixture
  • data/onPostBuild/transpileMdxToMarkdown.test.ts (~250 lines) - 34 Jest unit tests
  • bin/assert-content-negotiation.sh (~160 lines) - 23 integration tests

Modified:

  • data/onPostBuild/index.ts - Integrate transpilation into build pipeline
  • config/mime.types - Add text/markdown MIME type
  • config/nginx.conf.erb - Bot detection, content negotiation maps, location blocks
  • .circleci/config.yml - Add content negotiation tests to CI

Test Plan

  • All 211 MDX files transpile successfully
  • Markdown files in public/docs/ with correct paths
  • All transformations verified (anchors, comments, images, URLs, templates)
  • Content negotiation works for all Accept headers
  • Bot detection works for all LLM User-Agents
  • Browsers get HTML (backward compatible)
  • Direct .md access works
  • 34 Jest unit tests passing
  • 23 integration tests passing
  • CI tests pass
  • Review app deployment successful
  • Manual LLM testing in review app

Deployment Notes

This PR is tagged with review-app for deployment to a review environment.

All 211 documentation pages are now accessible in both HTML and Markdown formats with intelligent bot detection and content negotiation.

🤖 Generated with Claude Code

@kennethkalmer kennethkalmer added the review-app Create a Heroku review app label Dec 3, 2025
coderabbitai bot commented Dec 3, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

@ably-ci ably-ci had a problem deploying to ably-docs-web-4447-tran-bjkybb December 3, 2025 14:42 Failure
@kennethkalmer kennethkalmer temporarily deployed to ably-docs-web-4447-tran-bjkybb December 3, 2025 14:54 Inactive
@ably-ci ably-ci temporarily deployed to ably-docs-web-4447-tran-mqes5f December 3, 2025 16:52 Inactive
@kennethkalmer kennethkalmer had a problem deploying to ably-docs-web-4447-tran-mqes5f December 3, 2025 22:34 Failure
@kennethkalmer kennethkalmer temporarily deployed to ably-docs-web-4447-tran-mqes5f December 4, 2025 09:06 Inactive
@kennethkalmer kennethkalmer self-assigned this Dec 4, 2025
@kennethkalmer kennethkalmer temporarily deployed to ably-docs-web-4447-tran-mqes5f December 4, 2025 10:28 Inactive
@kennethkalmer kennethkalmer force-pushed the WEB-4447-transpile-mdx-to-md branch from de0ba4d to f227aa4 Compare December 4, 2025 10:54
@kennethkalmer kennethkalmer temporarily deployed to ably-docs-web-4447-tran-mqes5f December 4, 2025 10:54 Inactive
@kennethkalmer kennethkalmer marked this pull request as ready for review December 4, 2025 11:46
Contributor

@m-hulbert m-hulbert left a comment

This LGTM (under the assumption that links contain ably-dev because it's a review app). There are a few really minor changes we could address, but I think those are more source-based than related to the .md generation. 🚀

@kennethkalmer
Member Author

@m-hulbert thanks! let me get the code ready for team review then :)

Copilot AI left a comment

Pull request overview

This PR implements automatic MDX-to-Markdown transpilation during the Gatsby build process and adds nginx content negotiation to serve markdown documentation to LLMs and text-based tools while maintaining HTML for browsers.

Key Changes:

  • New transpilation module that converts MDX files to markdown format with content transformations (frontmatter removal, import/export stripping, template variable replacement, etc.)
  • Nginx content negotiation using Accept header mapping to serve .md or .html files based on client preferences
  • Comprehensive test suite with 16 test cases for content negotiation behavior

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| data/onPostBuild/transpileMdxToMarkdown.ts | Core transpilation logic with transformation functions for MDX-to-Markdown conversion |
| data/onPostBuild/index.ts | Integrates transpilation into build pipeline |
| config/nginx.conf.erb | Implements content negotiation via map blocks and updates try_files directives |
| config/mime.types | Adds text/markdown MIME type support |
| bin/assert-content-negotiation.sh | Test suite for verifying content negotiation behavior |
| .circleci/config.yml | Adds content negotiation tests to CI pipeline |

Contributor

@sacOO7 sacOO7 left a comment

Going through the code and preview app.
Give me some time to review this and test against LLM.
Meanwhile you can check @copilot review comments.

@kennethkalmer kennethkalmer force-pushed the WEB-4447-transpile-mdx-to-md branch from f227aa4 to f251b2b Compare December 5, 2025 17:26
@kennethkalmer kennethkalmer temporarily deployed to ably-docs-web-4447-tran-mqes5f December 5, 2025 17:26 Inactive
@kennethkalmer
Member Author

@sacOO7 thanks, passed the comments back to Claudius for the appropriate treatment, and I also added some unit tests.

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 3 comments.

Contributor

Copilot AI commented Dec 6, 2025

@sacOO7 I've opened a new pull request, #3009, to work on those changes. Once the pull request is ready, I'll request review from you.

Contributor

@sacOO7 sacOO7 left a comment

The PR assumes that an LLM crawler or web-fetch tool requests content with one of the following Accept header values: text/markdown, application/markdown, or text/plain. If a crawler's default configuration sends something else, the content will be returned as HTML.

@kennethkalmer kennethkalmer force-pushed the WEB-4447-transpile-mdx-to-md branch from 4a9f682 to d56f83e Compare December 9, 2025 11:09
@kennethkalmer kennethkalmer temporarily deployed to ably-docs-web-4447-tran-mqes5f December 9, 2025 11:09 Inactive
@kennethkalmer kennethkalmer requested review from a team and jamiehenson December 9, 2025 11:19

// Report summary
if (failureCount > 0) {
  reporter.warn(

Contributor

@sacOO7 sacOO7 Dec 9, 2025

Suggested change
- reporter.warn(
+ reporter.panicOnBuild(

If any part of the Transpilation to Markdown step fails, it should throw an error and the build should fail, right?
Otherwise, we risk introducing inconsistencies in the Markdown files that will be served to the LLMs.

Member Author

At the moment, while we solidify this, I think having graceful degradation and falling back to an HTML response is better.

Contributor

@sacOO7 sacOO7 Dec 10, 2025

Sounds good 👍
You can create a separate issue to track/handle failing markdowns so they can be addressed in the near future.
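For anyone following this thread later, the trade-off can be sketched as a summary step that either warns (graceful degradation, so nginx falls back to HTML for pages without a .md file) or fails the build. This is only an illustration: the `Reporter` interface is a minimal stand-in for the subset of Gatsby's reporter used here, and `reportSummary`, `TranspileResult` and the `failBuild` flag are hypothetical names that do not appear in the PR.

```typescript
// Minimal stand-in for the parts of Gatsby's reporter this sketch needs.
interface Reporter {
  info(message: string): void;
  warn(message: string): void;
  panicOnBuild(message: string): void;
}

// Hypothetical per-file outcome; `error` is set only when transpilation failed.
interface TranspileResult {
  file: string;
  error?: string;
}

// Summarise the run. With failBuild = false the build continues and only warns;
// with failBuild = true any failure is reported through panicOnBuild instead.
function reportSummary(results: TranspileResult[], reporter: Reporter, failBuild = false): void {
  const failures = results.filter((result) => result.error !== undefined);
  reporter.info(`Transpiled ${results.length - failures.length} of ${results.length} MDX files`);
  if (failures.length === 0) return;

  const message = `${failures.length} file(s) failed: ${failures.map((f) => f.file).join(', ')}`;
  if (failBuild) {
    reporter.panicOnBuild(message);
  } else {
    reporter.warn(message);
  }
}
```

Keeping the choice behind a single flag would make it easy to switch to the stricter behaviour once the failing files have been tracked down.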

Contributor

@sacOO7 sacOO7 left a comment

I did some research using Opus 4.5.
Based on Cloudflare's July 2025 data, the AI crawler market share breakdown is:

| Crawler | Share (July 2025) | In Your Config? |
| --- | --- | --- |
| GPTBot | 11.7% | |
| ClaudeBot | ~10% | |
| Meta-ExternalAgent | 7.5% | ❌ Missing |
| Amazonbot | 5.9% | |
| Bytespider | 2.4% | |
| PerplexityBot | ~3% | |
| Google-Extended | ~5% | |

Current config covers roughly 30% of AI crawler traffic.
Adding Meta-ExternalAgent, Amazonbot and Bytespider should cover ~45% of all traffic.
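The two coverage figures can be checked with a few lines of arithmetic. This sketch simply sums the shares quoted above, treating the "~" values as exact, which is an assumption; the grouping into "covered" and "proposed" follows the two sentences above.

```typescript
// Shares in percent, copied from the figures quoted above ("~" values taken as exact).
const coveredShares: Record<string, number> = {
  GPTBot: 11.7,
  ClaudeBot: 10,
  PerplexityBot: 3,
  'Google-Extended': 5,
};

// Crawlers proposed for addition.
const proposedShares: Record<string, number> = {
  'Meta-ExternalAgent': 7.5,
  Amazonbot: 5.9,
  Bytespider: 2.4,
};

function totalShare(shares: Record<string, number>): number {
  return Object.keys(shares).reduce((sum, name) => sum + shares[name], 0);
}

const currentCoverage = totalShare(coveredShares); // roughly 29.7, i.e. "roughly 30%"
const coverageWithAdditions = currentCoverage + totalShare(proposedShares); // roughly 45.5, i.e. "~45%"
```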

Contributor

@sacOO7 sacOO7 left a comment

Apart from above comments, looks good 👍

@kennethkalmer kennethkalmer force-pushed the WEB-4447-transpile-mdx-to-md branch from d56f83e to 7515c07 Compare December 9, 2025 16:28
@kennethkalmer kennethkalmer temporarily deployed to ably-docs-web-4447-tran-mqes5f December 9, 2025 16:29 Inactive
@kennethkalmer kennethkalmer force-pushed the WEB-4447-transpile-mdx-to-md branch from 7515c07 to 17d0867 Compare December 9, 2025 22:26
@kennethkalmer kennethkalmer temporarily deployed to ably-docs-web-4447-tran-mqes5f December 9, 2025 22:26 Inactive
@kennethkalmer
Member Author

I spent a bit of time trimming some fat and statistics from the commit messages as well

Contributor

@sacOO7 sacOO7 left a comment

LGTM

kennethkalmer and others added 5 commits December 11, 2025 09:58
Implements automatic transpilation of MDX documentation files to
Markdown format during the Gatsby build process. The transpiled Markdown
files are generated in the public/docs/ directory alongside HTML output,
making documentation accessible to LLMs and other text-based tools.

Key features:
- Transpiles all MDX files under src/pages/docs/ to .md format
- Removes frontmatter except title (converted to # heading)
- Removes import/export statements and script tags
- Replaces template variables ({{API_KEY}}, {{RANDOM_CHANNEL_NAME}})
- Preserves JSX components and code blocks as-is
- Smart path mapping: index.mdx → parent.md, file.mdx → file.md
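
The path mapping in the last item can be pictured with a small sketch. The function name is hypothetical, paths are assumed to be relative to src/pages, and the handling of the top-level docs/index.mdx here (collapsing to docs.md) is an assumption rather than confirmed behaviour.

```typescript
// Hypothetical helper: map an MDX source path to its markdown output path.
// index.mdx collapses onto its parent directory; other files keep their name.
function mdxPathToMarkdownPath(relativePath: string): string {
  const withoutExtension = relativePath.replace(/\.mdx$/, '');
  const collapsed = withoutExtension.replace(/\/index$/, '');
  return `${collapsed}.md`;
}
```

For instance, `docs/channels/index.mdx` maps to `docs/channels.md`, which matches the file nginx serves for /docs/channels when markdown is requested.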

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implements HTTP content negotiation in nginx to serve markdown versions
of documentation pages based on the Accept header. This allows clients
(like LLMs and text-based tools) to request markdown format while
browsers continue to receive HTML by default.

Key features:
- Serves markdown for Accept: text/markdown, application/markdown, or
text/plain
- Maintains backward compatibility (HTML is default)
- Works with existing authentication system
- Supports both index and non-index file paths
- No performance impact (uses nginx map blocks)

Content negotiation behavior:
- /docs/channels with Accept: text/markdown → serves docs/channels.md
- /docs/channels with Accept: text/html → serves
docs/channels/index.html
- /docs/channels (browser default) → serves docs/channels/index.html
- /docs/channels.md (direct access) → serves docs/channels.md

Implementation:
- Added text/markdown MIME type to config/mime.types
- Added text/markdown to gzip_types for compression
- Created map blocks to detect Accept header preferences
- Updated location blocks to use content-negotiated file paths
- Fallback to HTML when markdown doesn't exist

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Adds comprehensive CI test suite to verify content negotiation works
correctly for all Accept header scenarios. The test suite validates that
nginx serves markdown or HTML based on the Accept header, with proper
fallback behavior.

Test coverage:
- Basic content negotiation: text/markdown, application/markdown,
text/plain, text/html, */*
- Browser behavior: Complex Accept headers, HTML priority when listed
first
- Direct access: .md and .html file access
- Path variations: Index paths, non-index paths, nested paths
- Edge cases: 404 handling, fallback behavior, non-docs paths

Also fixes nginx map priority order to ensure anchored patterns
(^text/html, ^text/plain) are evaluated before wildcard patterns. This
ensures "text/html, text/markdown" correctly serves HTML instead of
markdown.
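
The ordering fix can be modelled as a first-match-wins rule list, which is how regex entries in an nginx map are evaluated (in order of appearance). In this sketch only the two anchored patterns come from the commit message; the remaining patterns and all names are assumptions.

```typescript
type Format = 'markdown' | 'html';

// Ordered rules: anchored patterns first, so the first media type listed wins.
// Only ^text/html and ^text/plain are named in the commit; the rest are guesses.
const ACCEPT_RULES: ReadonlyArray<[RegExp, Format]> = [
  [/^text\/html/, 'html'],
  [/^text\/plain/, 'markdown'],
  [/text\/markdown/, 'markdown'],
  [/application\/markdown/, 'markdown'],
];

// Hypothetical helper: return the format of the first matching rule, defaulting to HTML.
function formatForAccept(accept: string | undefined): Format {
  if (!accept) return 'html';
  for (const [pattern, format] of ACCEPT_RULES) {
    if (pattern.test(accept)) return format;
  }
  return 'html';
}
```

With the anchored rules first, `formatForAccept('text/html, text/markdown')` resolves to HTML, while a bare `text/markdown` still resolves to markdown.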

Changes:
- Created bin/assert-content-negotiation.sh with run_test() helper
function
- Integrated test into CircleCI test-nginx job
- Reordered nginx map patterns for correct priority matching

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Enhances content negotiation to serve markdown to LLM bots based on
User-Agent strings, in addition to Accept header detection. This ensures
bots like Claude, ChatGPT, Perplexity, and Google AI get markdown even
if they don't send proper Accept headers.

Bot detection:
- Detects several LLM bot User-Agents (Claude, ChatGPT, Perplexity,
Google AI)
- Conservative list - no generic HTTP libraries to avoid false positives
- Combines with existing Accept header logic using nginx map variables
- Serves markdown if EITHER bot detected OR Accept header requests
markdown

Implementation:
- Added $is_llm_bot map for User-Agent pattern matching
- Updated $docs_file_extension map to combine bot + Accept header
detection
- Uses map variable concatenation:
"${is_llm_bot}${wants_markdown_via_accept}"
- Works seamlessly with existing try_files logic
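
The concatenation trick amounts to joining two flags into one lookup key, so a single map can express an OR without if statements. A rough model follows, assuming both flags are "0" or "1" and that the result is a plain extension; the real nginx variable values may differ.

```typescript
type DocsExtension = '.md' | '.html';

// Lookup keyed on "<is_llm_bot><wants_markdown_via_accept>"; values are assumptions.
const EXTENSION_BY_KEY: Record<string, DocsExtension> = {
  '00': '.html', // neither a bot nor a markdown Accept header
  '01': '.md', // markdown requested via Accept
  '10': '.md', // bot detected
  '11': '.md', // both
};

// Hypothetical helper mirroring "${is_llm_bot}${wants_markdown_via_accept}".
function docsFileExtension(isLlmBot: boolean, wantsMarkdownViaAccept: boolean): DocsExtension {
  const key = `${isLlmBot ? 1 : 0}${wantsMarkdownViaAccept ? 1 : 0}`;
  return EXTENSION_BY_KEY[key] ?? '.html';
}
```

Only the "00" key yields HTML, which is the EITHER/OR behaviour described above.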

Testing:
- Added new tests for bot User-Agent detection
- Tests bot override behavior (bot gets markdown even with Accept:
text/html)
- Verified browsers still get HTML by default

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@kennethkalmer kennethkalmer force-pushed the WEB-4447-transpile-mdx-to-md branch from 17d0867 to 40b46d7 Compare December 11, 2025 09:58
@kennethkalmer kennethkalmer temporarily deployed to ably-docs-web-4447-tran-mqes5f December 11, 2025 09:58 Inactive
Member

@matt423 matt423 left a comment

Admittedly I tested and reviewed this at more of a surface level, but it works as expected in the browser and with a direct .md request, and Copilot got markdown by default too.

@kennethkalmer kennethkalmer merged commit ef6c96b into main Dec 11, 2025
7 checks passed
@kennethkalmer kennethkalmer deleted the WEB-4447-transpile-mdx-to-md branch December 11, 2025 10:06