Web Scraper All in One Insights MCP Server

A Model Context Protocol (MCP) server that opens websites and captures all HTTP requests made by those sites. This tool is perfect for analyzing web traffic patterns, tracking third-party integrations, and understanding the network behavior of websites.

Features

  • Complete Request Capture: Captures all HTTP requests made by a website including XHR, fetch, and resource requests
  • Detailed Analysis: Provides comprehensive data including:
    • Request URLs, methods, headers
    • Response status codes and headers
    • Resource types (document, script, stylesheet, image, etc.)
    • Timestamps for timing analysis
    • Domain analysis showing all external services
  • Configurable Options:
    • Adjustable wait times for dynamic content
    • Option to include/exclude image and media requests
    • Custom viewport and user agent settings
  • Built with Patchright: Uses Patchright (enhanced Playwright) for reliable browser automation
  • Bun Runtime: Runs TypeScript directly with Bun, so no build step is needed to start the server

Requirements

  • Bun runtime
  • Google Chrome installed
  • Git

Installation

  1. Clone and Install Dependencies

    git clone <repository-url>
    cd mcp_scraper_analytics
    bun install
  2. Test the Server

    # Run the server
    bun src/index.ts

Usage

Available Tools

The server exposes the following tools. Each entry includes the parameter schema, a concrete example input, and an example of the output the tool returns.


  1. analyze_website_requests

Description: Open a website in a real browser, capture all HTTP requests, and store a site analysis. The ListTools schema documents the client-side defaults: waitTime 3000 ms, includeImages false, quickMode false. See the implementation in src/server.ts.

Parameters:

  • url (string, required): The website URL to analyze (must include http:// or https://)
  • waitTime (number, optional): Additional wait time in milliseconds for dynamic content (default: 3000, max: 10000)
  • includeImages (boolean, optional): Whether to include image and media requests (default: false)
  • quickMode (boolean, optional): Use quick loading mode with minimal waiting (default: false)

Example input:

{
  "url": "https://example.com",
  "waitTime": 3000,
  "includeImages": false,
  "quickMode": false
}

Example output (stored analysis summary):

{
  "websiteInfo": {
    "url": "https://example.com",
    "title": "Example Domain",
    "analysisTimestamp": "2025-08-15T18:00:00.000Z",
    "renderMethod": "client"
  },
  "requestSummary": {
    "totalRequests": 12,
    "uniqueDomains": 4,
    "requestsByType": {
      "document": 1,
      "script": 6,
      "stylesheet": 2,
      "xhr": 2,
      "font": 1
    }
  },
  "domains": [
    "example.com",
    "cdn.example.com",
    "analytics.google.com"
  ],
  "antiBotDetection": {
    "detected": false
  },
  "browserStorage": {
    "cookies": [
      {
        "name": "sessionid",
        "value": "abc123",
        "domain": "example.com",
        "path": "/",
        "expires": 1692136800,
        "httpOnly": true,
        "secure": true,
        "sameSite": "Lax"
      }
    ],
    "localStorage": {
      "theme": "dark",
      "pref": "1"
    },
    "sessionStorage": {
      "lastVisited": "/home"
    }
  }
}

  2. get_requests_by_domain

Description: Retrieve all captured requests for a specific domain from a previously stored analysis. This reads stored analysis results, so run analyze_website_requests first. See the handler schema in src/handlers/schemas.ts.

Parameters:

  • url (string, required): The original URL that was analyzed
  • domain (string, required): The domain to filter requests for (e.g., "api.example.com")

Example input:

{
  "url": "https://example.com",
  "domain": "api.example.com"
}

Example output:

{
  "url": "https://example.com",
  "domain": "api.example.com",
  "totalRequests": 3,
  "requests": [
    {
      "id": "req-1",
      "url": "https://api.example.com/v1/data",
      "method": "GET",
      "resourceType": "xhr",
      "status": 200,
      "timestamp": "2025-08-15T18:00:00.200Z"
    },
    {
      "id": "req-2",
      "url": "https://api.example.com/v1/auth",
      "method": "POST",
      "resourceType": "xhr",
      "status": 201,
      "timestamp": "2025-08-15T18:00:00.450Z"
    }
  ]
}

  3. get_request_details

Description: Return full details for a single captured request (headers, body if captured, response, timings). Requires both the analyzed URL and a requestId present in the stored analysis. See the tool registration in src/server.ts.

Parameters:

  • url (string, required): The original URL that was analyzed
  • requestId (string, required): The unique ID of the request to retrieve

Example input:

{
  "url": "https://example.com",
  "requestId": "req-2"
}

Example output:

{
  "id": "req-2",
  "url": "https://api.example.com/v1/auth",
  "method": "POST",
  "requestHeaders": {
    "content-type": "application/json"
  },
  "requestBody": "{\"username\":\"user\",\"password\":\"***\"}",
  "status": 201,
  "responseHeaders": {
    "content-type": "application/json"
  },
  "responseBody": "{\"token\":\"abc123\"}",
  "resourceType": "xhr",
  "timestamp": "2025-08-15T18:00:00.450Z",
  "timings": {
    "start": 1692136800.45,
    "end": 1692136800.47,
    "durationMs": 20
  }
}

  4. get_request_summary

Description: Return the stored analysis summary for a previously analyzed URL. The handler returns the same summary object produced by analyze_website_requests. See src/handlers/analysis.ts.

Parameters:

  • url (string, required): The website URL that was previously analyzed

Example input:

{
  "url": "https://example.com"
}

Example output:

{
  "websiteInfo": {
    "url": "https://example.com",
    "title": "Example Domain",
    "analysisTimestamp": "2025-08-15T18:00:00.000Z",
    "renderMethod": "client"
  },
  "requestSummary": {
    "totalRequests": 12,
    "uniqueDomains": 4,
    "requestsByType": {
      "document": 1,
      "script": 6,
      "stylesheet": 2,
      "xhr": 2,
      "font": 1
    }
  },
  "domains": [
    "example.com",
    "cdn.example.com",
    "analytics.google.com"
  ],
  "antiBotDetection": {
    "detected": false
  },
  "browserStorage": {
    "cookies": [
      {
        "name": "sessionid",
        "value": "abc123",
        "domain": "example.com",
        "path": "/",
        "expires": 1692136800,
        "httpOnly": true,
        "secure": true,
        "sameSite": "Lax"
      }
    ],
    "localStorage": {
      "theme": "dark",
      "pref": "1"
    },
    "sessionStorage": {
      "lastVisited": "/home"
    }
  }
}

  5. extract_html_elements

Description: Extract important HTML elements (text, images, links, scripts) together with their CSS selectors and basic metadata. Uses the page-level extraction logic in src/services/page_analyzer.ts.

Parameters:

  • url (string, required): The page URL to analyze
  • filterType (string, required): One of: text, image, link, script

Example input:

{
  "url": "https://example.com",
  "filterType": "text"
}

Example output:

[
  {
    "content": "Hello world",
    "selector": "#intro",
    "type": "text",
    "tag": "p",
    "attributes": {
      "id": "intro",
      "data-test": "x"
    }
  },
  {
    "content": "Title here",
    "selector": ".title",
    "type": "text",
    "tag": "h1",
    "attributes": {}
  }
]

  6. fetch

Description: Perform a direct HTTP request (no browser rendering) and return the status, headers, and body. The schema is defined in src/handlers/schemas.ts.

Parameters:

  • url (string, required): The URL to fetch
  • method (string, optional): HTTP method, one of GET, POST, PUT, DELETE, PATCH, HEAD, OPTIONS (default: GET)
  • headers (object, optional): Key-value map of request headers
  • body (string, optional): Request body for POST/PUT/PATCH

Example input:

{
  "url": "https://api.example.com/data",
  "method": "POST",
  "headers": {
    "Content-Type": "application/json"
  },
  "body": "{\"key\":\"value\"}"
}

Example output:

{
  "status": 200,
  "statusText": "OK",
  "headers": {
    "content-type": "application/json"
  },
  "body": "{\"result\":\"ok\"}",
  "durationMs": 123
}
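
For orientation, the fetch tool behaves much like a plain fetch() call. The sketch below is illustrative only and assumes nothing about the actual handler implementation (Bun provides fetch and performance as globals):

// Illustrative sketch: the example input above, expressed as a direct
// fetch() call that produces the same shape of result.
const start = performance.now();
const res = await fetch("https://api.example.com/data", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ key: "value" }),
});
console.log({
  status: res.status,
  statusText: res.statusText,
  headers: Object.fromEntries(res.headers),
  body: await res.text(),
  durationMs: Math.round(performance.now() - start),
});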

Integration with AI Assistants

Claude Desktop

Add this to your Claude Desktop MCP settings:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "web-scraper-analytics": {
      "command": "bun",
      "args": ["/path/to/mcp_scraper_analytics/src/index.ts"],
      "disabled": false
    }
  }
}

VSCode with Claude Dev

Edit ~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json:

{
  "mcpServers": {
    "web-scraper-analytics": {
      "command": "bun",
      "args": ["/path/to/mcp_scraper_analytics/src/index.ts"],
      "disabled": false
      "autoApprove": []
    }
  }
}

Use Cases

1. Security Analysis

  • Identify all third-party services a website connects to
  • Discover potential data leakage through external requests
  • Track analytics and advertising integrations

2. Performance Optimization

  • Analyze request patterns and timing
  • Identify heavy resource loading
  • Find opportunities for caching and optimization

3. Competitive Analysis

  • Understand what services competitors use
  • Discover third-party integrations and tools
  • Analyze traffic patterns and dependencies

4. Compliance Auditing

  • Track external data sharing
  • Identify GDPR/privacy compliance issues
  • Document all network communications

5. Development Debugging

  • Debug API calls and network issues
  • Analyze timing and performance problems
  • Understand complex application architectures

Example Queries for AI Assistants

"Analyze the HTTP requests made by https://techcrunch.com and show me all the advertising and analytics services they use"

"Check what third-party services https://shopify.com loads and organize them by category"

"Analyze https://github.com and tell me about their performance characteristics based on the network requests"

"Extract all text and image elements from https://example.com and return their CSS selectors and brief content summaries"

Technical Details

  • Browser Engine: Chromium via Patchright
  • Request Capture: Real browser automation with full JavaScript execution
  • Network Monitoring: Captures both requests and responses
  • Resource Types: Documents, scripts, stylesheets, images, fonts, XHR, fetch, and more
  • Timeout Handling: Configurable timeouts for different scenarios
  • Error Handling: Comprehensive error handling and logging
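
The capture approach amounts to registering Playwright-style network listeners before navigation. The following is a minimal sketch assuming Patchright mirrors the Playwright API; it is not the project's actual code (see src/server.ts and the analyzer for the real implementation):

import { chromium } from "patchright";

// Attach listeners *before* page.goto(), or early requests are missed
// (see Troubleshooting below).
const browser = await chromium.launch({ channel: "chrome" });
const page = await browser.newPage();

const requests: { url: string; method: string; resourceType: string }[] = [];
page.on("request", (req) => {
  requests.push({
    url: req.url(),
    method: req.method(),
    resourceType: req.resourceType(),
  });
});
page.on("response", (res) => console.log(res.status(), res.url()));

await page.goto("https://example.com", { waitUntil: "networkidle" });
await page.waitForTimeout(3000); // corresponds to the waitTime option
console.log(`captured ${requests.length} requests`);
await browser.close();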

Development

Scripts

# Build the project
bun run build

# Run in development mode (with auto-rebuild and restart)
bun run dev

# Test with MCP Inspector
bun run mcp-test

# Run unit tests
bun test

# Run tests with coverage
bun run test:coverage

# Run tests in watch mode (for development)
bun run test:watch

# Start production server
bun run start

# Clean build artifacts
bun run clean

Testing

The project includes comprehensive unit tests for the core analyzer functionality, run with Bun's Jest-compatible test runner.

Test Coverage

  • 66.19% Statement Coverage for analyzer.ts
  • 41.17% Branch Coverage for analyzer.ts
  • 46.15% Function Coverage for analyzer.ts

Test Features

  • URL Validation: Tests for proper URL format validation
  • Error Handling: Comprehensive error handling tests
  • Configuration Options: Tests for all analysis options (waitTime, quickMode, includeImages)
  • Request Monitoring: Tests for HTTP request capture and filtering
  • Browser Integration: Mocked browser interactions
  • Network Timeouts: Graceful handling of network timeouts
  • Response Processing: Tests for response body capture and truncation
  • Domain Analysis: Tests for domain extraction and analysis
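
A test in this style might look like the sketch below; analyzeWebsite is a hypothetical export used purely for illustration and may not match the analyzer's real API:

import { describe, expect, test } from "bun:test";
// Hypothetical import for illustration; the real module layout may differ.
import { analyzeWebsite } from "../src/analyzer";

describe("URL validation", () => {
  test("rejects a URL without an http(s) scheme", async () => {
    // The tool contract requires http:// or https:// in the URL.
    await expect(analyzeWebsite("example.com")).rejects.toThrow();
  });
});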

Running Tests

# Run all tests
bun test

# Run tests with coverage report
bun run test:coverage

# Run tests in watch mode (for development)
bun run test:watch

Project Structure

src/
├── index.ts              # Main MCP server entry point
├── server.ts             # Tool registration and ListTools schemas
├── handlers/
│   ├── schemas.ts        # Tool parameter schemas
│   └── analysis.ts       # Stored-analysis handlers
└── services/
    └── page_analyzer.ts  # Page-level element extraction
package.json              # Dependencies and scripts
tsconfig.json             # TypeScript configuration
README.md                 # This file

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test with the MCP inspector
  5. Submit a pull request

License

This project is licensed under the MIT License.

Troubleshooting

Common Issues

  1. Browser Launch Fails

    • Ensure you have proper permissions for browser automation
    • On macOS, you might need to grant accessibility permissions
  2. Requests Not Captured

    • Some requests might be made before the event listeners are attached
    • Try increasing the waitTime parameter
    • Check if the site uses complex async loading
  3. Memory Issues

    • Large sites with many requests might consume significant memory
    • Consider filtering out images and media if not needed
    • Close browser contexts properly

Debug Mode

Enable verbose logging by setting the environment variable:

DEBUG=1 bun run dev

Roadmap

  • Add request/response body capture (for POST requests)
  • Implement result caching for repeated analyses
  • Add support for custom browser configurations
  • Implement request filtering by domain/type
  • Add export formats (CSV, JSON, XML)
  • Implement screenshot capture alongside request analysis
