Feature: Add opt-in telemetry system to help improve Crawl4AI stability through anonymous crash reporting #1420
base: develop
Conversation
…tability improvement

Implement a privacy-first, provider-agnostic telemetry system to help improve Crawl4AI stability through anonymous crash reporting. The system is designed with user privacy as the top priority, collecting only exception information without any PII, URLs, or crawled content.

Architecture & Design:
- Provider-agnostic architecture with base TelemetryProvider interface
- Sentry as the initial provider implementation with easy extensibility
- Separate handling for sync and async code paths
- Environment-aware behavior (CLI, Docker, Jupyter/Colab)

Key Features:
- Opt-in by default for CLI/library usage with interactive consent prompt
- Opt-out by default for Docker/API server (enabled unless CRAWL4AI_TELEMETRY=0)
- Jupyter/Colab support with widget-based consent (fallback to code snippets)
- Persistent consent storage in ~/.crawl4ai/config.json
- Optional email collection for critical issue follow-up

CLI Integration:
- `crwl telemetry enable [--email <email>] [--once]` - Enable telemetry
- `crwl telemetry disable` - Disable telemetry
- `crwl telemetry status` - Check current status

Python API:
- Decorators: @telemetry_decorator, @async_telemetry_decorator
- Context managers: telemetry_context(), async_telemetry_context()
- Manual capture: capture_exception(exc, context)
- Control: telemetry.enable(), telemetry.disable(), telemetry.status()

Privacy Safeguards:
- No URL collection
- No request/response data
- No authentication tokens or cookies
- No crawled content
- Automatic sanitization of sensitive fields
- Local consent storage only

Testing:
- Comprehensive test suite with 15 test cases
- Coverage for all environments and consent flows
- Mock providers for testing without external dependencies

Documentation:
- Detailed documentation in docs/md_v2/core/telemetry.md
- Added to mkdocs navigation under Core section
- Privacy commitment and FAQ included
- Examples for all usage patterns

Installation:
- Optional dependency: pip install crawl4ai[telemetry]
- Graceful degradation if sentry-sdk not installed
- Added to pyproject.toml optional dependencies
- Docker requirements updated

Integration Points:
- AsyncWebCrawler: Automatic exception capture in arun() and aprocess_html()
- Docker server: Automatic initialization with environment control
- Global exception handler for uncaught exceptions (CLI only)

This implementation provides valuable error insights to improve Crawl4AI while maintaining complete transparency and user control over data collection.
Summary
Add an opt-in telemetry system to help improve Crawl4AI stability through anonymous crash reporting. This implementation provides a privacy-first, provider-agnostic telemetry infrastructure that captures only exception information, without any PII, URLs, or crawled content.

The telemetry system is designed to be completely transparent and user-controlled: opt-in by default for CLI/library usage and opt-out by default for Docker deployments.
Fix: #1409
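One of the privacy safeguards listed in the commit, "automatic sanitization of sensitive fields", might look roughly like this. The key list and function name are assumptions for illustration:

```python
SENSITIVE_KEYS = {"authorization", "cookie", "token", "password", "api_key", "email"}

def sanitize(context: dict) -> dict:
    """Return a copy of the context with sensitive values redacted (hypothetical sketch)."""
    clean = {}
    for key, value in context.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = sanitize(value)  # recurse into nested dicts
        else:
            clean[key] = value
    return clean

print(sanitize({"component": "crawler", "headers": {"Cookie": "abc"}}))
# {'component': 'crawler', 'headers': {'Cookie': '[REDACTED]'}}
```

Running the sanitizer on every context before it reaches a provider is what lets the PR claim that tokens and cookies never leave the machine.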
List of files changed and why

New files:
- crawl4ai/telemetry/__init__.py - Main telemetry module with manager, decorators, and public API
- crawl4ai/telemetry/base.py - Provider interface and base classes for extensibility
- crawl4ai/telemetry/config.py - Configuration management and persistence
- crawl4ai/telemetry/consent.py - User consent handling for different environments
- crawl4ai/telemetry/environment.py - Runtime environment detection (CLI, Docker, Jupyter)
- crawl4ai/telemetry/providers/sentry.py - Sentry provider implementation
- tests/telemetry/test_telemetry.py - Comprehensive test suite (15 test cases)
- docs/md_v2/core/telemetry.md - Complete telemetry documentation

Modified files:
- crawl4ai/cli.py - Added telemetry CLI commands (enable/disable/status)
- crawl4ai/async_webcrawler.py - Integrated telemetry decorators for exception capture
- deploy/docker/server.py - Added Docker telemetry initialization
- deploy/docker/requirements.txt - Added sentry-sdk dependency
- pyproject.toml - Added optional telemetry dependencies
- mkdocs.yml - Added telemetry documentation to navigation

How Has This Been Tested?
Unit Tests: Created a comprehensive test suite with 15 test cases covering all environments and consent flows.

Integration Testing: Verified the CLI commands (`crwl telemetry enable/disable/status`).

Manual Testing: Verified consent persistence in `~/.crawl4ai/config.json`.

All tests pass successfully with `pytest tests/telemetry/test_telemetry.py`.

Checklist: