WebDOM Extractor is a content extraction system that converts complex web pages into clean, structured output suited to reading and information retrieval. It is built on the Postlight Parser engine and adds batch processing, caching, and sanitized output on top of it.
- Pristine Content Extraction - Strip away navigation, advertising, and other non-content elements
- Multiple Output Formats - Convert to JSON, Markdown, Plain Text, and HTML
- Content Structure Preservation - Maintain semantic structure during extraction
- High-Volume Processing - Process hundreds of URLs with asynchronous batch operations
- Caching System - Intelligent content caching to minimize redundant processing
- Exhaustive Error Handling - Comprehensive error recovery with detailed logging
- Enterprise Security - Sanitized output to prevent XSS and other injection attacks
- Extensible Architecture - Plugin system for custom content processors
- Command Line Interface - Powerful CLI with extensive configuration options
- Advanced Configuration - Fine-tune extraction parameters for your specific use cases
- Comprehensive Testing - 95%+ test coverage with unit and integration tests
- Python 3.7+
- Node.js 12+
- Postlight Parser
# Install Node.js dependencies
npm install -g @postlight/parser
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install Python package
pip install -e .
from webdom_extractor import Extractor
# Extract content from URL
extractor = Extractor()
document = extractor.extract_url("https://example.com/article")
# Get content in different formats
json_data = document.to_json()
markdown = document.to_markdown()
plain_text = document.to_text()
# Save to file
document.save("output.md", format="markdown")
# Basic usage
webdom extract https://example.com/article
# Specify output format
webdom extract https://example.com/article --format markdown
# Output to file
webdom extract https://example.com/article --output article.md
# Batch processing from a file list
webdom batch url_list.txt --output-dir ./extracted_content
# With custom configuration
webdom extract https://example.com/article --config custom_config.json
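For scripted use, the commands above can be assembled and run from Python. This is a minimal sketch, assuming the `webdom` executable is on `PATH` and using only the flags shown above (`--format`, `--output`, `--config`). The helper names `build_extract_command` and `run_extract` are hypothetical and not part of the package.

```python
import subprocess
from typing import List, Optional


def build_extract_command(
    url: str,
    fmt: Optional[str] = None,
    output: Optional[str] = None,
    config: Optional[str] = None,
) -> List[str]:
    """Assemble a `webdom extract` argument list (hypothetical helper)."""
    cmd = ["webdom", "extract", url]
    if fmt is not None:
        cmd += ["--format", fmt]
    if output is not None:
        cmd += ["--output", output]
    if config is not None:
        cmd += ["--config", config]
    return cmd


def run_extract(url: str, **options: Optional[str]) -> str:
    """Run the CLI and return its stdout.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(
        build_extract_command(url, **options),
        capture_output=True,  # available from Python 3.7
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with URLs that contain `&` or `?`.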
WebDOM Extractor can be extensively configured to handle different extraction scenarios:
{
  "extraction": {
    "preserve_images": true,
    "extract_comments": false,
    "ignore_links": true
  },
  "formatting": {
    "line_width": 80,
    "heading_style": "atx",
    "wrap_blocks": true
  },
  "performance": {
    "cache_enabled": true,
    "cache_ttl": 86400,
    "parallel_requests": 5
  }
}
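A custom configuration file usually overrides only a few of these settings. The sketch below shows one way to overlay a partial file on a full set of values and catch misspelled keys or wrong types early. It assumes the values in the example above are the defaults, which the source does not state; `DEFAULTS`, `merge_config`, and `load_config` are hypothetical names, not the package's API.

```python
import json
from copy import deepcopy
from typing import Any, Dict

Config = Dict[str, Dict[str, Any]]

# Assumption: these mirror the example configuration shown above.
DEFAULTS: Config = {
    "extraction": {"preserve_images": True, "extract_comments": False, "ignore_links": True},
    "formatting": {"line_width": 80, "heading_style": "atx", "wrap_blocks": True},
    "performance": {"cache_enabled": True, "cache_ttl": 86400, "parallel_requests": 5},
}


def merge_config(overrides: Config) -> Config:
    """Overlay overrides on DEFAULTS, rejecting unknown keys and wrong types."""
    merged = deepcopy(DEFAULTS)  # never mutate the shared defaults
    for section, values in overrides.items():
        if section not in merged:
            raise KeyError(f"unknown section: {section}")
        for key, value in values.items():
            if key not in merged[section]:
                raise KeyError(f"unknown key: {section}.{key}")
            expected = type(merged[section][key])
            # Exact type match, so that True is not accepted where an int is expected.
            if type(value) is not expected:
                raise TypeError(f"{section}.{key} must be {expected.__name__}")
            merged[section][key] = value
    return merged


def load_config(path: str) -> Config:
    """Read a JSON config file and merge it over the defaults."""
    with open(path, encoding="utf-8") as fh:
        return merge_config(json.load(fh))
```

For example, `merge_config({"performance": {"parallel_requests": 10}})` keeps `cache_ttl` at 86400 (one day, if the value is in seconds) and changes only the request limit.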
WebDOM Extractor excels in enterprise contexts:
- Content Management Systems - Clean import of external content
- Knowledge Management - Extract and index information from the web
- Compliance & Archiving - Save web content for regulatory requirements
- Market Intelligence - Collect and analyze competitor content
- Data Mining & Analysis - Extract structured data for analysis
- Research Automation - Collect and organize research content
WebDOM Extractor is built on a modular architecture:
┌─────────────────┐     ┌───────────────┐     ┌────────────────┐
│ Content Sources │────▶│ Extraction    │────▶│ Post-Processing│
│ - URLs          │     │ - HTML parsing│     │ - Formatting   │
│ - HTML files    │     │ - Content     │     │ - Sanitization │
│ - Web archives  │     │   detection   │     │ - Structure    │
└─────────────────┘     └───────────────┘     └────────────────┘
                                                       │
                                                       ▼
┌─────────────────┐     ┌───────────────┐     ┌────────────────┐
│ Applications    │◀────│ Output        │◀────│ Document Model │
│ - Analytics     │     │ - JSON        │     │ - Metadata     │
│ - Archiving     │     │ - Markdown    │     │ - Content      │
│ - Publishing    │     │ - Plain text  │     │ - Structure    │
└─────────────────┘     └───────────────┘     └────────────────┘
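The flow in the diagram can be illustrated with a toy pipeline built only on the Python standard library. This is a simplified sketch of the stages (extraction, post-processing, document model, output), not the package's implementation: the real extraction is delegated to Postlight Parser, and `SketchDocument`, `_BlockCollector`, and `extract_html` are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple

Block = Tuple[str, str]  # (kind, text); kind is "heading" or "paragraph"


class _BlockCollector(HTMLParser):
    """Extraction stage: keep <h1>/<p> text, skip <nav>, <script>, <style>."""

    SKIP = {"nav", "script", "style"}
    KEEP = {"h1": "heading", "p": "paragraph"}

    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[Block] = []
        self._skip_depth = 0
        self._kind: Optional[str] = None
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.KEEP and self._skip_depth == 0:
            self._kind, self._buffer = self.KEEP[tag], []

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.KEEP and self._kind is not None:
            # Post-processing stage: collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.blocks.append((self._kind, text))
            self._kind = None

    def handle_data(self, data):
        if self._kind is not None and self._skip_depth == 0:
            self._buffer.append(data)


@dataclass
class SketchDocument:
    """Document model stage: metadata plus ordered content blocks."""

    metadata: Dict[str, str]
    blocks: List[Block] = field(default_factory=list)

    # Output stage: one method per format.
    def to_json(self) -> str:
        content = [{"type": kind, "text": text} for kind, text in self.blocks]
        return json.dumps({"metadata": self.metadata, "content": content})

    def to_markdown(self) -> str:
        return "\n\n".join(f"# {t}" if k == "heading" else t for k, t in self.blocks)

    def to_text(self) -> str:
        return "\n\n".join(text for _, text in self.blocks)


def extract_html(html: str, source: str = "inline") -> SketchDocument:
    """Run the stages in order on an HTML string."""
    collector = _BlockCollector()
    collector.feed(html)
    collector.close()
    return SketchDocument(metadata={"source": source}, blocks=collector.blocks)
```

Keeping the document model separate from the output methods is what lets one extraction feed several formats without re-parsing the page.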
| Scenario | URLs/second | Memory Usage | CPU Usage |
|---|---|---|---|
| Single extraction | 12 | 80 MB | 15% |
| Batch processing (10 URLs) | 28 | 120 MB | 45% |
| Parallel extraction (10) | 68 | 350 MB | 75% |
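The parallel row in the table corresponds to running several extractions at once. Below is a minimal sketch of bounded parallel extraction with a thread pool; it does not reproduce the benchmark figures and is not the package's batch implementation. `extract_many` is a hypothetical helper, and the extraction function is passed in so the sketch stays independent of the library.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple, Union

Result = Union[str, Exception]


def extract_many(
    urls: Iterable[str],
    extract_one: Callable[[str], str],
    parallel_requests: int = 5,
) -> Dict[str, Result]:
    """Apply extract_one to every URL using at most parallel_requests threads.

    A failure is stored as the exception for that URL, so one bad page
    does not abort the rest of the batch.
    """

    def safe(url: str) -> Tuple[str, Result]:
        try:
            return url, extract_one(url)
        except Exception as exc:  # record the error and keep going
            return url, exc

    with ThreadPoolExecutor(max_workers=parallel_requests) as pool:
        # pool.map yields results in input order.
        return dict(pool.map(safe, urls))
```

In practice `extract_one` could wrap the Python API shown earlier, for example `lambda url: Extractor().extract_url(url).to_markdown()`, assuming a fresh `Extractor` per call is acceptable. Threads suit this workload because most of the time is spent waiting on the network.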
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Postlight Parser for the underlying parsing engine
- HTML2Text for HTML to text conversion