Extract clean, readable text from web pages via Mercury Web Parser.

WebDOM Extractor


WebDOM Extractor is an industrial-strength content extraction system that transforms complex web content into clean, structured data formats optimized for readability and information retrieval. Built on the Postlight Parser engine, WebDOM Extractor delivers pristine text extraction with enterprise-grade reliability, performance, and security.

Key Features

  • Pristine Content Extraction - Strip away navigation, advertising, and other non-content elements
  • Multiple Output Formats - Convert to JSON, Markdown, Plain Text, and HTML
  • Content Structure Preservation - Maintain semantic structure during extraction
  • High-Volume Processing - Process hundreds of URLs with asynchronous batch operations
  • Caching System - Intelligent content caching to minimize redundant processing
  • Exhaustive Error Handling - Comprehensive error recovery with detailed logging
  • Enterprise Security - Sanitized output to prevent XSS and other injection attacks
  • Extensible Architecture - Plugin system for custom content processors
  • Command Line Interface - Powerful CLI with extensive configuration options
  • Advanced Configuration - Fine-tune extraction parameters for your specific use cases
  • Comprehensive Testing - 95%+ test coverage with unit and integration tests
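
Among the features above, the caching system is worth a sketch. This README doesn't show its internals, so the following is only an illustrative TTL cache in plain Python; the names `TTLCache`, `get`, and `set` are assumptions, not WebDOM Extractor's API:

```python
import time

class TTLCache:
    """Illustrative time-based cache: entries expire after ttl seconds."""

    def __init__(self, ttl=86400):
        self.ttl = ttl
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic())

cache = TTLCache(ttl=60)
cache.set("https://example.com/article", "<cleaned text>")
print(cache.get("https://example.com/article"))
```

A production cache would also bound its size and persist entries across runs, but the expiry check above is the core idea behind the `cache_ttl` setting shown later in the configuration.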

Installation

Prerequisites

  • Python 3.7+
  • Node.js 12+
  • Postlight Parser

```shell
# Install Node.js dependencies
npm install -g @postlight/parser

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install Python package
pip install -e .
```

Quick Start

```python
from webdom_extractor import Extractor

# Extract content from URL
extractor = Extractor()
document = extractor.extract_url("https://example.com/article")

# Get content in different formats
json_data = document.to_json()
markdown = document.to_markdown()
plain_text = document.to_text()

# Save to file
document.save("output.md", format="markdown")
```
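
The `Extractor` internals are not shown in this README. Conceptually, content detection means keeping article text while skipping navigation, advertising, and script regions. A toy standard-library sketch of that idea (not the Postlight Parser algorithm, and `ContentExtractor` is a hypothetical name):

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Toy content detector: keeps <p> text, skips non-content regions."""
    SKIP = {"nav", "aside", "script", "style", "footer", "header"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # >0 while inside a skipped region
        self._in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "p" and self._skip_depth == 0:
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and self._skip_depth == 0:
            self.paragraphs[-1] += data

page = "<nav><p>Menu</p></nav><article><p>Real content.</p></article>"
parser = ContentExtractor()
parser.feed(page)
print(parser.paragraphs)  # ['Real content.']
```

Real extraction engines score candidate DOM regions by text density and link ratio rather than relying on tag names alone, but the skip/keep split above is the essence of "pristine content extraction".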

Command Line Usage

```shell
# Basic usage
webdom extract https://example.com/article

# Specify output format
webdom extract https://example.com/article --format markdown

# Output to file
webdom extract https://example.com/article --output article.md

# Batch processing from a file list
webdom batch url_list.txt --output-dir ./extracted_content

# With custom configuration
webdom extract https://example.com/article --config custom_config.json
```

Advanced Configuration

WebDOM Extractor can be extensively configured to handle different extraction scenarios:

```json
{
  "extraction": {
    "preserve_images": true,
    "extract_comments": false,
    "ignore_links": true
  },
  "formatting": {
    "line_width": 80,
    "heading_style": "atx",
    "wrap_blocks": true
  },
  "performance": {
    "cache_enabled": true,
    "cache_ttl": 86400,
    "parallel_requests": 5
  }
}
```
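
A common way to apply such a file is to overlay user values on built-in defaults, section by section. A sketch of that pattern (the defaults below are taken from this example config, not from WebDOM Extractor's source):

```python
import json

DEFAULTS = {
    "extraction": {"preserve_images": True, "extract_comments": False, "ignore_links": False},
    "formatting": {"line_width": 80, "heading_style": "atx", "wrap_blocks": True},
    "performance": {"cache_enabled": True, "cache_ttl": 86400, "parallel_requests": 5},
}

def load_config(text):
    """Parse a JSON config string and overlay it on the defaults."""
    user = json.loads(text)
    merged = {}
    for section, defaults in DEFAULTS.items():
        merged[section] = {**defaults, **user.get(section, {})}
    return merged

cfg = load_config('{"performance": {"parallel_requests": 10}}')
print(cfg["performance"]["parallel_requests"])  # 10 (overridden)
print(cfg["formatting"]["line_width"])          # 80 (default)
```

This way a `--config` file only needs to name the settings it changes.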

Enterprise Use Cases

WebDOM Extractor excels in enterprise contexts:

  • Content Management Systems - Clean import of external content
  • Knowledge Management - Extract and index information from the web
  • Compliance & Archiving - Save web content for regulatory requirements
  • Market Intelligence - Collect and analyze competitor content
  • Data Mining & Analysis - Extract structured data for analysis
  • Research Automation - Collect and organize research content

Architecture

WebDOM Extractor is built on a modular architecture:

```
┌─────────────────┐     ┌───────────────┐     ┌────────────────┐
│ Content Sources │────▶│ Extraction    │────▶│ Post-Processing│
│ - URLs          │     │ - HTML parsing│     │ - Formatting   │
│ - HTML files    │     │ - Content     │     │ - Sanitization │
│ - Web archives  │     │   detection   │     │ - Structure    │
└─────────────────┘     └───────────────┘     └────────────────┘
                                                       │
                                                       ▼
┌─────────────────┐     ┌───────────────┐     ┌────────────────┐
│ Applications    │◀────│ Output        │◀────│ Document Model │
│ - Analytics     │     │ - JSON        │     │ - Metadata     │
│ - Archiving     │     │ - Markdown    │     │ - Publishing   │
└─────────────────┘     └───────────────┘     └────────────────┘
```
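
The "Document Model" stage in the diagram holds metadata, content, and structure, and feeds the output formatters. As a rough illustration only (the real model's fields are not documented here), a minimal version might look like:

```python
import json
from dataclasses import dataclass, field

@dataclass
class Document:
    """Toy document model: metadata plus (level, text) content blocks."""
    title: str
    url: str
    blocks: list = field(default_factory=list)  # (heading level, text); level 0 = paragraph

    def to_json(self):
        return json.dumps({"title": self.title, "url": self.url, "blocks": self.blocks})

    def to_markdown(self):
        lines = [f"# {self.title}", ""]
        for level, text in self.blocks:
            lines.append(f"{'#' * level} {text}" if level else text)
        return "\n".join(lines)

    def to_text(self):
        return "\n".join(text for _, text in self.blocks)

doc = Document("Example", "https://example.com/article",
               blocks=[(2, "Intro"), (0, "First paragraph.")])
print(doc.to_markdown())
```

Keeping structure (heading levels, block order) in the model is what lets one extraction feed several output formats without re-parsing.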

Performance Benchmarks

| Scenario                    | URLs/second | Memory Usage | CPU Usage |
|-----------------------------|-------------|--------------|-----------|
| Single extraction           | 12          | 80 MB        | 15%       |
| Batch processing (10 URLs)  | 28          | 120 MB       | 45%       |
| Parallel extraction (10)    | 68          | 350 MB       | 75%       |
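
The parallel figures above come from bounding the number of in-flight requests, which is what the `parallel_requests` setting controls. A minimal sketch of that pattern with `asyncio`, using a short sleep in place of real network and parser work:

```python
import asyncio

async def extract(url, sem):
    """Placeholder for a real fetch+parse; the semaphore caps concurrency."""
    async with sem:
        await asyncio.sleep(0.01)  # simulate I/O-bound parser work
        return f"content of {url}"

async def extract_all(urls, parallel_requests=5):
    sem = asyncio.Semaphore(parallel_requests)
    return await asyncio.gather(*(extract(u, sem) for u in urls))

results = asyncio.run(extract_all([f"https://example.com/{i}" for i in range(10)]))
print(len(results))  # 10
```

The semaphore is why memory and CPU grow with the parallelism level: up to `parallel_requests` extractions are resident at once.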

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Acknowledgments
