Developing Extractors

Developing gallery-dl extractors to support new sites

Table of Contents

  1. Framework Architecture
  2. Basic Concepts
  3. Creating Your First Extractor
  4. Advanced Extractor Development
  5. Common Patterns and Examples
  6. Testing and Debugging
  7. Reference Documentation

Framework Architecture

Overview

Gallery-dl is designed to efficiently download media (primarily images and videos) from various websites. It follows a modular architecture where each website has its own specialized "extractor" component that understands the website's structure and can locate downloadable content from a given URL.

Key Components

  1. Extractors: Python classes that implement website-specific logic to identify and download content.
  2. Message System: Communication mechanism between extractors and the download engine.
  3. Configuration System: Customizes the behavior of extractors through user options.
  4. Download Engine: Handles actual downloading of content identified by extractors.
  5. Archive System: Tracks previously downloaded content to avoid duplicates.

How Extractors Fit In

Extractors are at the heart of the framework and serve as "connectors" between websites and the download engine:

[User Input URL] → [Extractor Selection] → [Appropriate Extractor]
  → [Content Extraction] → [Message Queue] → [Download Engine] → [File System]

Each extractor implements a specific pattern recognition mechanism through regular expressions, allowing the framework to route URLs to the appropriate extractor. The extractor then analyzes the page, identifies downloadable content, and passes this information to the download engine through a standardized message system.
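
For illustration, the extractor module exposes a find() function that performs exactly this routing, matching a URL against every registered pattern (the URL below is just an example):

from gallery_dl import extractor

# find() tries each registered extractor's 'pattern' against the URL and
# returns an instance of the first matching extractor class, or None
extr = extractor.find("https://www.flickr.com/photos/someuser/1234567890")
if extr is not None:
    print(extr.category, extr.subcategory)   # -> flickr image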


Basic Concepts

Extractor Class Hierarchy

Extractors are organized in a class hierarchy that promotes code reuse:

  • Extractor: Base class for all extractors, handles common functionality
    • GalleryExtractor: For image galleries with multiple images
      • ChapterExtractor: Specialized for manga/comic chapters
    • MangaExtractor: For manga series with multiple chapters
    • AsynchronousMixin: Adds asynchronous capability to extractors
    • BaseExtractor: For handling multiple domains with similar structures

This hierarchy allows specialized behavior while sharing common functionality.

Message System

The framework uses a message-based system to communicate between components:

  • Message.Directory: Specifies the destination directory for subsequent downloads
  • Message.Url: Indicates a resource to be downloaded
  • Message.Queue: Schedules another URL to be processed by a different extractor
  • Message.Version: Specifies the message protocol version

Extractors yield these messages to communicate with the download engine.

Configuration System

The framework employs a hierarchical configuration system:

  • Global default configuration
  • Category-specific configuration (e.g., for all "flickr" extractors)
  • Subcategory-specific configuration (e.g., for "flickr:user" extractors)
  • Instance-specific configuration (for a specific URL)

Extractors can access their configuration through the config() method.
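
For instance, inside any extractor method ("quality" and "videos" are hypothetical option names used for illustration):

def items(self):
    # config() looks values up per subcategory first, then category, then globally
    quality = self.config("quality", "original")
    include_videos = self.config("videos", True)
    
    if include_videos:
        self.log.debug("video downloads enabled (quality=%s)", quality)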

Basic Extractor Structure

A minimal extractor requires:

  1. A class that inherits from Extractor
  2. A pattern attribute with a regular expression to match URLs
  3. A category and subcategory designation
  4. An items() method that yields appropriate messages

For example:
from .common import Extractor, Message
from .. import text

class SimpleExtractor(Extractor):
    category = "example"
    subcategory = "simple"
    pattern = r"https?://example\.com/(\d+)"
    
    def items(self):
        yield Message.Version, 1
        
        # Get page content (a real extractor would parse this)
        page = self.request(self.url).text
        
        # Extract data; the image URL here is a hardcoded placeholder
        data = {"id": self.match.group(1)}
        image_url = "https://example.com/image.jpg"
        
        # Yield directory info
        yield Message.Directory, data
        
        # Yield image URL
        yield Message.Url, image_url, data

Creating Your First Extractor

Step 1: Setting Up

Before creating an extractor, understand what content you want to extract:

  1. Identify the website's URL pattern
  2. Determine how to access the content (direct HTML, API, etc.)
  3. Understand how the website organizes its content

Step 2: Creating the Basic Structure

Let's create a simple extractor for a fictional image hosting site "imagex.com":

# gallery_dl/extractor/imagex.py
from .common import Extractor, Message
from .. import text

class ImagexExtractor(Extractor):
    """Base class for imagex extractors"""
    category = "imagex"
    root = "https://imagex.com"

Step 3: Implementing a Single Image Extractor

class ImagexImageExtractor(ImagexExtractor):
    """Extractor for single images from imagex.com"""
    subcategory = "image"
    pattern = r"(?:https?://)?(?:www\.)?imagex\.com/image/([a-zA-Z0-9]+)"
    filename_fmt = "{category}_{id}.{extension}"
    archive_fmt = "{id}"
    
    def __init__(self, match):
        ImagexExtractor.__init__(self, match)
        self.image_id = match.group(1)
    
    def items(self):
        url = f"{self.root}/image/{self.image_id}"
        page = self.request(url).text
        
        # Extract image URL using text.extract function
        image_url = text.extract(page, '<img src="', '"')[0]
        
        # Prepare metadata
        data = {
            "id": self.image_id,
            "url": image_url,
        }
        text.nameext_from_url(image_url, data)
        
        yield Message.Directory, data
        yield Message.Url, image_url, data

Step 4: Implementing a Gallery Extractor

class ImagexGalleryExtractor(ImagexExtractor):
    """Extractor for image galleries from imagex.com"""
    subcategory = "gallery"
    directory_fmt = ("{category}", "{gallery_id} {title}")
    filename_fmt = "{category}_{gallery_id}_{num:>03}.{extension}"
    archive_fmt = "{gallery_id}_{id}"
    pattern = r"(?:https?://)?(?:www\.)?imagex\.com/gallery/([a-zA-Z0-9]+)"
    
    def __init__(self, match):
        ImagexExtractor.__init__(self, match)
        self.gallery_id = match.group(1)
    
    def items(self):
        url = f"{self.root}/gallery/{self.gallery_id}"
        page = self.request(url).text
        
        # Extract gallery title
        title = text.extract(page, '<h1>', '</h1>')[0]
        
        # Extract all image URLs
        gallery_data = {
            "gallery_id": self.gallery_id,
            "title": title or self.gallery_id,
        }
        
        yield Message.Directory, gallery_data
        
        # Find all image containers
        image_containers = text.extract_iter(page, '<div class="image-container">', '</div>')
        
        for num, container in enumerate(image_containers, 1):
            # Extract image URL and ID
            image_url = text.extract(container, 'src="', '"')[0]
            image_id = text.extract(container, 'data-id="', '"')[0]
            
            # Prepare image metadata
            data = {
                "gallery_id": self.gallery_id,
                "id": image_id,
                "num": num,
            }
            text.nameext_from_url(image_url, data)
            
            # Add gallery metadata
            data.update(gallery_data)
            
            yield Message.Url, image_url, data

Step 5: Adding to Framework

Add your extractor to the module list in gallery_dl/extractor/__init__.py:

# gallery_dl/extractor/__init__.py
modules = [
    # ...
    "imagex",
    # ...
]

Step 6: Testing

Test your extractor with a URL:

$ gallery-dl -v "https://imagex.com/gallery/abc123"
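
To inspect what your extractor produces without downloading anything, -K/--list-keywords prints the extracted metadata and -g/--get-urls prints the file URLs:

$ gallery-dl -K "https://imagex.com/gallery/abc123"
$ gallery-dl -g "https://imagex.com/gallery/abc123"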

Advanced Extractor Development

Working with Website APIs

Many websites offer APIs that provide data in structured formats like JSON. Here's an example using Flickr's API:

class FlickrAPIClient:
    """Minimal interface for the Flickr API"""
    
    API_URL = "https://api.flickr.com/services/rest/"
    API_KEY = "your_api_key"
    
    def __init__(self, extractor):
        self.extractor = extractor
        
    def photos_getInfo(self, photo_id):
        """Get information about a photo"""
        params = {
            "method": "flickr.photos.getInfo",
            "photo_id": photo_id,
            "api_key": self.API_KEY,
            "format": "json",
            "nojsoncallback": "1",
        }
        return self.extractor.request(self.API_URL, params=params).json()["photo"]
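
An extractor can then instantiate this client and call it from items(); a brief sketch, reusing the match-group handling from the earlier examples:

def items(self):
    api = FlickrAPIClient(self)
    photo = api.photos_getInfo(self.match.group(1))
    # 'photo' is the decoded JSON object; build metadata from it
    # and yield Message.Directory / Message.Url as usual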

Authentication and Session Handling

Some websites require authentication to access content:

# requires: from .. import exception
def login(self):
    """Log in and set the necessary cookies"""
    username, password = self._get_auth_info()
    if username:
        self.log.info("Logging in as %s", username)
        
        url = self.root + "/login"
        data = {
            "username": username,
            "password": password,
            "remember": "1",
        }
        
        response = self.request(url, method="POST", data=data)
        if not response.cookies.get("sessionid"):
            raise exception.AuthenticationError("Login failed")
        
        return True
    return False
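
The extractor should call login() before requesting any protected content, typically at the start of items(); a sketch (gallery_url is a hypothetical attribute):

def items(self):
    self.login()  # raises AuthenticationError if the credentials are rejected
    page = self.request(self.gallery_url).text
    # continue with normal extraction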

Handling Pagination

Many websites implement pagination for content spanning multiple pages:

def images(self, page):
    """Return all image URLs from a paginated gallery,
    re-requesting and following 'next' links until the last page"""
    url = self.gallery_url
    images = []
    page_num = 1
    
    while True:
        self.log.info("Downloading page %d", page_num)
        response = self.request(url)
        
        # Extract images from current page
        page_images = self._extract_images_from_page(response.text)
        images.extend(page_images)
        
        # Look for next page link
        next_url = text.extract(response.text, 'class="next" href="', '"')[0]
        if not next_url:
            return images
            
        url = self.root + next_url
        page_num += 1

Handling Lazy Loading

Some websites load images dynamically using JavaScript:

def _extract_images_from_page(self, page):
    """Extract both static and lazy-loaded images"""
    images = []
    
    # Extract static images
    for url in text.extract_iter(page, '<img src="', '"'):
        if "/placeholder.jpg" not in url:
            images.append(url)
    
    # Extract lazy-loaded images
    for url in text.extract_iter(page, 'data-src="', '"'):
        if url not in images:
            images.append(url)
            
    return images

Common Patterns and Examples

Example 1: Image Gallery Extractor (like Desktopography)

For websites primarily focused on image galleries:

class DesktopographyExhibitionExtractor(DesktopographyExtractor):
    """Extractor for a yearly desktopography exhibition"""
    subcategory = "exhibition"
    pattern = r"https?://desktopography\.net/exhibition-([^/?#]+)/"
    
    def __init__(self, match):
        DesktopographyExtractor.__init__(self, match)
        self.year = match.group(1)
    
    def items(self):
        url = "{}/exhibition-{}/".format(self.root, self.year)
        base_entry_url = "https://desktopography.net/portfolios/"
        page = self.request(url).text
        
        data = {
            "_extractor": DesktopographyEntryExtractor,
            "year": self.year,
        }
        
        for entry_url in text.extract_iter(
                page,
                '<a class="overlay-background" href="' + base_entry_url,
                '">'):
            
            url = base_entry_url + entry_url
            yield Message.Queue, url, data

Example 2: Blog Post Extractor (like Blogger)

For websites with blog posts containing images:

class BloggerPostExtractor(BloggerExtractor):
    """Extractor for a single blog post"""
    subcategory = "post"
    pattern = r"[\w-]+\.blogspot\.com(/\d\d\d\d/\d\d/[^/?#]+\.html)"
    
    def __init__(self, match):
        BloggerExtractor.__init__(self, match)
        self.path = match.group(match.lastindex)
    
    def posts(self, blog):
        return (self.api.post_by_path(blog["id"], self.path),)

Example 3: API-Based Extractor (like Flickr)

For websites with comprehensive APIs:

class FlickrImageExtractor(FlickrExtractor):
    """Extractor for individual images from flickr.com"""
    subcategory = "image"
    pattern = r"(?:https?://)?(?:www\.|secure\.|m\.)?flickr\.com/photos/[^/?#]+/(\d+)"
    
    def items(self):
        photo = self.api.photos_getInfo(self.item_id)
        
        self.api._extract_metadata(photo)
        if photo["media"] == "video" and self.api.videos:
            self.api._extract_video(photo)
        else:
            self.api._extract_photo(photo)
        
        photo["user"] = photo["owner"]
        photo["title"] = photo["title"]["_content"]
        photo["comments"] = text.parse_int(photo["comments"]["_content"])
        photo["description"] = photo["description"]["_content"]
        photo["date"] = text.parse_timestamp(photo["dateuploaded"])
        photo["id"] = text.parse_int(photo["id"])
        
        url = self._file_url(photo)
        yield Message.Directory, photo
        yield Message.Url, url, text.nameext_from_url(url, photo)

Example 4: File Hosting Extractor (like Catbox)

For simple file hosting websites:

class CatboxFileExtractor(Extractor):
    """Extractor for catbox files"""
    category = "catbox"
    subcategory = "file"
    archive_fmt = "{filename}"
    pattern = r"(?:https?://)?(?:files|litter|de)\.catbox\.moe/([^/?#]+)"
    
    def items(self):
        url = text.ensure_http_scheme(self.url)
        file = text.nameext_from_url(url, {"url": url})
        yield Message.Directory, file
        yield Message.Url, url, file

Testing and Debugging

Test Framework

The framework includes a test system to validate extractor functionality:

# test/results/imagex.py
# Import your extractor:
from gallery_dl.extractor import imagex

# Define test cases; "#"-prefixed keys describe the test itself,
# while plain keys assert values in the extracted metadata.
__tests__ = (
{
    "#url"  : "https://www.imagex.com/image/testimage2239",
    "#class": imagex.ImagexImageExtractor,

    "id"    : "testimage2239",
},
)

Common Issues and Solutions

1. URL Pattern Not Matching

Issue: Extractor not being recognized for a URL

Solution: Test your regex pattern separately:

import re
pattern = r"(?:https?://)?example\.com/gallery/(\d+)"
url = "https://example.com/gallery/123"
match = re.match(pattern, url)
print(bool(match), match.groups() if match else None)

2. Element Not Found

Issue: text.extract() finds nothing (its first return value is None or an empty string)

Solution: Print the page content to see actual structure:

def items(self):
    page = self.request(self.url).text
    with open("debug.html", "w", encoding="utf-8") as f:
        f.write(page)
    
    # Continue with extraction...

Debugging Techniques

  1. Enable Verbose Logging:

    $ gallery-dl -v URL
  2. Dump HTTP Responses (equivalent to setting the "write-pages" option):

    def __init__(self, match):
        Extractor.__init__(self, match)
        self._write_pages = True
  3. Examine Request Headers:

    def items(self):
        response = self.request(self.url)
        print("Request Headers:", response.request.headers)
        print("Response Headers:", response.headers)

Reference Documentation

Base Classes

Extractor

The base class for all extractors.

Attributes:

  • category: Site identifier (e.g., "flickr")
  • subcategory: Content type (e.g., "image", "gallery")
  • pattern: Regular expression to match URLs
  • directory_fmt: Format string for directory names
  • filename_fmt: Format string for file names
  • archive_fmt: Format string for archive entries

Methods:

  • items(): Yields messages for downloading
  • request(url, ...): Makes HTTP requests
  • config(key, default=None): Gets configuration values
  • log.info/debug/warning/error(...): Logging functions

GalleryExtractor

Base class for gallery extractors.

Methods:

  • metadata(page): Returns gallery metadata
  • images(page): Returns a list of image URLs and metadata
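
A subclass usually only needs to provide these two methods; a minimal sketch for a hypothetical site, where images() returns one (url, metadata) pair per file:

from .common import GalleryExtractor
from .. import text

class ExampleGalleryExtractor(GalleryExtractor):
    category = "example"
    pattern = r"(?:https?://)?example\.com/gallery/(\d+)"
    
    def metadata(self, page):
        # gallery-level metadata, used for directory names
        return {"title": text.extract(page, "<title>", "</title>")[0]}
    
    def images(self, page):
        # one (url, metadata) pair per image; metadata may be None
        return [(url, None)
                for url in text.extract_iter(page, '<img src="', '"')]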

ChapterExtractor

Specialized extractor for manga/comic chapters.

MangaExtractor

Extractor for manga series with multiple chapters.

Message Types

  • Message.Version: Protocol version identifier
  • Message.Directory: Directory information for subsequent files
  • Message.Url: URL to be downloaded
  • Message.Queue: URL to be processed by another extractor

Utility Functions

  • text.extract(text, start, end): Extracts text between two markers, returning a (value, position) tuple
  • text.extract_iter(text, start, end): Iterates over all matches
  • text.nameext_from_url(url, data=None): Extracts filename and extension
  • text.parse_int(string, default=0): Converts string to integer
  • text.parse_timestamp(string): Converts timestamp string to datetime
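
A quick illustration of the return values (note the (value, position) tuple from text.extract(), which is why the examples above index it with [0]):

from gallery_dl import text

page = '<h1>Title</h1><img src="https://example.com/img/01.jpg">'

title, pos = text.extract(page, "<h1>", "</h1>")
print(title)                                  # Title

data = text.nameext_from_url("https://example.com/img/01.jpg")
print(data["filename"], data["extension"])    # 01 jpg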

Configuration Options

Common configuration options for extractors include:

  • username / password: Login credentials
  • cookies: Cookies for authenticated sessions
  • retries: Number of times to retry failed requests
  • sleep-request: Time to wait between requests
  • timeout: Request timeout in seconds
  • proxy: Proxy server to use
  • verify: Whether to verify SSL certificates
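
These options live in the user's configuration file (e.g. gallery-dl.conf) and resolve through the hierarchy described under Basic Concepts; a sketch for the fictional imagex extractor:

{
    "extractor": {
        "retries": 4,
        "imagex": {
            "username": "someuser",
            "password": "somepass",
            "sleep-request": 1.0
        }
    }
}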