Developing Extractors
- Framework Architecture
- Basic Concepts
- Creating Your First Extractor
- Advanced Extractor Development
- Common Patterns and Examples
- Testing and Debugging
- Reference Documentation
Gallery-dl is designed to efficiently download media (primarily images and videos) from various websites. It follows a modular architecture where each website has its own specialized "extractor" component that understands the website's structure and can locate downloadable content from a given URL.
- Extractors: Python classes that implement website-specific logic to identify and download content.
- Message System: Communication mechanism between extractors and the download engine.
- Configuration System: Customizes the behavior of extractors through user options.
- Download Engine: Handles actual downloading of content identified by extractors.
- Archive System: Tracks previously downloaded content to avoid duplicates.
Extractors are at the heart of the framework and serve as "connectors" between websites and the download engine:
[User Input URL] → [Extractor Selection] → [Appropriate Extractor] →
[Content Extraction] → [Message Queue] → [Download Engine] → [File System]
Each extractor implements a specific pattern recognition mechanism through regular expressions, allowing the framework to route URLs to the appropriate extractor. The extractor then analyzes the page, identifies downloadable content, and passes this information to the download engine through a standardized message system.
Extractors are organized in a class hierarchy that promotes code reuse:
- Extractor: Base class for all extractors; handles common functionality
  - GalleryExtractor: For image galleries with multiple images
    - ChapterExtractor: Specialized for manga/comic chapters
  - MangaExtractor: For manga series with multiple chapters
  - BaseExtractor: For handling multiple domains with similar structures
- AsynchronousMixin: Adds asynchronous capability to extractors
This hierarchy allows specialized behavior while sharing common functionality.
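For example, a gallery extractor built on GalleryExtractor only needs to supply the site-specific pieces, while the base class drives the request/extract/yield cycle. A minimal sketch for a hypothetical site (the class name, URL, and HTML markers are assumptions for illustration):

from .common import GalleryExtractor
from .. import text

class ExampleGalleryExtractor(GalleryExtractor):
    """Hypothetical gallery extractor for example.com"""
    category = "example"
    pattern = r"(?:https?://)?example\.com(/gallery/\d+)"

    def __init__(self, match):
        # Pass the full page URL so the base class can fetch it
        url = "https://example.com" + match.group(1)
        GalleryExtractor.__init__(self, match, url)

    def metadata(self, page):
        # Metadata shared by every file in this gallery
        return {"title": text.extract(page, "<h1>", "</h1>")[0]}

    def images(self, page):
        # Iterable of (image_url, metadata) pairs; the base class yields them
        return [
            (url, None)
            for url in text.extract_iter(page, '<img src="', '"')
        ]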
The framework uses a message-based system to communicate between components:
- Message.Directory: Specifies the destination directory for subsequent downloads
- Message.Url: Indicates a resource to be downloaded
- Message.Queue: Schedules another URL to be processed by a different extractor
- Message.Version: Specifies the message protocol version
Extractors yield these messages to communicate with the download engine.
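In practice, an items() generator first announces a target directory, then yields one Message.Url per file. A minimal sketch (the URL and metadata are placeholders):

def items(self):
    data = {"title": "My Gallery"}     # metadata shared by the files below
    yield Message.Directory, data      # set the destination directory
    yield Message.Url, "https://example.com/file.jpg", data   # one file to download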
The framework employs a hierarchical configuration system:
- Global default configuration
- Category-specific configuration (e.g., for all "flickr" extractors)
- Subcategory-specific configuration (e.g., for "flickr:user" extractors)
- Instance-specific configuration (for a specific URL)
Extractors can access their configuration through the config() method.
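For example, an extractor can read a user option with a fallback default in a single call (the option names here are illustrative):

def items(self):
    # config() resolves the hierarchy described above and returns the
    # first matching value, or the given default if the user set nothing
    videos = self.config("videos", True)
    quality = self.config("quality", "original")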
A minimal extractor requires:
- A class that inherits from Extractor
- A pattern attribute with a regular expression to match URLs
- A category and subcategory designation
- An items() method that yields appropriate messages
from .common import Extractor, Message
from .. import text

class SimpleExtractor(Extractor):
    category = "example"
    subcategory = "simple"
    pattern = r"https?://example\.com/(\d+)"

    def items(self):
        yield Message.Version, 1

        # Get page content
        page = self.request(self.url).text

        # Extract data
        data = {"id": self.match.group(1)}
        image_url = "https://example.com/image.jpg"

        # Yield directory info
        yield Message.Directory, data
        # Yield image URL
        yield Message.Url, image_url, data
Before creating an extractor, understand what content you want to extract:
- Identify the website's URL pattern
- Determine how to access the content (direct HTML, API, etc.)
- Understand how the website organizes its content
Let's create a simple extractor for a fictional image hosting site "imagex.com":
# gallery_dl/extractor/imagex.py
from .common import Extractor, Message
from .. import text


class ImagexExtractor(Extractor):
    """Base class for imagex extractors"""
    category = "imagex"
    root = "https://imagex.com"


class ImagexImageExtractor(ImagexExtractor):
    """Extractor for single images from imagex.com"""
    subcategory = "image"
    pattern = r"(?:https?://)?(?:www\.)?imagex\.com/image/([a-zA-Z0-9]+)"
    filename_fmt = "{category}_{id}.{extension}"
    archive_fmt = "{id}"

    def __init__(self, match):
        ImagexExtractor.__init__(self, match)
        self.image_id = match.group(1)

    def items(self):
        url = f"{self.root}/image/{self.image_id}"
        page = self.request(url).text

        # Extract the image URL using the text.extract helper
        image_url = text.extract(page, '<img src="', '"')[0]

        # Prepare metadata
        data = {
            "id": self.image_id,
            "url": image_url,
        }
        text.nameext_from_url(image_url, data)

        yield Message.Directory, data
        yield Message.Url, image_url, data


class ImagexGalleryExtractor(ImagexExtractor):
    """Extractor for image galleries from imagex.com"""
    subcategory = "gallery"
    directory_fmt = ("{category}", "{gallery_id} {title}")
    filename_fmt = "{category}_{gallery_id}_{num:>03}.{extension}"
    archive_fmt = "{gallery_id}_{id}"
    pattern = r"(?:https?://)?(?:www\.)?imagex\.com/gallery/([a-zA-Z0-9]+)"

    def __init__(self, match):
        ImagexExtractor.__init__(self, match)
        self.gallery_id = match.group(1)

    def items(self):
        url = f"{self.root}/gallery/{self.gallery_id}"
        page = self.request(url).text

        # Extract the gallery title
        title = text.extract(page, '<h1>', '</h1>')[0]

        gallery_data = {
            "gallery_id": self.gallery_id,
            "title": title or self.gallery_id,
        }
        yield Message.Directory, gallery_data

        # Find all image containers
        image_containers = text.extract_iter(
            page, '<div class="image-container">', '</div>')
        for num, container in enumerate(image_containers, 1):
            # Extract image URL and ID
            image_url = text.extract(container, 'src="', '"')[0]
            image_id = text.extract(container, 'data-id="', '"')[0]

            # Prepare per-image metadata
            data = {
                "gallery_id": self.gallery_id,
                "id": image_id,
                "num": num,
            }
            text.nameext_from_url(image_url, data)
            # Add gallery metadata
            data.update(gallery_data)
            yield Message.Url, image_url, data
Add your extractor to the module list in gallery_dl/extractor/__init__.py:
# gallery_dl/extractor/__init__.py
modules = [
    # ...
    "imagex",
    # ...
]
Test your extractor with a URL:
$ gallery-dl -v "https://imagex.com/gallery/abc123"
Many websites offer APIs that provide data in structured formats like JSON. Here's an example using Flickr's API:
class FlickrAPIClient:
    """Minimal interface for the Flickr API"""
    API_URL = "https://api.flickr.com/services/rest/"
    API_KEY = "your_api_key"

    def __init__(self, extractor):
        self.extractor = extractor

    def photos_getInfo(self, photo_id):
        """Get information about a photo"""
        params = {
            "method": "flickr.photos.getInfo",
            "photo_id": photo_id,
            "api_key": self.API_KEY,
            "format": "json",
            "nojsoncallback": "1",
        }
        return self.extractor.request(
            self.API_URL, params=params).json()["photo"]
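A common way to wire such a client into extractors is to instantiate it once in __init__ and call it from items(). A sketch (the base class and captured photo ID shown here are assumptions for illustration):

class FlickrExtractor(Extractor):
    """Base class for flickr extractors"""
    category = "flickr"
    root = "https://www.flickr.com"

    def __init__(self, match):
        Extractor.__init__(self, match)
        # The client reuses the extractor's request() and its HTTP session
        self.api = FlickrAPIClient(self)
        self.item_id = match.group(1)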
Some websites require authentication to access content:
def login(self):
    """Log in and set the necessary cookies"""
    # Requires 'from .. import exception' at module level
    username, password = self._get_auth_info()
    if username:
        self.log.info("Logging in as %s", username)
        url = self.root + "/login"
        data = {
            "username": username,
            "password": password,
            "remember": "1",
        }
        response = self.request(url, method="POST", data=data)
        if not response.cookies.get("sessionid"):
            raise exception.AuthenticationError("Login failed")
        return True
    return False
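Extractors typically call login() at the start of items() so that every later request() carries the session cookies; _get_auth_info() is the base-class helper that reads the username/password options from the configuration. A sketch (gallery_url is a placeholder):

def items(self):
    self.login()  # no-op without credentials; raises AuthenticationError on failure
    page = self.request(self.gallery_url).text
    # ... continue with normal extraction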
Many websites implement pagination for content spanning multiple pages:
def images(self, page):
    """Return all image URLs from a paginated gallery"""
    url = self.gallery_url
    images = []
    page_num = 1

    while True:
        self.log.info("Downloading page %d", page_num)
        response = self.request(url)

        # Extract images from the current page
        page_images = self._extract_images_from_page(response.text)
        images.extend(page_images)

        # Look for a link to the next page
        next_url = text.extract(response.text, 'class="next" href="', '"')[0]
        if not next_url:
            return images
        url = self.root + next_url
        page_num += 1
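The same loop works for JSON APIs, where pagination is usually a page or offset parameter instead of a "next" link. A sketch against a hypothetical endpoint whose responses contain "posts" and a "next_page" flag:

def _pagination(self, endpoint, params):
    """Yield all posts from a hypothetical paginated JSON API"""
    params["page"] = 1
    while True:
        data = self.request(endpoint, params=params).json()
        yield from data["posts"]
        if not data.get("next_page"):   # assumed end-of-results marker
            return
        params["page"] += 1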
Some websites load images dynamically using JavaScript:
def _extract_images_from_page(self, page):
    """Extract both static and lazy-loaded images"""
    images = []

    # Extract static images, skipping placeholders
    for url in text.extract_iter(page, '<img src="', '"'):
        if "/placeholder.jpg" not in url:
            images.append(url)

    # Extract lazy-loaded images
    for url in text.extract_iter(page, 'data-src="', '"'):
        if url not in images:
            images.append(url)

    return images
For websites primarily focused on image galleries:
class DesktopographyExhibitionExtractor(DesktopographyExtractor):
    """Extractor for a yearly desktopography exhibition"""
    subcategory = "exhibition"
    pattern = r"https?://desktopography\.net/exhibition-([^/?#]+)/"

    def __init__(self, match):
        DesktopographyExtractor.__init__(self, match)
        self.year = match.group(1)

    def items(self):
        url = "{}/exhibition-{}/".format(self.root, self.year)
        base_entry_url = "https://desktopography.net/portfolios/"
        page = self.request(url).text

        data = {
            "_extractor": DesktopographyEntryExtractor,
            "year": self.year,
        }

        for entry_url in text.extract_iter(
                page,
                '<a class="overlay-background" href="' + base_entry_url,
                '">'):
            url = base_entry_url + entry_url
            yield Message.Queue, url, data
For websites with blog posts containing images:
class BloggerPostExtractor(BloggerExtractor):
    """Extractor for a single blog post"""
    subcategory = "post"
    pattern = r"[\w-]+\.blogspot\.com(/\d\d\d\d/\d\d/[^/?#]+\.html)"

    def __init__(self, match):
        BloggerExtractor.__init__(self, match)
        self.path = match.group(match.lastindex)

    def posts(self, blog):
        return (self.api.post_by_path(blog["id"], self.path),)
For websites with comprehensive APIs:
class FlickrImageExtractor(FlickrExtractor):
    """Extractor for individual images from flickr.com"""
    subcategory = "image"
    pattern = r"(?:https?://)?(?:www\.|secure\.|m\.)?flickr\.com/photos/[^/?#]+/(\d+)"

    def items(self):
        photo = self.api.photos_getInfo(self.item_id)
        self.api._extract_metadata(photo)

        if photo["media"] == "video" and self.api.videos:
            self.api._extract_video(photo)
        else:
            self.api._extract_photo(photo)

        photo["user"] = photo["owner"]
        photo["title"] = photo["title"]["_content"]
        photo["comments"] = text.parse_int(photo["comments"]["_content"])
        photo["description"] = photo["description"]["_content"]
        photo["date"] = text.parse_timestamp(photo["dateuploaded"])
        photo["id"] = text.parse_int(photo["id"])

        url = self._file_url(photo)
        yield Message.Directory, photo
        yield Message.Url, url, text.nameext_from_url(url, photo)
For simple file hosting websites:
class CatboxFileExtractor(Extractor):
    """Extractor for catbox files"""
    category = "catbox"
    subcategory = "file"
    archive_fmt = "{filename}"
    pattern = r"(?:https?://)?(?:files|litter|de)\.catbox\.moe/([^/?#]+)"

    def items(self):
        url = text.ensure_http_scheme(self.url)
        file = text.nameext_from_url(url, {"url": url})
        yield Message.Directory, file
        yield Message.Url, url, file
The framework includes a test system to validate extractor functionality:
# test/results/imagex.py
# Import your extractor:
from gallery_dl.extractor import imagex

# Define test cases and the expected outcome:
{
    "#url": "https://www.imagex.com/image/testimage2239",
    "#id" : "testimage2239",
},
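Once the result file exists, run the result tests for your module. The exact invocation may differ between versions, but passing the category name to the test script is a common pattern:

$ python test/test_results.py imagex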
Issue: Extractor not being recognized for a URL
Solution: Test your regex pattern separately:
import re
pattern = r"(?:https?://)?example\.com/gallery/(\d+)"
url = "https://example.com/gallery/123"
match = re.match(pattern, url)
print(bool(match), match.groups() if match else None)
Issue: text.extract() returns None or an empty string
Solution: Save the page content to a file and inspect its actual structure:
def items(self):
    page = self.request(self.url).text
    with open("debug.html", "w", encoding="utf-8") as f:
        f.write(page)
    # Continue with extraction...
- Enable Verbose Logging:

  $ gallery-dl -v URL

- Dump HTTP Responses:

  def __init__(self, match):
      Extractor.__init__(self, match)
      self._write_pages = True

- Examine Request Headers:

  def items(self):
      response = self.request(self.url)
      print("Request Headers:", response.request.headers)
      print("Response Headers:", response.headers)
Extractor: the base class for all extractors.
Attributes:
- category: Site identifier (e.g., "flickr")
- subcategory: Content type (e.g., "image", "gallery")
- pattern: Regular expression to match URLs
- directory_fmt: Format string for directory names
- filename_fmt: Format string for file names
- archive_fmt: Format string for archive entries
Methods:
- items(): Yields messages for downloading
- request(url, ...): Makes HTTP requests
- config(key, default=None): Gets configuration values
- log.info/debug/warning/error(...): Logging functions
GalleryExtractor: the base class for gallery extractors.
Methods:
- metadata(page): Returns gallery metadata
- images(page): Returns a list of image URLs and metadata
ChapterExtractor: specialized extractor for manga/comic chapters.
MangaExtractor: extractor for manga series with multiple chapters.
- Message.Version: Protocol version identifier
- Message.Directory: Directory information for subsequent files
- Message.Url: URL to be downloaded
- Message.Queue: URL to be processed by another extractor
- text.extract(text, start, end): Extracts text between markers
- text.extract_iter(text, start, end): Iterates over all matches
- text.nameext_from_url(url, data=None): Extracts filename and extension
- text.parse_int(string, default=0): Converts a string to an integer
- text.parse_timestamp(string): Converts a timestamp string to a datetime
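To illustrate the return values (note that text.extract() returns a (value, position) tuple, which is why the examples above index it with [0]):

from gallery_dl import text

page = '<a href="https://example.com/img/photo.jpg">photo</a>'

url, pos = text.extract(page, 'href="', '"')
# url -> "https://example.com/img/photo.jpg"

info = text.nameext_from_url(url, {})
# sets "filename" -> "photo" and "extension" -> "jpg"

text.parse_int("42")    # -> 42
text.parse_int("oops")  # -> 0 (the default)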
Common configuration options for extractors include:
- username / password: Login credentials
- cookies: Cookies for authenticated sessions
- retries: Number of times to retry failed requests
- sleep-request: Time to wait between requests
- timeout: Request timeout in seconds
- proxy: Proxy server to use
- verify: Whether to verify SSL certificates
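These options are set per category (or per subcategory) in the user's JSON configuration file; for example, in gallery-dl.conf (the "imagex" category and the values shown are illustrative):

{
    "extractor": {
        "imagex": {
            "username": "your-username",
            "password": "your-password",
            "sleep-request": 1.0,
            "retries": 4
        }
    }
}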