diff --git a/README.md b/README.md
index 5c8b7be..2f64f68 100644
--- a/README.md
+++ b/README.md
@@ -1,21 +1,26 @@
-# Documentation Crawler and Converter v.0.3
+# Documentation Crawler and Converter v1.0.0
 
 This tool crawls a documentation website and converts the pages into a single Markdown document. It intelligently removes common sections that appear across multiple pages to avoid duplication, including them once at the end of the document.
 
+**Version 1.0.0** introduces significant improvements, including support for JavaScript-rendered pages using Playwright and a fully asynchronous implementation.
+
 ## Features
 
+- **JavaScript Rendering**: Utilizes Playwright to accurately render pages that rely on JavaScript, ensuring complete and up-to-date content capture.
 - Crawls documentation websites and combines pages into a single Markdown file.
-- Removes common sections that appear across many pages, including them once at the beginning.
-- Customizable threshold for similarity.
+- Removes common sections that appear across many pages, including them once at the end of the document.
+- Customizable threshold for similarity to control deduplication sensitivity.
 - Configurable selectors to remove specific elements from pages.
 - Supports robots.txt compliance with an option to ignore it.
-- **NEW in v0.3.3**: Ability to skip URLs based on ignore-paths both pre-fetch (before requesting content) and post-fetch (after redirects).
+- **New in v1.0.0**:
+  - JavaScript rendering that waits for the page to stabilize before scraping.
+  - Asynchronous operation: fully asynchronous methods enhance performance and scalability during the crawling process.
 
 ## Installation
 
 ### Prerequisites
 
-- **Python 3.6 or higher** is required.
+- **Python 3.7 or higher** is required.
 - (Optional) It is recommended to use a virtual environment to avoid dependency conflicts with other projects.
 
 ### 1. Installing the Package with `pip`
@@ -49,11 +54,13 @@ It is recommended to use a virtual environment to isolate the package and its de
 2. **Activate the virtual environment**:
 
    - On **macOS/Linux**:
+
     ```bash
     source venv/bin/activate
     ```
 
    - On **Windows**:
+
     ```bash
     .\venv\Scripts\activate
     ```
@@ -66,7 +73,17 @@ It is recommended to use a virtual environment to isolate the package and its de
 
 This ensures that all dependencies are installed within the virtual environment.
 
-### 4. Installing from PyPI
+### 4. Installing Playwright Browsers
+
+After installing the package, you need to install the necessary Playwright browser binaries:
+
+```bash
+playwright install
+```
+
+This command downloads the required browser binaries (Chromium, Firefox, and WebKit) used by Playwright for rendering pages.
+
+### 5. Installing from PyPI
 
 Once the package is published on PyPI, you can install it directly using:
 
@@ -74,7 +91,7 @@ Once the package is published on PyPI, you can install it directly using:
 ```bash
 pip install libcrawler
 ```
 
-### 5. Upgrading the Package
+### 6. Upgrading the Package
 
 To upgrade the package to the latest version, use:
@@ -84,7 +101,7 @@ pip install --upgrade libcrawler
 ```
 
 This will upgrade the package to the newest version available.
 
-### 6. Verifying the Installation
+### 7. Verifying the Installation
 
 You can verify that the package has been installed correctly by running:
@@ -102,7 +119,7 @@ crawl-docs BASE_URL STARTING_POINT [OPTIONS]
 
 ### Arguments
 
-- `BASE_URL`: The base URL of the documentation site (e.g., https://example.com).
+- `BASE_URL`: The base URL of the documentation site (e.g., _https://example.com_).
 - `STARTING_POINT`: The starting path of the documentation (e.g., /docs/).
 
 ### Optional Arguments
@@ -117,16 +134,18 @@ crawl-docs BASE_URL STARTING_POINT [OPTIONS]
 - `--ignore-paths PATH [PATH ...]`: List of URL paths to skip during crawling, either before or after fetching content.
 - `--user-agent USER_AGENT`: Specify a custom User-Agent string (which will be harmonized with any additional headers).
 - `--headers-file FILE`: Path to a JSON file containing optional headers. Only one of `--headers-file` or `--headers-json` can be used.
-- `--headers-json JSON` (JSON string): Optional headers as JSON
+- `--headers-json JSON` (JSON string): Optional headers as JSON.
 
 ### Examples
 
 #### Basic Usage
+
 ```bash
 crawl-docs https://example.com /docs/ -o output.md
 ```
 
 #### Adjusting Thresholds
+
 ```bash
 crawl-docs https://example.com /docs/ -o output.md \
     --similarity-threshold 0.7 \
@@ -134,12 +153,14 @@ crawl-docs https://example.com /docs/ -o output.md \
 ```
 
 #### Specifying Extra Selectors to Remove
+
 ```bash
 crawl-docs https://example.com /docs/ -o output.md \
     --remove-selectors ".sidebar" ".ad-banner"
 ```
 
 #### Limiting to Specific Paths
+
 ```bash
 crawl-docs https://example.com / -o output.md \
     --allowed-paths "/docs/" "/api/"
@@ -148,24 +169,61 @@ crawl-docs https://example.com / -o output.md \
 #### Skipping URLs with Ignore Paths
 
 ```bash
-Copiar código
 crawl-docs https://example.com /docs/ -o output.md \
     --ignore-paths "/old/" "/legacy/"
 ```
 
-### Dependencies
+## Dependencies
+
+- **Python 3.7 or higher**
+- [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for HTML parsing.
+- [markdownify](https://github.com/matthewwithanm/python-markdownify) for converting HTML to Markdown.
+- [Playwright](https://playwright.dev/python/docs/intro) for headless browser automation and JavaScript rendering.
+- [aiofiles](https://github.com/Tinche/aiofiles) for asynchronous file operations.
+- Additional dependencies are listed in `requirements.txt`.
+
+### Installing Dependencies
 
-- Python 3.6 or higher
-- BeautifulSoup4
-- datasketch
-- requests
-- markdownify
+After setting up your environment, install all required dependencies using:
 
-Install dependencies using:
 ```bash
 pip install -r requirements.txt
 ```
 
+**Note**: Ensure you have installed the Playwright browsers by running `playwright install` as mentioned in the Installation section.
+
 ## License
 
-This project is licensed under the LGPLv3.
+This project is licensed under the LGPLv3. See the [LICENSE](LICENSE) file for details.
+
+## Contributing
+
+Contributions are welcome! Please follow these steps to contribute:
+
+1. **Fork the repository** on GitHub.
+2. **Clone your fork** to your local machine:
+   ```bash
+   git clone https://github.com/your-username/libcrawler.git
+   ```
+3. **Create a new branch** for your feature or bugfix:
+   ```bash
+   git checkout -b feature-name
+   ```
+4. **Make your changes** and **commit** them with clear messages:
+   ```bash
+   git commit -m "Add feature X"
+   ```
+5. **Push** your changes to your fork:
+   ```bash
+   git push origin feature-name
+   ```
+6. **Open a Pull Request** on the original repository describing your changes.
+
+Please ensure your code adheres to the project's coding standards and includes appropriate tests.
+
+## Acknowledgements
+
+- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for HTML parsing.
+- [Playwright](https://playwright.dev/) for headless browser automation.
+- [Markdownify](https://github.com/matthewwithanm/python-markdownify) for converting HTML to Markdown.
+- [aiofiles](https://github.com/Tinche/aiofiles) for asynchronous file operations.
diff --git a/pyproject.toml b/pyproject.toml
index b723fb4..5a77665 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -8,7 +8,7 @@ description = "A tool to crawl documentation and convert to Markdown."
 authors = [
     { name="Robert Collins", email="roberto.tomas.cuentas@gmail.com" }
 ]
-requires-python = ">=3.6"
+requires-python = ">=3.7"
 classifiers = [
     "Programming Language :: Python :: 3",
     "License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)",
diff --git a/requirements.txt b/requirements.txt
index d747608..91fca33 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,4 +1,6 @@
+aiofiles~=24.1.0
 beautifulsoup4~=4.12.3
 datasketch~=1.6.5
 markdownify~=0.13.1
+playwright~=1.49.1
 Requests~=2.32.3
\ No newline at end of file
diff --git a/src/libcrawler/__init__.py b/src/libcrawler/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/src/libcrawler/__main__.py b/src/libcrawler/__main__.py
index 3592711..b169be1 100644
--- a/src/libcrawler/__main__.py
+++ b/src/libcrawler/__main__.py
@@ -1,3 +1,4 @@
+import asyncio
 import argparse
 import json
 from urllib.parse import urljoin
@@ -18,6 +19,7 @@ def main():
                         help='Delay between requests in seconds.')
     parser.add_argument('--delay-range', type=float, default=0.5,
                         help='Range for random delay variation.')
+    parser.add_argument('--interval', type=int, help='Time step used while waiting for the DOM to stabilize, in milliseconds (default: 1000 ms).')
     parser.add_argument('--remove-selectors', nargs='*',
                         help='Additional CSS selectors to remove from pages.')
     parser.add_argument('--similarity-threshold', type=float, default=0.6,
@@ -55,7 +57,7 @@ def main():
     start_url = urljoin(args.base_url, args.starting_point)
 
     # Adjust crawl_and_convert call to handle ignore-paths and optional headers
-    crawl_and_convert(
+    asyncio.run(crawl_and_convert(
         start_url=start_url,
         base_url=args.base_url,
         output_filename=args.output,
@@ -68,8 +70,9 @@ def main():
         similarity_threshold=args.similarity_threshold,
         allowed_paths=args.allowed_paths,
+        interval=args.interval,
         ignore_paths=args.ignore_paths  # Pass the ignore-paths argument
-    )
+    ))
 
 
 if __name__ == '__main__':
-    main()
\ No newline at end of file
+    main()
diff --git a/src/libcrawler/libcrawler.py b/src/libcrawler/libcrawler.py
index ef80850..74e84d5 100644
--- a/src/libcrawler/libcrawler.py
+++ b/src/libcrawler/libcrawler.py
@@ -5,14 +5,15 @@
 from .version import __version__
 
+import aiofiles
 from bs4 import BeautifulSoup
 from collections import defaultdict
 from difflib import SequenceMatcher
 import logging
 from markdownify import markdownify as md
+from playwright.async_api import async_playwright
 import random
-import requests
-import time
+import asyncio
 from urllib.parse import urljoin, urlparse, urlunparse
 from urllib.robotparser import RobotFileParser
 
 
@@ -24,6 +25,7 @@
 
 common_selectors = ['header', 'footer', 'nav', '.nav', '.navbar', '.footer']
 
+
 class PageNode:
     """Represents a page in the documentation tree."""
 
@@ -53,17 +55,45 @@ def normalize_url(url):
     return normalized_url
 
 
-def fetch_content(url, user_agent=None, headers={}):
-    """Fetches HTML content from a URL, following redirects."""
+async def wait_for_stable_dom(page, timeout=10000, interval=None):
+    """Waits for the DOM to stabilize using MutationObserver."""
+    if interval is None:
+        interval = 1000
+    await page.evaluate(f"""
+        new Promise(resolve => {{
+            const observer
= new MutationObserver((mutations, obs) => {{ + if (document.readyState === 'complete') {{ + obs.disconnect(); // Stop observing + resolve(); + }} + }}); + observer.observe(document.body, {{ childList: true, subtree: true }}); + setTimeout(resolve, {timeout}); // Fallback timeout + }}); + """) + await asyncio.sleep(interval / 1000) # Allow additional time if necessary + + +async def fetch_content(url, user_agent=None, headers=None, interval=None): + """Fetches HTML content from a URL using Playwright, following redirects.""" + if headers is None: + headers = {} # Harmonize user-agent with headers if user_agent: - headers.setdefault('User-Agent', user_agent) - + headers['User-Agent'] = user_agent + try: - response = requests.get(url, headers=headers) - response.raise_for_status() - return response.text, response.url # Return the final redirected URL - except requests.exceptions.RequestException as e: + async with async_playwright() as p: + browser = await p.chromium.launch() + context = await browser.new_context(user_agent=user_agent, extra_http_headers=headers) + page = await context.new_page() + await page.goto(url, wait_until='domcontentloaded') + await wait_for_stable_dom(page, interval=interval) # Wait for the DOM to stabilize + content = await page.content() + final_url = page.url + await browser.close() + return content, final_url + except Exception as e: logger.error(f"Failed to fetch {url}: {e}") return None, None @@ -138,8 +168,11 @@ def remove_common_elements(soup, extra_remove_selectors=None): return soup -def build_tree(start_url, base_url, user_agent='*', handle_robots_txt=True, - headers={}, delay=1, delay_range=0.5, extra_remove_selectors=None, allowed_paths=None, ignore_paths=None): +async def build_tree(start_url, base_url, user_agent='*', handle_robots_txt=True, + headers=None, delay=1, delay_range=0.5, interval=None, + extra_remove_selectors=None, allowed_paths=None, ignore_paths=None): + if headers is None: + headers = {} visited_links = set() root = PageNode(start_url) node_lookup = {} @@ -178,7 +211,7 @@ def build_tree(start_url, base_url, user_agent='*', handle_robots_txt=True, continue logger.info(f'Processing {current_link}') - page_content, page_url = fetch_content(current_node.url, headers=headers) + page_content, page_url = await fetch_content(current_node.url, headers=headers, interval=interval) if not page_content or (page_url and any(ignore_path in page_url for ignore_path in ignore_paths)): continue @@ -230,7 +263,7 @@ def build_tree(start_url, base_url, user_agent='*', handle_robots_txt=True, queue.append(child_node) actual_delay = random.uniform(delay - delay_range, delay + delay_range) - time.sleep(actual_delay) + await asyncio.sleep(actual_delay) # Replaced time.sleep with await asyncio.sleep for url, anchor in url_to_anchor.items(): for key, content in page_markdowns.items(): @@ -360,22 +393,23 @@ def traverse_and_build_markdown(unique_content, common_content, url_to_anchor): return final_markdown -def crawl_and_convert( +async def crawl_and_convert( start_url, base_url, output_filename, user_agent='*', handle_robots_txt=True, - headers={}, + headers=None, delay=1, delay_range=0.5, + interval=None, extra_remove_selectors=None, similarity_threshold=0.8, allowed_paths=None, ignore_paths=None ): # Build the tree and get page_markdowns and url_to_anchor - page_markdowns, url_to_anchor = build_tree( + page_markdowns, url_to_anchor = await build_tree( start_url=start_url, base_url=base_url, user_agent=user_agent, @@ -383,6 +417,7 @@ def crawl_and_convert( 
headers=headers, delay=delay, delay_range=delay_range, + interval=interval, extra_remove_selectors=extra_remove_selectors, allowed_paths=allowed_paths, ignore_paths=ignore_paths @@ -395,6 +430,5 @@ def crawl_and_convert( final_markdown = traverse_and_build_markdown(unique_content, common_content, url_to_anchor) # Save to file - with open(output_filename, 'w', encoding='utf-8') as f: - f.write(final_markdown) - + async with aiofiles.open(output_filename, 'w', encoding='utf-8') as f: # Use aiofiles for async file operations + await f.write(final_markdown) diff --git a/src/libcrawler/version.py b/src/libcrawler/version.py index fef1a0d..0df0cc8 100644 --- a/src/libcrawler/version.py +++ b/src/libcrawler/version.py @@ -1,2 +1,2 @@ -__version_info__ = ('0', '3', '3') +__version_info__ = ('1', '0', '0') __version__ = '.'.join(__version_info__) diff --git a/src/tests/test_crawler.py b/src/tests/test_crawler.py index bf11fd7..15270b1 100644 --- a/src/tests/test_crawler.py +++ b/src/tests/test_crawler.py @@ -1,9 +1,10 @@ +import asyncio from bs4 import BeautifulSoup import os import logging -import requests +from playwright.async_api import async_playwright import unittest -from unittest.mock import patch, Mock +from unittest.mock import patch, AsyncMock, Mock from urllib.parse import urljoin __package__ = '' @@ -17,81 +18,136 @@ # Disable logging during tests logging.disable(logging.CRITICAL) + class TestFetchContent(unittest.TestCase): - @patch('src.libcrawler.libcrawler.requests.get') - def test_fetch_content_success(self, mock_get): - # Set up the mock response - mock_response = Mock() - mock_response.status_code = 200 - mock_response.text = 'Test content' - mock_response.url = 'http://example.com/test' - mock_get.return_value = mock_response - - # Call the function - content, url = fetch_content('http://example.com/test') - - # Assertions + + @patch('src.libcrawler.libcrawler.async_playwright') + def test_fetch_content_success(self, mock_playwright): + # Mock the async_playwright context + mock_playwright_instance = AsyncMock() + mock_browser = AsyncMock() + mock_context = AsyncMock() + mock_page = AsyncMock() + + # Configure the mock chain + mock_playwright.return_value.__aenter__.return_value = mock_playwright_instance + mock_playwright_instance.chromium.launch.return_value = mock_browser + mock_browser.new_context.return_value = mock_context + mock_context.new_page.return_value = mock_page + + # Mock page content and URL + mock_page.content.return_value = 'Test content' + mock_page.url = 'http://example.com/test' + + # Run the fetch_content function asynchronously + content, url = asyncio.run(fetch_content('http://example.com/test')) + + # Assertions to verify the Playwright API calls + mock_playwright_instance.chromium.launch.assert_awaited_once() + mock_browser.new_context.assert_awaited_once_with(user_agent=None, extra_http_headers={}) + mock_context.new_page.assert_awaited_once() + mock_page.goto.assert_awaited_once_with('http://example.com/test', wait_until='domcontentloaded') + mock_page.content.assert_awaited_once() + + # Assertions for the function output self.assertEqual(content, 'Test content') self.assertEqual(url, 'http://example.com/test') - # Ensure requests.get was called without headers - mock_get.assert_called_with('http://example.com/test', headers={}) + @patch('src.libcrawler.libcrawler.async_playwright') + def test_fetch_content_with_headers(self, mock_playwright): + # Mock the async_playwright context + mock_playwright_instance = AsyncMock() + mock_browser = AsyncMock() 
+ mock_context = AsyncMock() + mock_page = AsyncMock() - @patch('src.libcrawler.libcrawler.requests.get') - def test_fetch_content_with_headers(self, mock_get): - # Set up the mock response - mock_response = Mock() - mock_response.status_code = 200 - mock_response.text = 'Test content with headers' - mock_response.url = 'http://example.com/test' - mock_get.return_value = mock_response + # Configure the mock chain + mock_playwright.return_value.__aenter__.return_value = mock_playwright_instance + mock_playwright_instance.chromium.launch.return_value = mock_browser + mock_browser.new_context.return_value = mock_context + mock_context.new_page.return_value = mock_page + + # Mock page content and URL + mock_page.content.return_value = 'Test content with headers' + mock_page.url = 'http://example.com/test' # Define headers headers = {'User-Agent': 'test-agent'} - # Call the function with headers - content, url = fetch_content('http://example.com/test', headers=headers) + # Run the fetch_content function asynchronously + content, url = asyncio.run(fetch_content('http://example.com/test', headers=headers)) + + # Assertions to verify the Playwright API calls + mock_playwright_instance.chromium.launch.assert_awaited_once() + mock_browser.new_context.assert_awaited_once_with(user_agent=None, extra_http_headers=headers) + mock_context.new_page.assert_awaited_once() + mock_page.goto.assert_awaited_once_with('http://example.com/test', wait_until='domcontentloaded') + mock_page.content.assert_awaited_once() - # Assertions + # Assertions for the function output self.assertEqual(content, 'Test content with headers') self.assertEqual(url, 'http://example.com/test') - # Ensure requests.get was called with the headers - mock_get.assert_called_with('http://example.com/test', headers=headers) + @patch('src.libcrawler.libcrawler.async_playwright') + def test_fetch_content_failure(self, mock_playwright): + # Mock the async_playwright context + mock_playwright_instance = AsyncMock() + mock_browser = AsyncMock() + mock_context = AsyncMock() + mock_page = AsyncMock() - @patch('src.libcrawler.libcrawler.requests.get') - def test_fetch_content_failure(self, mock_get): - # Set up the mock response to raise an exception - mock_get.side_effect = requests.exceptions.RequestException('Error') + # Configure the mock chain + mock_playwright.return_value.__aenter__.return_value = mock_playwright_instance + mock_playwright_instance.chromium.launch.return_value = mock_browser + mock_browser.new_context.return_value = mock_context + mock_context.new_page.return_value = mock_page - # Call the function - content, url = fetch_content('http://example.com/test') + # Simulate a failure + mock_page.goto.side_effect = Exception('Error') - # Assertions + # Run the fetch_content function asynchronously + content, url = asyncio.run(fetch_content('http://example.com/test')) + + # Assertions for the function output self.assertIsNone(content) self.assertIsNone(url) - @patch('src.libcrawler.libcrawler.requests.get') - def test_user_agent_harmonization(self, mock_get): - # Mock response setup - mock_response = Mock() - mock_response.status_code = 200 - mock_response.text = 'Test content with headers and user-agent' - mock_get.return_value = mock_response + @patch('src.libcrawler.libcrawler.async_playwright') + def test_user_agent_harmonization(self, mock_playwright): + # Mock the async_playwright context + mock_playwright_instance = AsyncMock() + mock_browser = AsyncMock() + mock_context = AsyncMock() + mock_page = AsyncMock() + + # Configure 
the mock chain + mock_playwright.return_value.__aenter__.return_value = mock_playwright_instance + mock_playwright_instance.chromium.launch.return_value = mock_browser + mock_browser.new_context.return_value = mock_context + mock_context.new_page.return_value = mock_page + + # Mock page content and URL + mock_page.content.return_value = 'Test content with headers and user-agent' + mock_page.url = 'http://example.com/test' # Headers without user-agent headers = {'Accept': 'text/html'} user_agent = 'test-agent' - # Call the function with user-agent and headers - content, url = fetch_content('http://example.com/test', user_agent=user_agent, headers=headers) + # Run the fetch_content function asynchronously + content, url = asyncio.run(fetch_content('http://example.com/test', user_agent=user_agent, headers=headers)) - # Ensure the user-agent is added to the headers - expected_headers = {'Accept': 'text/html', 'User-Agent': 'test-agent'} - mock_get.assert_called_with('http://example.com/test', headers=expected_headers) + # Assertions to verify the Playwright API calls + mock_playwright_instance.chromium.launch.assert_awaited_once() + mock_browser.new_context.assert_awaited_once_with(user_agent=user_agent, extra_http_headers=headers) + mock_context.new_page.assert_awaited_once() + mock_page.goto.assert_awaited_once_with('http://example.com/test', wait_until='domcontentloaded') + mock_page.content.assert_awaited_once() - # Assert content is fetched correctly + # Assertions for the function output self.assertEqual(content, 'Test content with headers and user-agent') + self.assertEqual(url, 'http://example.com/test') + class TestIgnorePaths(unittest.TestCase): def setUp(self): @@ -253,7 +309,7 @@ class TestBuildTree(unittest.TestCase): @patch('src.libcrawler.libcrawler.fetch_content') def test_build_tree(self, mock_fetch_content): # Mock fetch_content to return predefined HTML content - def side_effect(url, headers={}): + def side_effect(url, headers={}, interval=None): if url == 'http://example.com/start': html = ''' @@ -280,7 +336,7 @@ def side_effect(url, headers={}): mock_fetch_content.side_effect = side_effect # Run build_tree - page_markdowns, url_to_anchor = build_tree( + page_markdowns, _url_to_anchor = asyncio.run(build_tree( start_url='http://example.com/start', base_url='http://example.com', handle_robots_txt=False, @@ -288,7 +344,7 @@ def side_effect(url, headers={}): delay_range=0, allowed_paths=None, headers={} - ) + )) # Check that page_markdowns contains two entries self.assertIn('http://example.com/start', page_markdowns) @@ -301,7 +357,7 @@ def side_effect(url, headers={}): @patch('src.libcrawler.libcrawler.fetch_content') def test_build_tree_with_headers(self, mock_fetch_content): # Mock fetch_content to return predefined HTML content - def side_effect(url, headers={}): + def side_effect(url, headers={}, interval=None): if url == 'http://example.com/start': html = ''' @@ -330,7 +386,7 @@ def side_effect(url, headers={}): headers = {'User-Agent': 'test-agent'} # Run build_tree with headers - page_markdowns, url_to_anchor = build_tree( + _page_markdowns, _url_to_anchor = asyncio.run(build_tree( start_url='http://example.com/start', base_url='http://example.com', handle_robots_txt=False, @@ -338,7 +394,7 @@ def side_effect(url, headers={}): delay_range=0, allowed_paths=None, headers=headers - ) + )) # Check that fetch_content was called with correct headers calls = mock_fetch_content.call_args_list @@ -498,7 +554,7 @@ def tearDown(self): 
@patch('src.libcrawler.libcrawler.fetch_content') def test_crawl_and_convert(self, mock_fetch_content): # Define side effect for fetch_content - def side_effect(url, headers={}): + def side_effect(url, headers={}, interval=None): normalized_url = normalize_url(url) if normalized_url == normalize_url(self.start_url): return self.html_start, url @@ -519,21 +575,23 @@ def side_effect(url, headers={}): headers = {'User-Agent': 'test-agent'} # Run the crawler with appropriate similarity threshold - crawl_and_convert( - start_url=self.start_url, - base_url=self.base_url, - output_filename=self.output_filename, - delay=0, - delay_range=0, - extra_remove_selectors=['header', 'footer', '.footer'], - similarity_threshold=0.6, # Increased threshold - headers=headers + asyncio.run( + crawl_and_convert( + start_url=self.start_url, + base_url=self.base_url, + output_filename=self.output_filename, + delay=0, + delay_range=0, + extra_remove_selectors=['header', 'footer', '.footer'], + similarity_threshold=0.6, # Increased threshold + headers=headers + ) ) # Check that fetch_content was called with headers calls = mock_fetch_content.call_args_list for call in calls: - args, kwargs = call + _args, kwargs = call self.assertIn('headers', kwargs) self.assertEqual(kwargs['headers'], headers)
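For reference, a minimal sketch of driving the new asynchronous API directly from Python instead of through the `crawl-docs` entry point. The import path (`libcrawler.libcrawler`) and every argument value below are assumptions for illustration only; the keyword arguments mirror the `crawl_and_convert` signature in the patch above.

```python
import asyncio

# Import path assumes the installed layout src/libcrawler/libcrawler.py -> libcrawler.libcrawler
from libcrawler.libcrawler import crawl_and_convert


async def main():
    # All keyword arguments mirror crawl_and_convert's signature; the concrete
    # values here are illustrative placeholders, not recommendations.
    await crawl_and_convert(
        start_url='https://example.com/docs/',
        base_url='https://example.com',
        output_filename='output.md',
        user_agent='*',
        handle_robots_txt=True,
        headers={'Accept-Language': 'en'},
        delay=1,
        delay_range=0.5,
        interval=1500,               # extra settle time for wait_for_stable_dom, in ms
        extra_remove_selectors=['.sidebar', '.ad-banner'],
        similarity_threshold=0.6,
        allowed_paths=['/docs/'],
        ignore_paths=['/legacy/'],
    )


if __name__ == '__main__':
    asyncio.run(main())
```

This is essentially what the CLI does itself: `__main__.py` wraps the same coroutine in `asyncio.run`.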
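A similar hedged sketch for fetching a single JavaScript-rendered page with the Playwright-backed `fetch_content` helper; the URL, User-Agent, and header values are placeholders, and the import path is again an assumption.

```python
import asyncio

from libcrawler.libcrawler import fetch_content  # import path is an assumption


async def fetch_one():
    # fetch_content launches Chromium, waits for the DOM to stabilize,
    # and returns (html, final_url) after following any redirects;
    # on failure it returns (None, None).
    html, final_url = await fetch_content(
        'https://example.com/docs/intro/',      # placeholder URL
        user_agent='docs-crawler-example/1.0',  # placeholder User-Agent
        headers={'Accept-Language': 'en'},
        interval=2000,                          # settle time in ms; None falls back to 1000
    )
    if html is None:
        print('fetch failed')
    else:
        print(final_url, len(html))


asyncio.run(fetch_one())
```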
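`wait_for_stable_dom` resolves once `document.readyState` reports `complete` (or when its fallback timeout fires). For pages that keep mutating after load, a mutation-quiescence wait is a common alternative; the sketch below is one possible expression of that idea with Playwright's async API and is not part of the library.

```python
async def wait_for_quiet_dom(page, quiet_ms=500, timeout_ms=10000):
    """Resolve once no DOM mutations have been observed for quiet_ms milliseconds."""
    await page.evaluate(
        """([quietMs, timeoutMs]) => new Promise(resolve => {
            let timer = setTimeout(done, quietMs);
            const observer = new MutationObserver(() => {
                clearTimeout(timer);              // activity seen: restart the quiet window
                timer = setTimeout(done, quietMs);
            });
            function done() {
                observer.disconnect();
                resolve();
            }
            observer.observe(document.documentElement,
                             { childList: true, subtree: true, attributes: true });
            setTimeout(done, timeoutMs);          // hard upper bound
        })""",
        [quiet_ms, timeout_ms],
    )
```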