feat: add new curl impersonate HTTP client (#387)
### Description

- Add a new curl impersonate HTTP client utilizing the [curl-cffi](https://pypi.org/project/curl-cffi/) package.
- Improve the API docs of the HTTP clients and define public & private interfaces.
- I encountered a few bugs or not-great behaviour in `curl-cffi` and opened issues:
  - lexiforest/curl_cffi#360
  - lexiforest/curl_cffi#361
- Because of the above bugs, I decided not to set the curl impersonate client as the default and to stay with the HTTPX client.
- I also had to move some general components from the `basic_crawler` module to the root of the package. Maybe it's not ideal, so I am open to other options on how to sort it out.

### Issues

- Closes: #292

### Testing

- New unit tests were written.
- Alternatively, check the example below utilizing the curl impersonate client with the BeautifulSoup crawler.

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.http_clients import CurlImpersonateHttpClient
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://username:password@proxy.apify.com:8000',
        ],
    )

    crawler = BeautifulSoupCrawler(
        max_requests_per_crawl=10,
        proxy_configuration=proxy_configuration,
        http_client=CurlImpersonateHttpClient(),
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}...')
        await context.enqueue_links()

        data = {
            'url': context.request.url,
            'title': context.soup.title.text if context.soup.title else '',
        }
        context.log.info(f'Extracted data: {data}')
        await context.push_data(data)

    await crawler.run(['https://apify.com', 'https://crawlee.dev/'])
    crawler.log.info('Finished crawling.')


if __name__ == '__main__':
    asyncio.run(main())
```

### TODO

- [ ] Before merging, add better documentation of the HTTP clients, the option of switching between them, and how to implement a new one.

### Checklist

- [x] CI passed
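Since the PR keeps HTTPX as the default and treats the curl impersonate client as an optional extra, a caller might want to pick a client based on whether the optional dependency is installed. The sketch below is a hypothetical helper (not part of the crawlee API); the function name and the string labels are assumptions for illustration only.

```python
# Hypothetical helper: decide which HTTP client to use depending on whether
# the optional curl-cffi dependency is importable. Mirrors the PR's decision
# to fall back to HTTPX when the 'curl-impersonate' extra is not installed.
from importlib.util import find_spec


def pick_http_client(curl_module: str = 'curl_cffi') -> str:
    """Return 'curl-impersonate' if the optional dependency is available, else 'httpx'."""
    return 'curl-impersonate' if find_spec(curl_module) is not None else 'httpx'


# Falls back to HTTPX when the optional package is missing:
print(pick_http_client('definitely_not_an_installed_package_xyz'))  # → httpx
```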
Showing 36 changed files with 936 additions and 526 deletions.
```diff
@@ -1,6 +1,5 @@
 from .basic_crawler import BasicCrawler, BasicCrawlerOptions
 from .context_pipeline import ContextPipeline
 from .router import Router
-from .types import BasicCrawlingContext
 
-__all__ = ['BasicCrawler', 'BasicCrawlerOptions', 'ContextPipeline', 'Router', 'BasicCrawlingContext']
+__all__ = ['BasicCrawler', 'BasicCrawlerOptions', 'ContextPipeline', 'Router']
```
```diff
@@ -1,4 +1,12 @@
-from .base_http_client import BaseHttpClient, HttpCrawlingResult, HttpResponse
-from .httpx_client import HttpxClient
+from .base import BaseHttpClient, HttpCrawlingResult, HttpResponse
+from .httpx import HttpxHttpClient
 
-__all__ = ['BaseHttpClient', 'HttpCrawlingResult', 'HttpResponse', 'HttpxClient']
+try:
+    from .curl_impersonate import CurlImpersonateHttpClient
+except ImportError as exc:
+    raise ImportError(
+        "To import anything from this subpackage, you need to install the 'curl-impersonate' extra."
+        "For example, if you use pip, run `pip install 'crawlee[curl-impersonate]'`.",
+    ) from exc
+
+__all__ = ['BaseHttpClient', 'CurlImpersonateHttpClient', 'HttpCrawlingResult', 'HttpResponse', 'HttpxHttpClient']
```
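The guard in this diff is an instance of a general optional-dependency pattern: attempt the import and re-raise `ImportError` with an actionable install hint. A minimal, standalone sketch of that pattern (the helper name is an assumption, not crawlee code):

```python
import importlib


def load_optional(module_name: str, extra: str):
    """Import an optional module, re-raising ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original error so the real cause stays visible in the traceback.
        raise ImportError(
            f"To use this feature, install the '{extra}' extra, e.g. "
            f"`pip install 'crawlee[{extra}]'`."
        ) from exc
```

Chaining with `raise ... from exc` preserves the underlying failure (e.g. a missing transitive dependency) while surfacing a user-facing remedy first.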