# [BUG] Disabled proxy resulting in general RequestsError #361
Currently, curl-cffi only has one general error for all curl errors. Adding support for mapping curl errors to distinct Python exceptions would be a nice addition. For now, the way to detect specific errors is via the curl error code. However, for this particular error, curl does not distinguish the proxy failure by code, so you have to do a string match in addition to the code check. An example from our project: https://github.com/yt-dlp/yt-dlp/blob/master/yt_dlp/networking/_curlcffi.py#L241
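A minimal sketch of the code-plus-string-match pattern described above. `RequestsError` here is a stand-in class for `curl_cffi.requests.RequestsError` (the real exception also carries the curl error code), so the example stays self-contained; the helper name `is_proxy_error` is illustrative, not part of any library API.

```python
class RequestsError(Exception):
    """Stand-in for curl_cffi.requests.RequestsError, which carries a curl error code."""

    def __init__(self, message: str, code: int) -> None:
        super().__init__(message)
        self.code = code


CURLE_RECV_ERROR = 56  # curl's generic "failure receiving network data" code


def is_proxy_error(exc: RequestsError) -> bool:
    # The failed CONNECT tunnel surfaces under the generic CURLE_RECV_ERROR code,
    # so a string match on the message is needed in addition to the code check.
    return exc.code == CURLE_RECV_ERROR and 'CONNECT' in str(exc)


err = RequestsError('CONNECT tunnel failed, response 400', CURLE_RECV_ERROR)
print(is_proxy_error(err))  # True
```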
Progress is tracked in #250.
### Description

- Add a new curl impersonate HTTP client utilizing the [curl-cffi](https://pypi.org/project/curl-cffi/) package.
- Improve API docs of the HTTP clients and define public & private interfaces.
- I encountered a few bugs or "not great behaviour" of `curl-cffi`, opening issues:
    - lexiforest/curl_cffi#360
    - lexiforest/curl_cffi#361
- Because of the above bugs, I decided not to set the curl impersonate client as the default and to stay with the HTTPX client.
- I also had to move some general components from the `basic_crawler` module to the root of the package. Maybe it's not good, so I am open to other options on how to sort it out.

### Issues

- Closes: #292

### Testing

- New unit tests were written.
- Or check the example below utilizing the curl impersonate client with the BeautifulSoup crawler.

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.http_clients import CurlImpersonateHttpClient
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://username:password@proxy.apify.com:8000',
        ],
    )

    crawler = BeautifulSoupCrawler(
        max_requests_per_crawl=10,
        proxy_configuration=proxy_configuration,
        http_client=CurlImpersonateHttpClient(),
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}...')
        await context.enqueue_links()

        data = {
            'url': context.request.url,
            'title': context.soup.title.text if context.soup.title else '',
        }

        context.log.info(f'Extracted data: {data}')
        await context.push_data(data)

    await crawler.run(['https://apify.com', 'https://crawlee.dev/'])
    crawler.log.info('Finished crawling.')


if __name__ == '__main__':
    asyncio.run(main())
```

### TODO

- [ ] Before merging, add better documentation of the HTTP clients, the option of switching them, and implementing a new one.

### Checklist

- [x] CI passed
### Describe the bug

`RequestsError` exception with "CONNECT tunnel failed, response 400".

### To Reproduce

Resulting in:

### Expected behavior

Example in HTTPX:

Results in:
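A hypothetical sketch of what the expected behavior could look like: mapping the generic error into a distinct `ProxyError` a caller can catch, analogous to `httpx.ProxyError`. The class names and the `classify` helper are illustrative only, not curl_cffi's actual API (that improvement is tracked in #250).

```python
class RequestsError(Exception):
    """Illustrative stand-in for the single generic error curl-cffi raises today."""

    def __init__(self, message: str, code: int) -> None:
        super().__init__(message)
        self.code = code


class ProxyError(RequestsError):
    """A distinct error callers could catch, as httpx.ProxyError allows."""


def classify(message: str, code: int) -> RequestsError:
    # curl reports the failed CONNECT tunnel under the generic
    # CURLE_RECV_ERROR (56) code, so the message must be inspected too.
    if code == 56 and 'CONNECT' in message:
        return ProxyError(message, code)
    return RequestsError(message, code)


err = classify('CONNECT tunnel failed, response 400', 56)
print(type(err).__name__)  # ProxyError
```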
### Versions

- curl_cffi version: 0.7.1
- proxy.py version: 2.4.4
- httpx version: 0.27.0