
[BUG] Disabled proxy resulting in general RequestsError 400 #361

Closed
vdusek opened this issue Aug 1, 2024 · 2 comments · Fixed by #250
Labels: bug

vdusek commented Aug 1, 2024

Describe the bug

  • When the proxy I'm using is disabled, the request fails with the general RequestsError exception: "CONNECT tunnel failed, response 400".
  • This is problematic because there is no way to tell from the exception type that the error is related to my proxy.

To Reproduce

import asyncio
from curl_cffi.requests import AsyncSession
from proxy import Proxy  # proxy.py package

async def main():
    session = AsyncSession()

    # Start a local proxy.py instance with HTTP proxying disabled;
    # it then rejects curl's CONNECT request with a 400 response.
    with Proxy(
        [
            '--hostname',
            '127.0.0.1',
            '--port',
            '8899',
            '--basic-auth',
            'username:password',
            '--disable-http-proxy',
        ]
    ):
        response = await session.request(
            'get',
            'https://httpbin.org/get',
            proxy='http://username:password@127.0.0.1:8899',
        )
        print(f'status_code: {response.status_code}')
        print(f'content: {response.content[:1000].decode()}')

if __name__ == '__main__':
    asyncio.run(main())

Resulting in:

$ python run_curl_cffi_bug.py 
2024-08-01 16:45:33,542 - pid:312119 [I] plugins.load:89 - Loaded plugin proxy.http.proxy.auth.AuthPlugin
Traceback (most recent call last):
  File "/home/vdusek/Projects/crawlee-py/.venv/lib64/python3.12/site-packages/curl_cffi/requests/session.py", line 1264, in request
    await task
curl_cffi.curl.CurlError: Failed to perform, curl: (56) CONNECT tunnel failed, response 400. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/vdusek/Projects/crawlee-py/run_curl_cffi_bug.py", line 31, in <module>
    asyncio.run(main())
  File "/usr/lib64/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/vdusek/Projects/crawlee-py/run_curl_cffi_bug.py", line 21, in main
    response = await session.request(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vdusek/Projects/crawlee-py/.venv/lib64/python3.12/site-packages/curl_cffi/requests/session.py", line 1268, in request
    raise RequestsError(str(e), e.code, rsp) from e
curl_cffi.requests.errors.RequestsError: Failed to perform, curl: (56) CONNECT tunnel failed, response 400. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.

Expected behavior

  • A proxy-specific exception should be raised instead of the generic RequestsError; HTTPX could serve as an inspiration (see the sketch after the HTTPX example below).

Example in HTTPX:

import asyncio
from httpx import AsyncClient
from proxy import Proxy

async def main():
    with Proxy(
        [
            '--hostname',
            '127.0.0.1',
            '--port',
            '8899',
            '--basic-auth',
            'username:password',
            '--disable-http-proxy',
        ]
    ):
        async with AsyncClient(proxy='http://username:password@127.0.0.1:8899') as client:
            response = await client.get(url='https://httpbin.org/get')
            print(f'status_code: {response.status_code}')
            print(f'content: {response.read()[:1000].decode()}')

if __name__ == '__main__':
    asyncio.run(main())

Results in:

httpx.ProxyError: 400 BAD REQUEST
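
For illustration, a minimal sketch of what such a mapping could look like on top of the current public API. `ProxyError` and `request_or_raise` are hypothetical names (not part of curl_cffi), and the detection relies on the curl code 56 plus the "CONNECT tunnel failed" message seen in the traceback above:

```python
# Hypothetical sketch only -- ProxyError and request_or_raise are NOT part
# of curl_cffi; they illustrate the kind of exception mapping requested here.
from curl_cffi.requests import AsyncSession
from curl_cffi.requests.errors import RequestsError


class ProxyError(RequestsError):
    """Raised when a request fails because of the configured proxy."""


async def request_or_raise(session: AsyncSession, method: str, url: str, **kwargs):
    try:
        return await session.request(method, url, **kwargs)
    except RequestsError as e:
        # curl reports this failure only as error 56 (CURLE_RECV_ERROR) with a
        # "CONNECT tunnel failed" message, so a string match is needed as well.
        if e.code == 56 and 'CONNECT tunnel failed' in str(e):
            raise ProxyError(str(e), e.code) from e
        raise
```

A proper fix would live inside curl_cffi itself, but raising a dedicated subclass like this would make the proxy failure catchable without string matching in user code.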

Versions

  • curl_cffi version: 0.7.1
  • proxy.py version: 2.4.4
  • httpx version: 0.27.0

coletdjnz (Contributor) commented

Currently, curl-cffi only has one general error for all curl errors. Adding support for mapping curl errors into Python errors would be a nice addition.

For now, the way to detect specific errors is to check the curl error code. However, for this particular error curl does not appear to distinguish the proxy failure with a dedicated code, so you have to do a string match in addition to the code check.

An example from our project: https://github.com/yt-dlp/yt-dlp/blob/master/yt_dlp/networking/_curlcffi.py#L241
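
A minimal caller-side sketch of that check, using only the curl_cffi API already shown in the traceback above (`fetch_via_proxy` is a made-up helper for illustration):

```python
from curl_cffi.requests import AsyncSession
from curl_cffi.requests.errors import RequestsError


async def fetch_via_proxy(url: str, proxy: str) -> bytes:
    """Made-up helper: fetch a URL through a proxy and surface proxy failures."""
    async with AsyncSession() as session:
        try:
            response = await session.request('get', url, proxy=proxy)
        except RequestsError as e:
            # CURLE_RECV_ERROR (56) plus the "CONNECT tunnel failed" message is
            # how this proxy failure currently surfaces, hence the string match.
            if e.code == 56 and 'CONNECT tunnel failed' in str(e):
                raise RuntimeError(f'Proxy {proxy} rejected the CONNECT request: {e}') from e
            raise
        return response.content
```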

perklet (Collaborator) commented Aug 2, 2024

Progress is tracked in #250.

vdusek added a commit to apify/crawlee-python that referenced this issue Aug 5, 2024
### Description

- Add a new curl impersonate HTTP client utilizing the
[curl-cffi](https://pypi.org/project/curl-cffi/) package.
- Improve API docs of the HTTP clients and define public & private
interfaces.
- I encountered a few bugs and rough edges in `curl-cffi` and opened
issues for them:
  - lexiforest/curl_cffi#360
  - lexiforest/curl_cffi#361
- Because of the above bugs, I decided not to make the curl impersonate
client the default and to keep the HTTPX client as the default.
- I also had to move some general components from the `basic_crawler`
module to the root of the package. This may not be ideal, so I am open
to other suggestions on how to organize it.

### Issues

- Closes: #292

### Testing

- New unit tests were written.
- Alternatively, check the example below, which uses the curl impersonate
client with the BeautifulSoup crawler.

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.http_clients import CurlImpersonateHttpClient
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://username:password@proxy.apify.com:8000',
        ],
    )

    crawler = BeautifulSoupCrawler(
        max_requests_per_crawl=10,
        proxy_configuration=proxy_configuration,
        http_client=CurlImpersonateHttpClient(),
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}...')
        await context.enqueue_links()

        data = {
            'url': context.request.url,
            'title': context.soup.title.text if context.soup.title else '',
        }

        context.log.info(f'Extracted data: {data}')
        await context.push_data(data)

    await crawler.run(['https://apify.com', 'https://crawlee.dev/'])
    crawler.log.info('Finished crawling.')


if __name__ == '__main__':
    asyncio.run(main())
```

### TODO

- [ ] Before merging, add better documentation of the HTTP clients, how to
switch between them, and how to implement a new one.

### Checklist

- [x] CI passed