
[BUG] Disabled proxy resulting in general RequestsError 400 #361

Closed
vdusek opened this issue Aug 1, 2024 · 2 comments · Fixed by #250
Labels: bug

vdusek commented Aug 1, 2024

Describe the bug

  • When the proxy I'm using is disabled, the request fails with the general RequestsError exception: "CONNECT tunnel failed, response 400".
  • This is problematic because there is no way to tell from the exception type that the error is related to my proxy.

To Reproduce

import asyncio
from curl_cffi.requests import AsyncSession
from proxy import Proxy  # proxy.py package

async def main():
    session = AsyncSession()

    # Start a local proxy.py instance with HTTP proxying disabled;
    # it then rejects curl's CONNECT request with a 400 response.
    with Proxy(
        [
            '--hostname',
            '127.0.0.1',
            '--port',
            '8899',
            '--basic-auth',
            'username:password',
            '--disable-http-proxy',
        ]
    ):
        response = await session.request(
            'get',
            'https://httpbin.org/get',
            proxy='http://username:password@127.0.0.1:8899',
        )
        print(f'status_code: {response.status_code}')
        print(f'content: {response.content[:1000].decode()}')

if __name__ == '__main__':
    asyncio.run(main())

Resulting in:

$ python run_curl_cffi_bug.py 
2024-08-01 16:45:33,542 - pid:312119 [I] plugins.load:89 - Loaded plugin proxy.http.proxy.auth.AuthPlugin
Traceback (most recent call last):
  File "/home/vdusek/Projects/crawlee-py/.venv/lib64/python3.12/site-packages/curl_cffi/requests/session.py", line 1264, in request
    await task
curl_cffi.curl.CurlError: Failed to perform, curl: (56) CONNECT tunnel failed, response 400. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/vdusek/Projects/crawlee-py/run_curl_cffi_bug.py", line 31, in <module>
    asyncio.run(main())
  File "/usr/lib64/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/vdusek/Projects/crawlee-py/run_curl_cffi_bug.py", line 21, in main
    response = await session.request(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vdusek/Projects/crawlee-py/.venv/lib64/python3.12/site-packages/curl_cffi/requests/session.py", line 1268, in request
    raise RequestsError(str(e), e.code, rsp) from e
curl_cffi.requests.errors.RequestsError: Failed to perform, curl: (56) CONNECT tunnel failed, response 400. See https://curl.se/libcurl/c/libcurl-errors.html first for more details.

Expected behavior

  • A proxy-specific exception should be raised instead of the generic RequestsError; HTTPX could serve as an inspiration (see the sketch after the HTTPX example below).

Example in HTTPX:

import asyncio
from httpx import AsyncClient
from proxy import Proxy

async def main():
    with Proxy(
        [
            '--hostname',
            '127.0.0.1',
            '--port',
            '8899',
            '--basic-auth',
            'username:password',
            '--disable-http-proxy',
        ]
    ):
        async with AsyncClient(proxy='http://username:password@127.0.0.1:8899') as client:
            response = await client.get(url='https://httpbin.org/get')
            print(f'status_code: {response.status_code}')
            print(f'content: {response.read()[:1000].decode()}')

if __name__ == '__main__':
    asyncio.run(main())

Results in:

httpx.ProxyError: 400 BAD REQUEST
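
For illustration, a minimal sketch of what such a mapping could look like on top of the current public API. `ProxyError` and `request_or_raise` are hypothetical names (not part of curl_cffi), and the detection relies on the curl code 56 plus the "CONNECT tunnel failed" message seen in the traceback above:

```python
# Hypothetical sketch only -- ProxyError and request_or_raise are NOT part
# of curl_cffi; they illustrate the kind of exception mapping requested here.
from curl_cffi.requests import AsyncSession
from curl_cffi.requests.errors import RequestsError


class ProxyError(RequestsError):
    """Raised when a request fails because of the configured proxy."""


async def request_or_raise(session: AsyncSession, method: str, url: str, **kwargs):
    try:
        return await session.request(method, url, **kwargs)
    except RequestsError as e:
        # curl reports this failure only as error 56 (CURLE_RECV_ERROR) with a
        # "CONNECT tunnel failed" message, so a string match is needed as well.
        if e.code == 56 and 'CONNECT tunnel failed' in str(e):
            raise ProxyError(str(e), e.code) from e
        raise
```

A proper fix would live inside curl_cffi itself, but raising a dedicated subclass like this would make the proxy failure catchable without string matching in user code.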

Versions

  • curl_cffi version: 0.7.1
  • proxy.py version: 2.4.4
  • httpx version: 0.27.0

coletdjnz (Contributor) commented

Currently, curl-cffi only has one general error for all curl errors. Adding support for mapping curl errors into Python errors would be a nice addition.

For now, the way to detect specific errors is to check the curl error code. However, for this particular error curl does not appear to distinguish the proxy failure with a dedicated code, so you have to do a string match in addition to the code check.

An example from our project: https://github.com/yt-dlp/yt-dlp/blob/master/yt_dlp/networking/_curlcffi.py#L241
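
A minimal caller-side sketch of that check, using only the curl_cffi API already shown in the traceback above (`fetch_via_proxy` is a made-up helper for illustration):

```python
from curl_cffi.requests import AsyncSession
from curl_cffi.requests.errors import RequestsError


async def fetch_via_proxy(url: str, proxy: str) -> bytes:
    """Made-up helper: fetch a URL through a proxy and surface proxy failures."""
    async with AsyncSession() as session:
        try:
            response = await session.request('get', url, proxy=proxy)
        except RequestsError as e:
            # CURLE_RECV_ERROR (56) plus the "CONNECT tunnel failed" message is
            # how this proxy failure currently surfaces, hence the string match.
            if e.code == 56 and 'CONNECT tunnel failed' in str(e):
                raise RuntimeError(f'Proxy {proxy} rejected the CONNECT request: {e}') from e
            raise
        return response.content
```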

perklet (Collaborator) commented Aug 2, 2024

Progress is tracked in #250.

vdusek added a commit to apify/crawlee-python that referenced this issue Aug 5, 2024
### Description

- Add a new curl impersonate HTTP client utilizing the
[curl-cffi](https://pypi.org/project/curl-cffi/) package.
- Improve API docs of the HTTP clients and define public & private
interfaces.
- I encountered a few bugs and rough edges in `curl-cffi` and opened
issues for them:
  - lexiforest/curl_cffi#360
  - lexiforest/curl_cffi#361
- Because of the above bugs, I decided not to make the curl impersonate
client the default and to keep the HTTPX client as the default.
- I also had to move some general components from the `basic_crawler`
module to the root of the package. This may not be ideal, so I am open
to other suggestions on how to organize it.

### Issues

- Closes: #292

### Testing

- New unit tests were written.
- Alternatively, check the example below, which uses the curl impersonate
client with the BeautifulSoup crawler.

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.http_clients import CurlImpersonateHttpClient
from crawlee.proxy_configuration import ProxyConfiguration


async def main() -> None:
    proxy_configuration = ProxyConfiguration(
        proxy_urls=[
            'http://username:password@proxy.apify.com:8000',
        ],
    )

    crawler = BeautifulSoupCrawler(
        max_requests_per_crawl=10,
        proxy_configuration=proxy_configuration,
        http_client=CurlImpersonateHttpClient(),
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}...')
        await context.enqueue_links()

        data = {
            'url': context.request.url,
            'title': context.soup.title.text if context.soup.title else '',
        }

        context.log.info(f'Extracted data: {data}')
        await context.push_data(data)

    await crawler.run(['https://apify.com', 'https://crawlee.dev/'])
    crawler.log.info('Finished crawling.')


if __name__ == '__main__':
    asyncio.run(main())
```

### TODO

- [ ] Before merging, add better documentation of the HTTP clients, how to
switch between them, and how to implement a new one.

### Checklist

- [x] CI passed