
Consider replacing the requests library with pycurl #244

Closed
jjjake opened this issue Apr 10, 2018 · 7 comments

jjjake (Owner) commented Apr 10, 2018

  • Pycurl is much faster than requests
  • Pycurl has 100-continue support
  • We've run into many ConnectionError issues with requests that we've had trouble diagnosing. IA has a better understanding of libcurl than of the requests library.

I'm also considering alternatives to pycurl, but so far it seems to be the best fit. These changes would be mostly invisible to users (e.g. we'd still return Response objects similar to the requests.Response objects returned now).

Feedback is welcome.

@jjjake jjjake added the v2 label Apr 10, 2018
@jjjake jjjake self-assigned this Apr 10, 2018
@StephenKrewson

I support this! I've been hitting ConnectionError timeouts and "max retries exceeded" errors.

Passing keyword args to the Requests library through get_session is kind of a hassle (and I'm not sure it would be any better than the defaults).
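For context, the kind of tuning one would want exposed here can be done on a plain Requests session by mounting an adapter with a retry policy. A minimal sketch, assuming the goal is fewer "max retries exceeded" failures; the specific numbers are illustrative, not internetarchive defaults:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Mount an adapter that retries transient server errors with backoff.
session = requests.Session()
retries = Retry(total=5, backoff_factor=1,
                status_forcelist=[500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))
```

Any request made through `session` to an `https://` URL then inherits the retry policy.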


zrnsm commented Feb 17, 2019

I'd be in favor of this as well. I've been running into similar timeout issues downloading large files (> 2GB). A simple hack that seems to be working for me is to manually crank up the timeout here:

response = self.item.session.get(self.url, stream=True, timeout=12)

I just noticed that a much longer timeout is used in the upload case:

request_kwargs['timeout'] = 120

request_kwargs could be similarly exposed in the download case as a short-term solution.
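A minimal sketch of what exposing request_kwargs on the download path could look like, mirroring the upload path's 120-second timeout; the helper name and defaults here are hypothetical, not internetarchive code:

```python
# Hypothetical helper: build keyword args for the download GET request,
# letting callers override the timeout instead of hard-coding it.
def build_download_kwargs(timeout=120, **overrides):
    request_kwargs = {"stream": True, "timeout": timeout}
    request_kwargs.update(overrides)
    return request_kwargs

# The download call would then become something like:
# response = self.item.session.get(self.url, **build_download_kwargs(timeout=300))
```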

maxz (Contributor) commented Apr 18, 2021

I wrote a library similar to internetarchive for my own use and use libcurl in the background for uploads and downloads.

Using Curl generally makes the whole transfer very reliable and would probably avoid many of the transfer-related errors filed as issues on this repository. But there are some problems caused by the Internet Archive's error codes and responses. For example, Curl's Retry-After support is convenient for uploads: when server load is too high, the server correctly returns a Retry-After response, which Curl can honour without any further intervention. But the server also wrongly returns a status 500 (the proper code would be 400) if the identifier is already in use, which, combined with Curl's retry behaviour, results in an endless loop unless it is handled separately. So there are definitely some caveats, mostly due to errors on the Archive's side.
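The retry policy described above can be sketched as a small decision function: honour Retry-After on overload, but never retry the erroneous 500 for an identifier that is already in use. This is an illustration only; the body check is an assumption, since the Archive's actual response text may differ:

```python
def retry_delay(status_code, headers, body=""):
    """Return seconds to wait before retrying, or None to give up."""
    # Server asked us to back off (e.g. overload): obey it.
    if "Retry-After" in headers:
        return float(headers["Retry-After"])
    # The Archive wrongly returns 500 when the identifier is already in
    # use; retrying that would loop forever, so give up instead.
    # (Matching on the body text is a guess at how to detect this case.)
    if status_code == 500 and "already" in body.lower():
        return None
    if status_code >= 500:
        return 30.0  # arbitrary fallback backoff for other server errors
    return None
```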

I have not looked at PyCurl in particular, but it can be a bit unidiomatic to use a wrapper for libcurl instead of using a native web request library. But that depends entirely on how well PyCurl is written and might not be a problem, just something to keep in mind.

jjjake (Owner, Author) commented Apr 23, 2021

Thanks for the feedback @maxz, I appreciate it! I haven't really considered this change in a while, and I'm not sure it's something I'd still do at this time. I'll keep this open in case anyone else has feedback though. Thanks again!

maxz (Contributor) commented Jul 21, 2021

I keep thinking about ways to move to PyCurl, although I have not yet come to a satisfying solution.

One way would be to write a wrapper which provides the same interface that the current session classes provide.
Another way would be to slowly implement new features and rewrite old program parts using PyCurl.

The wrapper would be a bit of a dirty solution which would further increase the cruft that we already have in some parts of ia.

The wrapper and the rewrite could also be combined, so that the wrapper serves as a stand-in as long as the relevant parts have not been rewritten.

As previously mentioned, I think it would have some real benefits for the reliability of transfers.
Another point is that with the multipart upload I'm currently working on, PyCurl would enable parallel uploads of a single file.
Implementing this with Requests is a nightmare: one would have to manage all the threads or processes manually, which would needlessly complicate the code. So I would only recommend implementing parallel uploads if it came with a switch to PyCurl.
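The manual bookkeeping being described could be sketched like this; `upload_part` is a hypothetical callable that would PUT one byte range, and with pycurl a CurlMulti handle could drive the same transfers without this fan-out code:

```python
from concurrent.futures import ThreadPoolExecutor

def upload_parts(parts, upload_part, max_workers=4):
    # Fan each part out to a worker thread and collect results in order.
    # This is only the happy path; real code would also need per-part
    # retries, progress tracking, and cancellation on failure.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(upload_part, parts))
```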

Did you drop the idea because it is so much work, or what was the matter, @jjjake?

@jjjake jjjake closed this as completed Feb 3, 2022
@laptopsftw

> I keep thinking about ways to move to PyCurl, although I did not come to a satisfying solution yet. […]

I know this is a bit of a dumb reply, but could a Requests-compatible PyCurl wrapper work for this? Like this

maxz (Contributor) commented May 4, 2022

It probably could, depending on how well it covers all of Curl and how well it follows the behaviour and API of Requests. But a very important factor is also how easily we can access the underlying Curl API: if it is merely Requests with another base, then we won't gain the features necessary for better up- and download control.
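The shape of such a wrapper could be sketched as a Requests-like facade that still keeps an escape hatch to the raw curl handle; every name here is hypothetical, not an existing API:

```python
class CurlResponse:
    """Minimal Requests-style response facade over a curl transfer."""

    def __init__(self, status_code, headers, content, raw_handle=None):
        self.status_code = status_code
        self.headers = headers
        self.content = content
        # Escape hatch: callers needing curl-specific control (e.g.
        # transfer options) could reach the underlying handle directly.
        self.raw_handle = raw_handle

    @property
    def text(self):
        return self.content.decode("utf-8", errors="replace")

    @property
    def ok(self):
        return self.status_code < 400
```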

It's probably not a feature that anyone would want to work on right now, especially since we are currently converting everything to be type annotated and still have some Python 2 cleanup to do.
