
Consider replacing the requests library with pycurl #244

Closed
jjjake opened this issue Apr 10, 2018 · 7 comments

jjjake (Owner) commented Apr 10, 2018

  • Pycurl is much faster than requests
  • Pycurl has 100-continue support
  • We've run into many ConnectionError issues with requests that we've had trouble diagnosing. IA has a better understanding of libcurl than of the requests library.

I'm also considering alternatives to pycurl, but so far it seems to be the best fit. These changes would be mostly invisible to users (e.g. we'd still return Response objects similar to the requests.Response objects returned now).

Feedback is welcome.

@jjjake jjjake added the v2 label Apr 10, 2018
@jjjake jjjake self-assigned this Apr 10, 2018
@StephenKrewson

I support this! I've been hitting ConnectionError timeouts and "max retries exceeded" errors.

Passing keyword args to the Requests library through get_session is kind of a hassle (and I'm not sure it would be any better than the defaults).
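For context, the kind of tuning one would want exposed here can be done on a plain Requests session by mounting an adapter with a retry policy. A minimal sketch, assuming the goal is fewer "max retries exceeded" failures; the specific numbers are illustrative, not internetarchive defaults:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Mount an adapter that retries transient server errors with backoff.
session = requests.Session()
retries = Retry(total=5, backoff_factor=1,
                status_forcelist=[500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))
```

Any request made through `session` to an `https://` URL then inherits the retry policy.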


zrnsm commented Feb 17, 2019

I'd be in favor of this as well. I've been running into similar timeout issues downloading large files (> 2GB). A simple hack that seems to be working for me is to manually crank up the timeout here:

response = self.item.session.get(self.url, stream=True, timeout=12)

I just noticed that a much longer timeout is used in the upload case:

request_kwargs['timeout'] = 120

request_kwargs could be similarly exposed in the download case as a short-term solution.
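A minimal sketch of what exposing request_kwargs on the download path could look like, mirroring the upload path's 120-second timeout; the helper name and defaults here are hypothetical, not internetarchive code:

```python
# Hypothetical helper: build keyword args for the download GET request,
# letting callers override the timeout instead of hard-coding it.
def build_download_kwargs(timeout=120, **overrides):
    request_kwargs = {"stream": True, "timeout": timeout}
    request_kwargs.update(overrides)
    return request_kwargs

# The download call would then become something like:
# response = self.item.session.get(self.url, **build_download_kwargs(timeout=300))
```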

maxz (Contributor) commented Apr 18, 2021

I wrote a library similar to internetarchive for my own use and use libcurl in the background for uploads and downloads.

Using Curl generally makes the whole transfer very reliable and would probably avoid many of the transfer-related errors filed as issues on this repository. But there are some problems caused by the Internet Archive's error codes and responses. For example, Curl's Retry-After support is convenient for uploads: when server load is too high, the server correctly returns a Retry-After response, which Curl can honour without any further intervention. But the server also wrongly returns a status 500 (the proper code would be 400) if the identifier is already in use, which, combined with Curl's retry behaviour, results in an endless loop unless it is handled separately. So there are definitely some caveats, mostly due to errors on the Archive's side.
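The retry policy described above can be sketched as a small decision function: honour Retry-After on overload, but never retry the erroneous 500 for an identifier that is already in use. This is an illustration only; the body check is an assumption, since the Archive's actual response text may differ:

```python
def retry_delay(status_code, headers, body=""):
    """Return seconds to wait before retrying, or None to give up."""
    # Server asked us to back off (e.g. overload): obey it.
    if "Retry-After" in headers:
        return float(headers["Retry-After"])
    # The Archive wrongly returns 500 when the identifier is already in
    # use; retrying that would loop forever, so give up instead.
    # (Matching on the body text is a guess at how to detect this case.)
    if status_code == 500 and "already" in body.lower():
        return None
    if status_code >= 500:
        return 30.0  # arbitrary fallback backoff for other server errors
    return None
```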

I have not looked at PyCurl in particular, but it can be a bit unidiomatic to use a wrapper for libcurl instead of using a native web request library. But that depends entirely on how well PyCurl is written and might not be a problem, just something to keep in mind.

jjjake (Owner, Author) commented Apr 23, 2021

Thanks for the feedback @maxz, I appreciate it! I haven't really considered this change in a while, and I'm not sure it's something I'd still do at this time. I'll keep this open in case anyone else has feedback though. Thanks again!

maxz (Contributor) commented Jul 21, 2021

I keep thinking about ways to move to PyCurl, although I have not yet come to a satisfying solution.

One way would be to write a wrapper which provides the same interface that the current session classes provide.
Another way would be to slowly implement new features and rewrite old program parts using PyCurl.

The wrapper would be a bit of a dirty solution which would further increase the cruft that we already have in some parts of ia.

The wrapper and the rewrite could also be combined, so that the wrapper serves as a stand-in as long as the relevant parts have not been rewritten.

As previously mentioned, I think it would have some real benefits for the reliability of transfers.
Another point is that with the multipart upload I'm currently working on, PyCurl would enable parallel uploads of a single file.
Implementing this with Requests is a nightmare: one would have to manage all the threads or processes manually, which would needlessly complicate the code. So I would only recommend implementing parallel uploads if it came with a switch to PyCurl.
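The manual bookkeeping being described could be sketched like this; `upload_part` is a hypothetical callable that would PUT one byte range, and with pycurl a CurlMulti handle could drive the same transfers without this fan-out code:

```python
from concurrent.futures import ThreadPoolExecutor

def upload_parts(parts, upload_part, max_workers=4):
    # Fan each part out to a worker thread and collect results in order.
    # This is only the happy path; real code would also need per-part
    # retries, progress tracking, and cancellation on failure.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(upload_part, parts))
```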

Did you drop the idea because it is so much work, or what was the matter, @jjjake?

@jjjake jjjake closed this as completed Feb 3, 2022
@laptopsftw

> I keep thinking about ways to move to PyCurl, although I did not come to a satisfying solution yet. […]

I know this is a bit of a dumb reply, but could a Requests-compatible PyCurl wrapper work for this? Like this

maxz (Contributor) commented May 4, 2022

It probably could, depending on how well it covers all of Curl and how well it follows the behaviour and API of Requests. But a very important factor is also how easily we can access the underlying Curl API: if it is merely Requests with another base, then we won't gain the features necessary for better up- and download control.
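The shape of such a wrapper could be sketched as a Requests-like facade that still keeps an escape hatch to the raw curl handle; every name here is hypothetical, not an existing API:

```python
class CurlResponse:
    """Minimal Requests-style response facade over a curl transfer."""

    def __init__(self, status_code, headers, content, raw_handle=None):
        self.status_code = status_code
        self.headers = headers
        self.content = content
        # Escape hatch: callers needing curl-specific control (e.g.
        # transfer options) could reach the underlying handle directly.
        self.raw_handle = raw_handle

    @property
    def text(self):
        return self.content.decode("utf-8", errors="replace")

    @property
    def ok(self):
        return self.status_code < 400
```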

It's probably not a feature that anyone would want to work on right now, especially since we are currently converting everything to be type annotated and still have some Python 2 cleanup to do.
