-
Notifications
You must be signed in to change notification settings - Fork 362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a contentprovider for Software Heritage persistent ID (SWHID) #988
Conversation
Thanks for submitting your first pull request! You are awesome! 🤗 |
709ff78
to
c64467c
Compare
Uploaded a new version that does not depend on |
|
||
from os import path | ||
|
||
import requests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently don't depend on requests
and have used the standard library urllib
to make HTTP requests. We should take a moment to review if we want to take on the additional maintenance cost of a new dependency vs sticking with using urllib.
Yes I was pretty sure it would be a subject for discussion :-)
Depending on requests seems pretty straightforward and add very little
"dependency hell" material to me, but I can understand your point.
David
…On 27/11/2020 10:25, Tim Head wrote:
***@***.**** commented on this pull request.
------------------------------------------------------------------------
In repo2docker/contentproviders/swhid.py
<#988 (comment)>:
> @@ -0,0 +1,113 @@
+import io
+import os
+import shutil
+import tarfile
+import time
+import re
+
+from os import path
+
+import requests
We currently don't depend on |requests| and have used the standard
library |urllib| to make HTTP requests. We should take a moment to
review if we want to take on the additional maintenance cost of a new
dependency vs sticking with using urllib.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#988 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AALNKIUHU632VGS7FWTRLDTSR5V7NANCNFSM4UDQ67VQ>.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice Pull Request!! I left a few comments about some typos I spotted and questions that came to my mind while reading the code.
I think overall it is in good shape. The biggest question for me is: do we take on a new dependency or do we use urllib
. A factor that would decide this is: how much work would it be to perform the HTTP requests without using requests
(maybe using the other content providers for inspiration).
I have a slight preference towards not adding a new dependency (even a popular one like requests).
FTR I've pushed an updated version of this PR with comments/typos taken care of (but the requests->urllib refactoring) |
Hi, what's the process to decide whether adding the dependency on requests is acceptable or not? |
I don't think we have an "official" process. Looking through your changes it looks like you're using Based on this it should be easy to switch to repo2docker/repo2docker/contentproviders/zenodo.py Lines 58 to 61 in e33d5f8
For a new project requests definitely makes sense as it's easier to use and has more features, but since this is an existing codebase there needs to be a justification for switching, which includes dealing with questions like do we make the switch across all files (best done in a separate PR) or only this one (need to justify the complexity for maintainers in understanding multiple HTTP libraries). |
One other person who had an opinion on this is Min. My summary of his point was "requests is standard, without any other influences it is what we should use.". Reflecting on this a bit I think for me the important thing is that we have one way of doing this in repo2docker, not two (or even more). Having one way means you only need to know one way, consistency amongst the code base, can keep testing infrastructure consistent, etc. So in summary I care more about "one way" than which way. Most code I write uses This means I see two ways forward:
What do you think? Is there a third option? Sorry for falling off the face of the earth/this PR. Started out as not having anything useful to say because I didn't know what I wanted and then life took over :-/ |
My idea was more of starting a urllib->request migration path than keeping both in the long term, and since So more @betatim 's
kind of thing. Also, in current urlllib-based code, tests work by mocking the So I've quickly started giving a shot at the urllib->requests migration for contentproviders, and the code itself is pretty straightforward. But tests require more work. I'll try to show a (at least partial) PR with this work ASAP. |
BTW I don't understand if |
make sure all the parts that constitute the generated docker image name are escaped, otherwise the docker building process can fail (with a rather hard to understand error). Before this fix, the `provider.content_id` was not escaped.
This content provider allows to retrieve the content from a Software Heritage (SWH) persistent identifier (SWHID). Typical usage: repo2docker swh:1:rev:94dca98c006b80309704c717b5d83dff3c1fa3a0 It uses the SWH public vault API to retrieve the content of the given directory. Most of the times, this will not need an authentication token to bypass the rate-limiting of the SWH API. Without authentication, one should be allowed to retrieve one directory content per minute. If this is not enought, then the user must use authenticated calls to the SWH API. For this, a new `swh_token` config item has been added to the Repo2Docker application class. To use authentication: repo2docker --config cfg.json swh:1:rev:94dca98c006b80309704c717b5d83dff3c1fa3a0 with the swh_token config option being defined in the cfg.json config file.
Just rebased the PR onto current master |
Finally getting around to merging this. Thanks a lot for the work and patience with the slow reviewing. It isn't the best first impression we could have made :-/ The failing test was related to figshare. |
Thanks @betatim |
Yes please or directly submit a PR if you have the time |
quick attempt jupyterhub/binderhub#1256 |
Add support for the SWHID content provider
This content provider allows to retrieve the content from a
Software Heritage (SWH) persistent identifier (SWHID).
Typical usage:
It uses the SWH public vault API to retrieve the content of the given
directory.
Most of the times, this will not need an authentication
token to bypass the rate-limiting of the SWH API.
Without authentication, one should be allowed to retrieve one
directory content per minute.
If this is not enought, then the user must use authenticated calls to
the SWH API.
For this, a new
swh_token
config item has been added to the Repo2Dockerapplication class.
To use authentication:
with the swh_token config option being defined in the cfg.json config file.