Add a contentprovider for Software Heritage persistent ID (SWHID) #988

Merged (3 commits) on Jan 26, 2021

Conversation

douardda
Contributor

Add support for the SWHID content provider

This content provider makes it possible to retrieve content from a
Software Heritage (SWH) persistent identifier (SWHID).
Typical usage:

  repo2docker swh:1:rev:94dca98c006b80309704c717b5d83dff3c1fa3a0

It uses the SWH public vault API to retrieve the content of the given
directory.

Most of the time, this will not need an authentication
token to bypass the rate limiting of the SWH API.
Without authentication, one is allowed to retrieve the content of one
directory per minute.

If this is not enough, then the user must use authenticated calls to
the SWH API.

For this, a new swh_token config item has been added to the Repo2Docker
application class.

To use authentication:

  repo2docker --config cfg.json swh:1:rev:94dca98c006b80309704c717b5d83dff3c1fa3a0

with the swh_token config option being defined in the cfg.json config file.
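For reference, a cfg.json enabling authenticated calls could look like the following sketch (the traitlets-style JSON layout and the placeholder token value are assumptions for illustration, not taken from this PR):

```json
{
  "Repo2Docker": {
    "swh_token": "<your-swh-api-token>"
  }
}
```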

@welcome

welcome bot commented Nov 26, 2020

Thanks for submitting your first pull request! You are awesome! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

@douardda force-pushed the swhid branch 2 times, most recently from 709ff78 to c64467c on November 26, 2020 14:05
@douardda
Contributor Author

Uploaded a new version that does not depend on swh.model at all, to keep compatibility with Python 3.6.


from os import path

import requests
Member


We currently don't depend on requests and have used the standard library urllib to make HTTP requests. We should take a moment to review whether we want to take on the additional maintenance cost of a new dependency vs sticking with urllib.

@douardda
Contributor Author

douardda commented Nov 27, 2020 via email

Member

@betatim left a comment


This is a nice Pull Request!! I left a few comments about some typos I spotted and questions that came to my mind while reading the code.

I think overall it is in good shape. The biggest question for me is: do we take on a new dependency or do we use urllib. A factor that would decide this is: how much work would it be to perform the HTTP requests without using requests (maybe using the other content providers for inspiration).

I have a slight preference towards not adding a new dependency (even a popular one like requests).

@douardda
Contributor Author

FTR I've pushed an updated version of this PR with comments/typos taken care of (except for the requests->urllib refactoring)

@douardda
Contributor Author

douardda commented Dec 7, 2020

Hi, what's the process to decide whether adding the dependency on requests is acceptable or not?

@manics
Member

manics commented Dec 7, 2020

I don't think we have an "official" process.

Looking through your changes, it looks like you're using requests.Session, which is usually needed when you have to keep a login or persistent cookie across requests, but there's no mention of such a requirement on https://archive.softwareheritage.org/api/

Based on this it should be easy to switch to urllib's Request (from urllib.request import Request), e.g.:

  req = Request(
      "{}{}".format(host["api"], record_id),
      headers={"accept": "application/json"},
  )

For a new project, requests definitely makes sense as it's easier to use and has more features. But since this is an existing codebase, there needs to be a justification for switching, which includes dealing with questions like: do we make the switch across all files (best done in a separate PR) or only this one (which means justifying, for maintainers, the complexity of understanding multiple HTTP libraries)?
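To make the trade-off concrete, the Request pattern above can be fleshed out into a small self-contained sketch; build_request, fetch_json, and the bearer-token header are illustrative assumptions, not repo2docker or SWH-specific code:

```python
import json
from urllib.request import Request, urlopen


def build_request(api_base, record_id, token=None):
    # Hypothetical helper illustrating the urllib pattern quoted above.
    headers = {"accept": "application/json"}
    if token is not None:
        # Bearer-token authentication, assumed here for illustration.
        headers["authorization"] = "Bearer {}".format(token)
    return Request("{}{}".format(api_base, record_id), headers=headers)


def fetch_json(api_base, record_id, token=None):
    # urlopen accepts a Request object directly, so header handling
    # stays in one place.
    with urlopen(build_request(api_base, record_id, token)) as resp:
        return json.load(resp)
```

Everything here is standard library, which is the point of the comparison: the same GET-with-headers logic that requests.Session provides takes only a few extra lines with urllib.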

@betatim
Member

betatim commented Dec 8, 2020

One other person who had an opinion on this is Min. My summary of his point was: "requests is standard; without any other influences it is what we should use." Reflecting on this a bit, I think the important thing for me is that we have one way of doing this in repo2docker, not two (or even more). Having one way means you only need to learn one way, we get consistency across the code base, we can keep the testing infrastructure consistent, etc. So in summary I care more about having "one way" than about which way.

Most code I write uses requests, nearly none uses urllib.

This means I see two ways forward:

  1. change this PR to use urllib, or
  2. change the other content providers to use requests

What do you think? Is there a third option?

Sorry for falling off the face of the earth/this PR. It started out as me not having anything useful to say because I didn't know what I wanted, and then life took over :-/

@douardda
Contributor Author

Thanks @manics and @betatim

For a new project requests definitely makes sense as it's easier to use and has more features, but since this is an existing codebase there needs to be a justification for switching, which includes dealing with questions like do we make the switch across all files (best done in a separate PR) or only this one (need to justify the complexity for maintainers in understanding multiple HTTP libraries).

My idea was more of starting a urllib->requests migration path than keeping both in the long term, and since requests is already used in the tests, the need for dealing with multiple HTTP libraries is de facto already there.

So, more of @betatim's

  2. change the other content providers to use requests

kind of thing.

Also, in the current urllib-based code, tests work by mocking the urlopen() method of content providers (e.g. Dataverse or Figshare), which I think is not ideal because it leaves these methods, which are part of the code under test, untested.
I also find the requests_mock library pretty handy for writing unit tests for (requests-based) HTTP-dependent features.
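The testing concern can be illustrated with a stdlib-only sketch; fetch_record and the injected opener are hypothetical stand-ins for a urlopen-based content provider method, not actual repo2docker code:

```python
import io
import json
from unittest import mock


def fetch_record(url, opener):
    # Code under test: performs the HTTP call through `opener`
    # (urlopen in production; injected here to make the stubbing explicit).
    with opener(url) as resp:
        return json.load(resp)


# Stubbing the opener replaces the HTTP call entirely, so the URL and
# header handling of a real urlopen-based helper would go unexercised;
# that is the gap a transport-level mock like requests_mock avoids.
fake_opener = mock.MagicMock()
fake_opener.return_value.__enter__.return_value = io.StringIO('{"ok": true}')
```

With this stub, fetch_record("https://example.org/api", fake_opener) returns {"ok": True} without any network traffic, but none of the real request-building code has run.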

So I've started taking a quick shot at the urllib->requests migration for content providers; the code itself is pretty straightforward, but the tests require more work.

I'll try to show a (at least partial) PR with this work ASAP.

@douardda
Contributor Author

BTW, I don't understand whether hdl handles are supposed to be supported or not. The Dataverse test code uses one of them (and the test passes due to the "large spectrum" mocking of the doi2url method), but I see no code in doi.py or dataverse.py able to handle them. More specifically, the is_doi() function will return False for a hdl handle, so the actual doi2url() will not be able to resolve it.
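The is_doi() point can be sketched with a minimal, hypothetical DOI check (looks_like_doi and its regex are illustrative, not repo2docker's actual implementation); a hdl handle simply never matches the "10.<registrant>/<suffix>" shape of a DOI:

```python
import re

# DOIs start with a "10." directory indicator followed by a registrant
# code and a suffix; hdl handles (e.g. "hdl:11529/10016", a made-up
# example) lack that prefix entirely.
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")


def looks_like_doi(value):
    # Sketch only; the real is_doi() may differ in detail.
    return DOI_RE.search(value) is not None
```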

Make sure all the parts that constitute the generated Docker image name
are escaped, otherwise the Docker build process can fail (with a rather
hard-to-understand error).

Before this fix, the `provider.content_id` was not escaped.
@douardda
Contributor Author

Just rebased the PR onto current master

@betatim
Member

betatim commented Jan 26, 2021

Finally getting around to merging this.

Thanks a lot for the work and patience with the slow reviewing. It isn't the best first impression we could have made :-/

The failing test was related to figshare.

@douardda
Contributor Author

Thanks @betatim
Now the next step will be to add support for SWHIDs in BinderHub, right? Should I create an issue there?

@betatim
Member

betatim commented Jan 27, 2021

Yes please, or directly submit a PR if you have the time

@douardda
Contributor Author

Quick attempt: jupyterhub/binderhub#1256
