Add a contentprovider for Software Heritage persistent ID (SWHID) #988

Merged (3 commits) on Jan 26, 2021

Conversation

douardda
Contributor

Add support for the SWHID content provider

This content provider makes it possible to retrieve content from a
Software Heritage (SWH) persistent identifier (SWHID).
Typical usage:

  repo2docker swh:1:rev:94dca98c006b80309704c717b5d83dff3c1fa3a0

It uses the SWH public vault API to retrieve the content of the given
directory.

Most of the time, this will not need an authentication
token to bypass the rate limiting of the SWH API.
Without authentication, one is allowed to retrieve the content of one
directory per minute.

If this is not enough, then the user must use authenticated calls to
the SWH API.

For this, a new swh_token config item has been added to the Repo2Docker
application class.

To use authentication:

  repo2docker --config cfg.json swh:1:rev:94dca98c006b80309704c717b5d83dff3c1fa3a0

with the swh_token config option being defined in the cfg.json config file.
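For reference, a cfg.json enabling authenticated calls could look like the following sketch (the traitlets-style JSON layout and the placeholder token value are assumptions for illustration, not taken from this PR):

```json
{
  "Repo2Docker": {
    "swh_token": "<your-swh-api-token>"
  }
}
```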

@welcome

welcome bot commented Nov 26, 2020

Thanks for submitting your first pull request! You are awesome! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

@douardda force-pushed the swhid branch 2 times, most recently from 709ff78 to c64467c on November 26, 2020 14:05
@douardda
Contributor Author

Uploaded a new version that does not depend on swh.model at all, to keep compatibility with Python 3.6.


from os import path

import requests
Member


We currently don't depend on requests and have used the standard library urllib to make HTTP requests. We should take a moment to review whether we want to take on the additional maintenance cost of a new dependency vs sticking with urllib.

@douardda
Contributor Author

douardda commented Nov 27, 2020 via email

Member

@betatim left a comment


This is a nice Pull Request!! I left a few comments about some typos I spotted and questions that came to my mind while reading the code.

I think overall it is in good shape. The biggest question for me is: do we take on a new dependency or do we use urllib. A factor that would decide this is: how much work would it be to perform the HTTP requests without using requests (maybe using the other content providers for inspiration).

I have a slight preference towards not adding a new dependency (even a popular one like requests).

@douardda
Contributor Author

FTR I've pushed an updated version of this PR with comments/typos taken care of (except for the requests->urllib refactoring)

@douardda
Contributor Author

douardda commented Dec 7, 2020

Hi, what's the process to decide whether adding the dependency on requests is acceptable or not?

@manics
Member

manics commented Dec 7, 2020

I don't think we have an "official" process.

Looking through your changes, it looks like you're using requests.Session, which is usually needed when you have to keep a login or persistent cookie across requests, but there's no mention of such a requirement on https://archive.softwareheritage.org/api/

Based on this it should be easy to switch to urllib's Request (from urllib.request import Request), e.g.:

  req = Request(
      "{}{}".format(host["api"], record_id),
      headers={"accept": "application/json"},
  )

For a new project, requests definitely makes sense as it's easier to use and has more features. But since this is an existing codebase, there needs to be a justification for switching, which includes dealing with questions like: do we make the switch across all files (best done in a separate PR) or only this one (which means justifying, for maintainers, the complexity of understanding multiple HTTP libraries)?
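To make the trade-off concrete, the Request pattern above can be fleshed out into a small self-contained sketch; build_request, fetch_json, and the bearer-token header are illustrative assumptions, not repo2docker or SWH-specific code:

```python
import json
from urllib.request import Request, urlopen


def build_request(api_base, record_id, token=None):
    # Hypothetical helper illustrating the urllib pattern quoted above.
    headers = {"accept": "application/json"}
    if token is not None:
        # Bearer-token authentication, assumed here for illustration.
        headers["authorization"] = "Bearer {}".format(token)
    return Request("{}{}".format(api_base, record_id), headers=headers)


def fetch_json(api_base, record_id, token=None):
    # urlopen accepts a Request object directly, so header handling
    # stays in one place.
    with urlopen(build_request(api_base, record_id, token)) as resp:
        return json.load(resp)
```

Everything here is standard library, which is the point of the comparison: the same GET-with-headers logic that requests.Session provides takes only a few extra lines with urllib.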

@betatim
Member

betatim commented Dec 8, 2020

One other person who had an opinion on this is Min. My summary of his point was: "requests is standard; without any other influences it is what we should use." Reflecting on this a bit, I think the important thing for me is that we have one way of doing this in repo2docker, not two (or even more). Having one way means you only need to learn one way, we get consistency across the code base, we can keep the testing infrastructure consistent, etc. So in summary I care more about having "one way" than about which way.

Most code I write uses requests, nearly none uses urllib.

This means I see two ways forward:

  1. change this PR to use urllib, or
  2. change the other content providers to use requests

What do you think? Is there a third option?

Sorry for falling off the face of the earth/this PR. It started out as me not having anything useful to say because I didn't know what I wanted, and then life took over :-/

@douardda
Contributor Author

Thanks @manics and @betatim

For a new project requests definitely makes sense as it's easier to use and has more features, but since this is an existing codebase there needs to be a justification for switching, which includes dealing with questions like do we make the switch across all files (best done in a separate PR) or only this one (need to justify the complexity for maintainers in understanding multiple HTTP libraries).

My idea was more of starting a urllib->requests migration path than keeping both in the long term, and since requests is already used in the tests, the need for dealing with multiple HTTP libraries is de facto already there.

So, more of @betatim's

  2. change the other content providers to use requests

kind of thing.

Also, in the current urllib-based code, tests work by mocking the urlopen() method of content providers (e.g. Dataverse or Figshare), which I think is not ideal because it leaves these methods, which are part of the code under test, untested.
I also find the requests_mock library pretty handy for writing unit tests for (requests-based) HTTP-dependent features.
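The testing concern can be illustrated with a stdlib-only sketch; fetch_record and the injected opener are hypothetical stand-ins for a urlopen-based content provider method, not actual repo2docker code:

```python
import io
import json
from unittest import mock


def fetch_record(url, opener):
    # Code under test: performs the HTTP call through `opener`
    # (urlopen in production; injected here to make the stubbing explicit).
    with opener(url) as resp:
        return json.load(resp)


# Stubbing the opener replaces the HTTP call entirely, so the URL and
# header handling of a real urlopen-based helper would go unexercised;
# that is the gap a transport-level mock like requests_mock avoids.
fake_opener = mock.MagicMock()
fake_opener.return_value.__enter__.return_value = io.StringIO('{"ok": true}')
```

With this stub, fetch_record("https://example.org/api", fake_opener) returns {"ok": True} without any network traffic, but none of the real request-building code has run.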

So I've started taking a quick shot at the urllib->requests migration for content providers; the code itself is pretty straightforward, but the tests require more work.

I'll try to show a (at least partial) PR with this work ASAP.

@douardda
Contributor Author

BTW, I don't understand whether hdl handles are supposed to be supported or not. The Dataverse test code uses one of them (and the test passes due to the "large spectrum" mocking of the doi2url method), but I see no code in doi.py or dataverse.py able to handle them. More specifically, the is_doi() function will return False for a hdl handle, so the actual doi2url() will not be able to resolve it.
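The is_doi() point can be sketched with a minimal, hypothetical DOI check (looks_like_doi and its regex are illustrative, not repo2docker's actual implementation); a hdl handle simply never matches the "10.<registrant>/<suffix>" shape of a DOI:

```python
import re

# DOIs start with a "10." directory indicator followed by a registrant
# code and a suffix; hdl handles (e.g. "hdl:11529/10016", a made-up
# example) lack that prefix entirely.
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")


def looks_like_doi(value):
    # Sketch only; the real is_doi() may differ in detail.
    return DOI_RE.search(value) is not None
```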

Make sure all the parts that constitute the generated Docker image name
are escaped, otherwise the Docker build process can fail (with a rather
hard-to-understand error).

Before this fix, the `provider.content_id` was not escaped.
@douardda
Contributor Author

Just rebased the PR onto current master

@betatim
Member

betatim commented Jan 26, 2021

Finally getting around to merging this.

Thanks a lot for the work and patience with the slow reviewing. It isn't the best first impression we could have made :-/

The failing test was related to figshare.

@douardda
Contributor Author

Thanks @betatim
Now the next step will be to add support for SWHIDs in BinderHub, right? Should I create an issue there?

@betatim
Member

betatim commented Jan 27, 2021

Yes please, or directly submit a PR if you have the time

@douardda
Contributor Author

Quick attempt: jupyterhub/binderhub#1256
