Direct access to content associated with a DOI #2

Closed
mfenner opened this issue Nov 30, 2017 · 24 comments

mfenner commented Nov 30, 2017

As a user, I want to be able to download the content associated with a DOI without first going to a landing page, so that I can quickly download a large number of datasets.

DEV NOTES

  • Content negotiation and link headers would be the natural way to do this, but neither has been widely adopted.
  • Add a "content URL" field to DataCite URL registration, in addition to the "URL" field that we already have.
  • This only scales with a standard packaging format; the best candidate in terms of functionality and adoption is probably BagIt.
  • Recommend providing the BagIt item with the "application/zip" content type.
  • Register the content type "application/zip", expressed as a URL such as https://data.datacite.org/application/zip/10.x/xyz (see the sketch below).
  • Expose all DOIs in a sitemap file, optionally broken down into individual sitemap files for each data center.
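
A minimal sketch of what direct retrieval could look like under this proposal; both the data.datacite.org URL pattern and the application/zip content negotiation are hypothetical here, and 10.5072/example is a placeholder DOI:

# fetch the packaged content directly via the proposed content-type URL pattern
curl -L -o dataset.zip "https://data.datacite.org/application/zip/10.5072/example"

# or via content negotiation against the DOI resolver
curl -L -H "Accept: application/zip" -o dataset.zip "https://doi.org/10.5072/example"
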
@mfenner mfenner added this to the Support for direct access to content milestone Nov 30, 2017
mfenner commented Nov 30, 2017

Possibly relevant: https://github.com/UTS-eResearch/datacrate by @ptsefton.

@max-mapper

Data.gov does this using the 'resources' field of Project Open Data, which includes a 'downloadURL': https://project-open-data.cio.gov/v1.1/schema/#distribution-downloadURL. They specifically distinguish downloadURL from accessURL, which I think is very helpful.
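
For illustration, a distribution entry under that schema might look like the following (all values made up):

"distribution": [
  {
    "@type": "dcat:Distribution",
    "mediaType": "text/csv",
    "downloadURL": "https://example.gov/data/widgets-2017.csv",
    "accessURL": "https://example.gov/datasets/widgets"
  }
]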

mfenner commented Dec 1, 2017

Thanks @maxogden. Schema.org does something similar with http://schema.org/url and http://schema.org/contentUrl; the latter would need an update to its documentation, as it is relevant not only to media objects.
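
As an illustration of that distinction (hypothetical values), a schema.org description could keep the landing page in url and point to the downloadable bytes via contentUrl on a DataDownload:

{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "@id": "https://doi.org/10.5072/example",
  "url": "https://repository.example.org/datasets/example",
  "distribution": {
    "@type": "DataDownload",
    "fileFormat": "application/zip",
    "contentUrl": "https://repository.example.org/files/example.zip"
  }
}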

@cameronneylon

As noted in an email from @mfenner, it could be good to think about this from the (asymmetrical) view of the packager as well as the consumer. The question of who does the packaging, and what their motivation is, may be quite important for the quality and completeness of the metadata that can be provided.

As noted above, the work on datacrate is interesting in this space. I wrote up some experiences of how challenging it can be to package things on my blog (http://cameronneylon.net/blog/walking-the-walk-how-easily-can-a-whole-project-be-shared-and-connected/ and http://cameronneylon.net/blog/packaging-data-the-core-problem-in-general-data-sharing/).

There are some good opportunities to think about how this might fit into a general workflow, with provenance and metadata being created as the researcher goes.

sckott commented Jan 3, 2018

@mfenner is there any scope here to include supplementary files associated with journal articles? Or only works that are datasets themselves?

mfenner commented Jan 3, 2018

I think the focus is on DOIs for datasets, but the same process should work for other content types. Supplementary files would be good, and I can talk to Figshare about whether they are interested.

mfenner commented Jan 4, 2018

PLOS is a special case, as they make heavy use of Crossref component DOIs for figures, tables and supplementary files. Will ask Crossref for advice.

dojobo commented Jan 26, 2018

Perhaps sort of relevant: at the European Southern Observatory we are using a Link header with rel="alternate" on our landing pages. (So far this is for a single resource URL per DOI; I suppose multiple rel="alternate" Link headers are allowed?) This came from the FORCE11 recommendations, I think. Besides being machine-readable, it can also be fetched with a HEAD request. (We use this internally, with a cron job, to monitor that landing pages and resource URLs resolve with a 200.)

An example: https://doi.org/10.18727/0722-6691/5053
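
A sketch of how a client or monitoring job might read that header (the header value shown is illustrative, not the actual ESO response):

# resolve the DOI and fetch only the headers of the landing page
curl -sIL "https://doi.org/10.18727/0722-6691/5053" | grep -i '^link:'

# a rel="alternate" Link header would look something like:
# Link: <https://example.org/files/article.pdf>; rel="alternate"; type="application/pdf"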

mfenner commented Jan 26, 2018

Thanks @dojobo. I think this aligns well with the recommendations at http://signposting.org/.

eocarragain commented Feb 7, 2018

Also note the citation_pdf_url convention used by crawlers like Google Scholar, OADOI/Unpaywall, and CORE to harvest the actual "data"/PDFs from publishers and institutional repositories. So, for example, a crawler may follow a Handle or Crossref DOI to an HTML landing page and then look for a meta element in the HTML head to identify the primary bitstream associated with the resource:

<meta content="https://my.repo.org/bitstreams/9999/mypaper.pdf" name="citation_pdf_url" />

signposting.org is a much better reference point, but this is worth noting for completeness' sake.

Google Scholar: https://scholar.google.com/intl/en/scholar/inclusion.html#indexing

mfenner commented Feb 7, 2018

In schema.org the relevant attribute would be contentUrl.

@eocarragain

See also draft recommendation of the RDA PID Kernel Information WG (https://www.rd-alliance.org/groups/pid-kernel-information-wg): https://docs.google.com/document/d/1EdS5OCoEWd4VY0HNLHkhzdQojsRgc3P8aWXYQKTqs8M/edit

Note especially:

  • digitalObjectLocation - "Pointer to the content object location (pointer to the DO). This may be in addition to a pointer to a human-readable landing page for the object"
  • etag - "Checksum of object contents. Checksum format determined via attribute type referenced in a Kernel Information record."
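
Purely as an illustration (the draft does not prescribe this serialization, and all values below are made up), a kernel information record carrying those two attributes might look like:

{
  "pid": "10.5072/example",
  "digitalObjectLocation": "https://repository.example.org/files/example.zip",
  "etag": "sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"
}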

mfenner commented Aug 10, 2018

@eocarragain thanks, this aligns with our thinking, and with the work we are doing on this in the NIH Data Commons.

eocarragain commented Aug 10, 2018

It is also worth keeping content-addressed protocols like ipfs.io (and even things like magnet links) in mind. The main thing is to avoid a recommendation that rules these protocols out, as DOI/DataCite could potentially be a nice bridge to these p2p, content-addressed networks, in that it provides a trusted mutable record which can point to an immutable content address. Having said that, I don't see anything above that would rule them out, as long as the content-addressed protocol can be expressed as the URL scheme.

For IPFS, the DataCite contentURL could point to a single file or a huge directory of content using something like ipfs://babybeiccrv3uc3hjipdnwf4nnntbxuwvt4pn5dsgelvvyueucracbevtha. The content can then be retrieved from any IPFS node which has some or all of the content, and since the "content identifier" string contains the hash of the entire content, verification is built in as part of the protocol (links to some relevant specs: CIDs, Multiformats).
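
A sketch of what resolution could look like if a contentURL carried an IPFS content identifier (the CID below is a placeholder):

# fetch via any public IPFS gateway; the CID in the path is the hash of the content itself
curl -L -o dataset.zip "https://ipfs.io/ipfs/bafybeigdyrexamplecid"

# or retrieve (and implicitly verify) over the p2p network with a local node
ipfs get bafybeigdyrexamplecid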

@eocarragain

More generally, to what extent is providing a way to verify downloaded content a requirement (in scope) here, or is the goal only to provide a direct link to a resolvable download? It is somewhat covered by BagIt and by the etag attribute in the RDA PID Kernel document.
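
For what it's worth, if the packaged content were a BagIt bag, a consumer could verify the payload against the bag's own manifest; a minimal sketch, assuming a SHA-256 payload manifest and the file names shown:

# unpack the bag and check every payload file against the checksums in the manifest
unzip dataset.zip -d dataset-bag
cd dataset-bag
sha256sum -c manifest-sha256.txt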

mfenner commented Aug 10, 2018

We should clarify the scope of this issue, which is what we can provide via DOI metadata and DOI services. We are not planning to go beyond one or more URLs and checksums. The file download protocols used, the verification of downloaded content, and also permissions are out of scope. Once contentURLs as part of DOI metadata have become the norm, or at least seen significant uptake, we can start that discussion.

@eocarragain

Sounds good & makes sense. Might be worth breaking this issue into user stories as is happening in recent issues. Cheers

pdurbin commented Oct 5, 2018

Here's an example of how to get the URL for a file from Zenodo (using jq to show just the first file):

curl -s -H "Accept: application/ld+json" https://zenodo.org/api/records/1419226 | jq '.distribution[0]'

{
  "@type": "DataDownload",
  "contentUrl": "https://zenodo.org/api/files/149d8cde-076a-478a-a4df-26b061161c36/13.3.17A5_E9_36C_dataset.HDF5",
  "fileFormat": "hdf5"
}

Thank you @cboettig for linking to this conversation from whole-tale/whole-tale#35 (comment)

Hat tip to @jggautier for mentioning the Zenodo record above at IQSS/dataverse#4371

mfenner commented Oct 5, 2018

Thanks @pdurbin. This is functionality provided by Zenodo; unfortunately, the contentURL they provide is not yet part of the DOI metadata they send. We will work with DataCite repositories to provide that information to DataCite so that we can include the content URL in the DOI metadata.

@eocarragain

This also highlights the question of whether the contentURL in the PID metadata should reference the whole thing (e.g. via the packaging formats referenced above or a simple tar/zip file) or whether an array of contentURLs, one per file, is allowed. My vote would be for the former (with bonus points if the package contains all relevant PID metadata too), but this could be harder to get adoption/consensus on.

mfenner commented Oct 5, 2018

My goal is a contentURL referencing a single file, ideally a BagIt archive that also includes metadata. I have run into a use case where I need to support multiple contentURLs (the same content in multiple cloud locations such as AWS and Google Cloud), but that is an edge case.
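
Purely as a strawman for that edge case (nothing here is decided), the DOI metadata could carry an array of equivalent locations:

"contentUrl": [
  "https://s3.amazonaws.com/example-bucket/dataset.zip",
  "https://storage.googleapis.com/example-bucket/dataset.zip"
]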

@eocarragain

For reference, a discussion of the "identifier for digital objects" PID schema being adopted by the Software Heritage archive: https://hal.archives-ouvertes.fr/hal-01865790v4. It includes using hashes to ensure the integrity of resolved content.
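
For context, those identifiers embed the object type and hash directly, e.g. an identifier of the form swh:1:cnt:<sha1> for file content; an illustrative example (the hash is a placeholder):

swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2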

@eocarragain

Also for reference, see MINIDs: http://minid.bd2k.org/. In addition to an array of locations/URLs, the JSON response has fields for "checksum", "checksum_function", and "content_key".
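
Roughly (field names from the comment above; values and exact structure are illustrative, not taken from the MINID docs), such a response might look like:

{
  "identifier": "ark:/99999/fk4example",
  "checksum": "6484968f1bb3b2da7e59a25aaea6bcb5",
  "checksum_function": "MD5",
  "content_key": "dataset.zip",
  "locations": [
    { "uri": "https://s3.amazonaws.com/example-bucket/dataset.zip" },
    { "uri": "https://storage.googleapis.com/example-bucket/dataset.zip" }
  ]
}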

stale bot commented Jan 28, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
