Direct access to content associated with a DOI #2

mfenner · 2017-11-30T09:01:17Z

As a user, I want to be able to download the content associated with a DOI without first going to a landing page, so that I can quickly download a large number of datasets.

DEV NOTES

content negotiation and link headers would be the way to do this, but have never been widely adopted
add a field "content URL" to DataCite URL registration, in addition to the "URL" field that we already have
only scales with a standard packaging format, and the best candidate in terms of functionality and adoption might be bagit.
recommend to provide the bagit item as "application/zip" content type.
register as content type "application/zip", expressed as URL that would be https://data.datacite.org/application/zip/10.x/xyz.
expose all DOIs in a sitemap file, optionally broken down in individual sitemap files for each data center.
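
A minimal sketch of how a client might build the content URL described in the notes above, assuming the pattern `https://data.datacite.org/<content type>/<DOI>` taken from the example. The helper name `content_url` is hypothetical, and whether the service returns anything for a given DOI depends on what the data center has registered.

```python
from urllib.parse import quote

BASE_URL = "https://data.datacite.org"  # host taken from the example URL in the notes


def content_url(doi, content_type="application/zip"):
    """Build <base>/<content type>/<DOI>. Hypothetical helper, not a DataCite API."""
    doi = doi.strip()
    # Accept DOIs written as resolver URLs or with a "doi:" prefix.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Keep "/" so the prefix/suffix separator survives; percent-encode other unsafe characters.
    return f"{BASE_URL}/{content_type}/{quote(doi, safe='/')}"


if __name__ == "__main__":
    print(content_url("10.x/xyz"))
```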

mfenner · 2017-11-30T09:21:16Z

Possibly relevant: https://github.com/UTS-eResearch/datacrate by @ptsefton.

max-mapper · 2017-12-01T00:06:41Z

Data.gov does this using the 'resources' field of Project Open Data, which includes a 'downloadURL' https://project-open-data.cio.gov/v1.1/schema/#distribution-downloadURL. They specifically distinguish downloadURL from accessURL which I think is very helpful.

mfenner · 2017-12-01T23:03:27Z

Thanks @maxogden. Schema.org does something similar with http://schema.org/url and http://schema.org/contentUrl - the latter would need an update of the documentation, as it is relevant not only to media objects.

cameronneylon · 2017-12-05T07:29:39Z

As noted in an email from @mfenner, it could be good to think about this from the (asymmetrical) view of the packager as well as the consumer. The question of who does the packaging, and what their motivation is, may be quite important in terms of the quality and completeness of the metadata that can be provided.

As noted above the work on datacrate is interesting in this space. I wrote up some experiences on finding it challenging to package things on the blog (http://cameronneylon.net/blog/walking-the-walk-how-easily-can-a-whole-project-be-shared-and-connected/ http://cameronneylon.net/blog/packaging-data-the-core-problem-in-general-data-sharing/).

There are some good opportunities to think about how this might fit into a general workflow with provenance, metadata being created as the researcher goes.

sckott · 2018-01-03T18:14:16Z

@mfenner is there any scope here to include supplementary files associated with journal articles? Or only works that are datasets themselves?

mfenner · 2018-01-03T19:13:06Z

I think the focus is on DOIs for datasets, but the same process should work for other content types. Supplementary files would be good and I can talk to Figshare whether they are interested.

mfenner · 2018-01-04T11:32:12Z

PLOS is a special case, as they make heavy use of Crossref component DOIs for figures, tables and supplementary files. Will ask Crossref for advice.

dojobo · 2018-01-26T14:38:32Z

Perhaps sort of relevant, at European Southern Observatory we are using a Link header with rel="alternate" on our landing pages. (So far this is for a single resource URL for the DOI... I suppose multiple rel="alternate" Link headers are allowed?) This came from those FORCE11 recommendations I think. Besides being machine-readable, you can also fetch it with a HEAD request. (We use this internally to monitor, with a cron job, that landing pages and resource URLs resolve with a 200.)

An example: https://doi.org/10.18727/0722-6691/5053
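
A rough sketch of how a client could read such Link headers, assuming the landing page returns them on a HEAD request as described above. The function names are hypothetical and the regular expressions only cover the common `<url>; name="value"` form, not every corner of the Link header grammar. A response may carry several Link header fields, or several comma-separated links in one field, so the parser accepts both.

```python
import re
import urllib.request

# One link-value: <target> followed by everything up to the next "<".
_LINK_VALUE = re.compile(r'<([^>]*)>([^<]*)')
# One parameter: ;name="quoted value" or ;name=token
_PARAM = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s;,]+))')


def parse_link_header(value):
    """Split a Link header value into dicts like {"url": ..., "rel": ..., "type": ...}."""
    links = []
    for target, params in _LINK_VALUE.findall(value):
        link = {"url": target}
        for name, quoted, bare in _PARAM.findall(params):
            link[name.lower()] = quoted or bare
        links.append(link)
    return links


def alternate_urls(header_values):
    """Return the targets of all links whose rel contains "alternate"."""
    urls = []
    for value in header_values:
        for link in parse_link_header(value):
            # rel may list several space-separated relation types.
            if "alternate" in link.get("rel", "").lower().split():
                urls.append(link["url"])
    return urls


def fetch_link_headers(url):
    """Send a HEAD request and return all Link header values (needs network access)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get_all("Link") or []
```

With network access, `alternate_urls(fetch_link_headers("https://doi.org/10.18727/0722-6691/5053"))` would list whatever rel="alternate" targets the landing page currently advertises.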

mfenner · 2018-01-26T16:30:22Z

Thanks @dojobo. I think this aligns well with the recommendations at http://signposting.org/.

eocarragain · 2018-02-07T15:56:13Z

Also note the citation_pdf_url convention used by crawlers such as Google Scholar, OADOI/Unpaywall, and CORE to harvest the actual "data"/PDFs from publishers and institutional repositories. For example, a crawler may follow a Handle or Crossref DOI to an HTML landing page and then look for a meta element in the HTML head to identify the primary bitstream associated with the resource:

<meta content="https://my.repo.org/bitstreams/9999/mypaper.pdf" name="citation_pdf_url" />

signposting.org is a much better reference point, but this is worth noting for completeness' sake.

Google Scholar: https://scholar.google.com/intl/en/scholar/inclusion.html#indexing
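
A minimal sketch of what such a crawler step might look like, using only the Python standard library. The class and function names are hypothetical, and real crawlers will apply more rules than this (for example, resolving relative URLs or checking that the target really is a PDF).

```python
from html.parser import HTMLParser


class CitationPdfParser(HTMLParser):
    """Collect the content of every <meta name="citation_pdf_url"> element."""

    def __init__(self):
        super().__init__()
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name == "citation_pdf_url" and attributes.get("content"):
            self.pdf_urls.append(attributes["content"])


def citation_pdf_urls(html):
    """Return all citation_pdf_url values found in an HTML document."""
    parser = CitationPdfParser()
    parser.feed(html)
    return parser.pdf_urls
```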

mfenner · 2018-02-07T16:04:37Z

In schema.org the relevant attribute would be contentUrl.

eocarragain · 2018-08-10T13:49:46Z

See also draft recommendation of the RDA PID Kernel Information WG (https://www.rd-alliance.org/groups/pid-kernel-information-wg): https://docs.google.com/document/d/1EdS5OCoEWd4VY0HNLHkhzdQojsRgc3P8aWXYQKTqs8M/edit

Note especially:

digitalObjectLocation - "Pointer to the content object location (pointer to the DO). This may be in addition to a pointer to a human-readable landing page for the object"
etag - "Checksum of object contents. Checksum format determined via attribute type referenced in a Kernel Information record."

mfenner · 2018-08-10T15:06:27Z

@eocarragain thanks, this aligns with our thinking, and with the work we are doing on this in the NIH Data Commons.

eocarragain · 2018-08-10T15:35:11Z

It is also worth keeping content-addressed protocols like ipfs.io (and even things like magnet links) in mind. The main thing is to avoid a recommendation that rules these protocols out, as DOI/DataCite could potentially be a nice bridge to these p2p, content-addressed networks: it provides a trusted mutable record that can point to an immutable content address. Having said that, I don't see anything above that would rule them out, as long as the content-addressed protocol can be expressed as a URL scheme.

For IPFS, the DataCite contentURL could point to a single file or a huge directory of content using something like ipfs://babybeiccrv3uc3hjipdnwf4nnntbxuwvt4pn5dsgelvvyueucracbevtha . The content can then be retrieved from any ipfs node which has some or all of the content, and since the "content identifier" string contains the hash of the entire content, verification is built in as part of the protocol (links to some relevant specs: CIDs, Multiformats).

eocarragain · 2018-08-10T15:36:23Z

More generally, to what extent is providing a way to verify downloaded content a requirement (in scope) here, or is the goal only to provide a direct link to a resolvable download? It is somewhat covered by Bagit and by the etag attribute in the RDA PID Kernel document.

mfenner · 2018-08-10T16:14:46Z

We should clarify the scope of this issue, which is what we can provide via DOI metadata and DOI services. We are not planning to go beyond one or more URLs and checksums. The protocols used for file downloads, the verification of downloaded content, and permissions are out of scope. Once contentURLs as part of DOI metadata have become the norm, or have at least seen significant uptake, we can start that discussion.

eocarragain · 2018-08-10T17:33:15Z

Sounds good & makes sense. Might be worth breaking this issue into user stories as is happening in recent issues. Cheers

pdurbin · 2018-10-05T15:38:26Z

Here's an example of how to get the URL for a file from Zenodo (using jq to show just the first file):

curl -s -H "Accept: application/ld+json" https://zenodo.org/api/records/1419226 | jq '.distribution[0]'

{
  "@type": "DataDownload",
  "contentUrl": "https://zenodo.org/api/files/149d8cde-076a-478a-a4df-26b061161c36/13.3.17A5_E9_36C_dataset.HDF5",
  "fileFormat": "hdf5"
}
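
The same extraction can be sketched in Python, assuming the JSON-LD response has the shape shown above (a `distribution` array of `DataDownload` objects with a `contentUrl`). The function name is hypothetical, and fetching the record is left out so the example runs without network access.

```python
import json


def content_urls(jsonld_text):
    """Return every contentUrl listed under "distribution" in a schema.org JSON-LD record."""
    record = json.loads(jsonld_text)
    distribution = record.get("distribution", [])
    # A single distribution may be given as an object rather than an array.
    if isinstance(distribution, dict):
        distribution = [distribution]
    return [
        item["contentUrl"]
        for item in distribution
        if isinstance(item, dict) and "contentUrl" in item
    ]
```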

Thank you @cboettig for linking to this conversation from whole-tale/whole-tale#35 (comment)

Hat tip to @jggautier for mentioning the Zenodo record above at IQSS/dataverse#4371

mfenner · 2018-10-05T16:12:02Z

Thanks @pdurbin. This is functionality provided by Zenodo; unfortunately, the contentURL that they provide is not yet part of the DOI metadata they send. We will work with DataCite repositories to provide that information to DataCite so that we can include the content URL in the DOI metadata.

eocarragain · 2018-10-05T17:00:34Z

This also raises the question of whether the contentURL in the PID metadata should reference the whole thing (e.g. via the packaging formats referenced above or a simple tar/zip file) or whether an array of contentURLs, one for each file, is allowed. My vote would be for the former (with bonus points if the package contains all relevant PID metadata too), but it could be harder to get adoption/consensus for that.

mfenner · 2018-10-05T17:03:54Z

My goal is a contentURL referencing a single file, ideally a bagit archive that also includes metadata. I have run into a use case where I need to support multiple contentURLs - the same content in multiple cloud locations (AWS, Google Cloud), but that is an edge case.

eocarragain · 2018-10-30T11:01:06Z

For reference, discussion of the "identifier for digital objects" PID schema being adopted by the Software Heritage archive: https://hal.archives-ouvertes.fr/hal-01865790v4 . Includes using hashes for ensuring the integrity of resolved content.

eocarragain · 2018-10-30T11:12:38Z

Also for reference, see MINIDs: http://minid.bd2k.org/. In addition to an array of locations/urls, the json response has fields for "checksum" and "checksum_function" and "content_key".
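
A minimal sketch of how a client might use a checksum and checksum function from such a record to verify a downloaded file. The function name is hypothetical, and it assumes the checksum is a hex digest and that the checksum function is given as a name Python's `hashlib` understands (for example `sha256` or `md5`); a real record may need its values mapped first.

```python
import hashlib


def verify_checksum(path, expected, function="sha256", chunk_size=1 << 20):
    """Return True if the file at `path` hashes to the hex digest `expected`."""
    digest = hashlib.new(function)
    with open(path, "rb") as handle:
        # Read in chunks so large datasets do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().lower() == expected.strip().lower()
```

A caller would pass the record's checksum as `expected` and its checksum function as `function`, after downloading the file from one of the listed locations.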

stale · 2019-01-28T11:29:12Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mfenner added the user story label Nov 30, 2017

mfenner added this to the Support for direct access to content milestone Nov 30, 2017

mfenner added the discussion label Nov 30, 2017

mfenner removed discussion labels Dec 29, 2017

mfenner mentioned this issue Jan 3, 2018

How should one get a data download URL from a DOI? ropensci-archive/doidata#1

Open

noamross mentioned this issue Jan 3, 2018

Should we include data from journal supplementary files? ropensci-archive/doidata#2

Open

mfenner mentioned this issue Jan 25, 2018

Validation of identifier datacite/schema#43

Closed

richardhallett added the discussion label Feb 7, 2018

mfenner added the FREYA label Apr 8, 2018

oxinabox mentioned this issue Jun 14, 2018

Add DataCite API oxinabox/DataDepsGenerators.jl#28

Merged

mfenner removed FREYA labels Jul 16, 2018

mfenner added discussion and removed discussion labels Jul 31, 2018

cboettig mentioned this issue Oct 5, 2018

Binder integration whole-tale/whole-tale#35

Open

stale bot added the wontfix label Jan 28, 2019

stale bot closed this as completed Feb 4, 2019

eocarragain mentioned this issue Aug 1, 2019

Use Case: Given a DOI, directly download & verify an RO-Crate ResearchObject/ro-crate#29

Open
