Direct access to content associated with a DOI #2
Possibly relevant: https://github.com/UTS-eResearch/datacrate by @ptsefton. |
Data.gov does this using the 'resources' field of Project Open Data, which includes a 'downloadURL': https://project-open-data.cio.gov/v1.1/schema/#distribution-downloadURL. They specifically distinguish downloadURL from accessURL, which I think is very helpful. |
Thanks @maxogden. Schema.org does something similar with http://schema.org/url and http://schema.org/contentUrl - the latter would need an update of the documentation, as it is relevant not only to media objects. |
As noted in an email from @mfenner, it could be good to think about this from the (asymmetrical) view of the packager as well as the consumer. The question of who does the packaging, and what their motivation is, may be quite important in terms of the quality and completeness of the metadata that can be provided. As noted above, the work on datacrate is interesting in this space. I wrote up on my blog some experiences of how challenging it can be to package things (http://cameronneylon.net/blog/walking-the-walk-how-easily-can-a-whole-project-be-shared-and-connected/ http://cameronneylon.net/blog/packaging-data-the-core-problem-in-general-data-sharing/). There are some good opportunities to think about how this might fit into a general workflow with provenance, with metadata being created as the researcher goes. |
@mfenner is there any scope here to include supplementary files associated with journal articles? Or only works that are datasets themselves? |
I think the focus is on DOIs for datasets, but the same process should work for other content types. Supplementary files would be good, and I can ask Figshare whether they are interested. |
PLOS is a special case, as they make heavy use of Crossref component DOIs for figures, tables and supplementary files. Will ask Crossref for advice. |
Perhaps sort of relevant, at European Southern Observatory we are using a Link header with rel="alternate" on our landing pages. (So far this is for a single resource URL for the DOI... I suppose multiple rel="alternate" Link headers are allowed?) This came from those FORCE11 recommendations I think. Besides being machine-readable, you can also fetch it with a HEAD request. (We use this internally to monitor, with a cron job, that landing pages and resource URLs resolve with a 200.) An example: https://doi.org/10.18727/0722-6691/5053 |
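For anyone wanting to try the Link header approach, here is a minimal Python sketch of reading `rel="alternate"` targets from a landing page with a HEAD request. RFC 8288 allows several comma-separated link values in one header, as well as repeated Link headers. The helper names (`parse_link_header`, `alternate_urls`, `fetch_link_headers`) are made up for illustration, and the parser is deliberately simple: it assumes parameter values do not themselves contain commas or semicolons.

```python
import re
import urllib.request

# One link-value looks like: <https://example.org/file.pdf>; rel="alternate"; type="application/pdf"
# Simplification (assumption): quoted parameter values contain no "," or ";".
_LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)')


def parse_link_header(value):
    """Return a list of (url, params) tuples parsed from a Link header value."""
    links = []
    for url, raw_params in _LINK_RE.findall(value):
        params = {}
        for part in raw_params.split(";"):
            if "=" in part:
                key, _, val = part.strip().partition("=")
                params[key.strip().lower()] = val.strip().strip('"')
        links.append((url, params))
    return links


def alternate_urls(value):
    """Keep only the link targets whose rel attribute includes 'alternate'."""
    return [url for url, params in parse_link_header(value)
            if "alternate" in params.get("rel", "").split()]


def fetch_link_headers(landing_url):
    """HEAD the landing page (redirects are followed) and join all Link headers."""
    request = urllib.request.Request(landing_url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return ", ".join(response.headers.get_all("Link") or [])
```

Something like `alternate_urls(fetch_link_headers("https://doi.org/10.18727/0722-6691/5053"))` would then list the resource URLs, provided the server still sends the header described above.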
Thanks @dojobo. I think this aligns well with the recommendations at http://signposting.org/. |
Also note the citation_pdf_url convention used by crawlers such as Google Scholar, OADOI/Unpaywall and CORE to harvest the actual "data"/PDFs from publishers and institutional repositories. So, for example, a crawler may follow a Handle or Crossref DOI to an HTML landing page and then look for a meta element in the HTML header to identify the primary bitstream associated with the resource.
signposting.org is a much better reference point, but this is worth noting for completeness' sake. Google Scholar: https://scholar.google.com/intl/en/scholar/inclusion.html#indexing |
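To make the crawler behaviour concrete, here is a small sketch of how a harvester might pull `citation_pdf_url` values out of a landing page using only the Python standard library. The class and function names are hypothetical, and real crawlers will apply more rules than this.

```python
from html.parser import HTMLParser


class CitationPdfFinder(HTMLParser):
    """Collect the content of every <meta name="citation_pdf_url"> element."""

    def __init__(self):
        super().__init__()
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <meta ... /> tags via handle_startendtag.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "citation_pdf_url":
            content = attributes.get("content")
            if content:
                self.pdf_urls.append(content)


def find_citation_pdf_urls(html_text):
    """Return all citation_pdf_url values found in the given HTML text."""
    finder = CitationPdfFinder()
    finder.feed(html_text)
    return finder.pdf_urls
```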
In schema.org the relevant attribute would be contentUrl. |
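As a rough illustration of the schema.org route, the sketch below reads `contentUrl` from the `distribution` of a Dataset description in JSON-LD. The record, its DOI and its URLs are invented placeholders, and `content_urls` is a hypothetical helper, not part of any DataCite or schema.org tooling.

```python
import json

# Invented example record; the DOI and URLs are placeholders, not real resources.
EXAMPLE_JSONLD = json.loads("""
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "@id": "https://doi.org/10.1234/example",
  "url": "https://example.org/landing/123",
  "distribution": [
    {"@type": "DataDownload",
     "encodingFormat": "application/zip",
     "contentUrl": "https://example.org/files/123.zip"}
  ]
}
""")


def content_urls(dataset):
    """Return the contentUrl of each distribution entry, skipping entries without one."""
    distribution = dataset.get("distribution") or []
    if isinstance(distribution, dict):
        # JSON-LD allows a single object where a list is otherwise expected.
        distribution = [distribution]
    return [item["contentUrl"] for item in distribution
            if isinstance(item, dict) and item.get("contentUrl")]
```

Note that `url` stays pointed at the landing page while `contentUrl` carries the direct download, which mirrors the accessURL/downloadURL split mentioned earlier in the thread.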
See also draft recommendation of the RDA PID Kernel Information WG (https://www.rd-alliance.org/groups/pid-kernel-information-wg): https://docs.google.com/document/d/1EdS5OCoEWd4VY0HNLHkhzdQojsRgc3P8aWXYQKTqs8M/edit Note especially:
|
@eocarragain thanks, this aligns with our thinking, and with the work we are doing on this in the NIH Data Commons. |
It is also worth keeping content-addressed protocols like ipfs.io (and even things like magnet links) in mind. The main thing is to avoid a recommendation which rules these protocols out, as DOI/DataCite could potentially be a nice bridge to these p2p, content-addressed networks in that it provides a trusted mutable record which can point to an immutable content address. Having said that, I don't see anything above which would rule them out as long as the content-addressed protocol can be expressed as a URL scheme. For IPFS, the DataCite contentURL could point to a single file or a huge directory of content using something like ipfs://babybeiccrv3uc3hjipdnwf4nnntbxuwvt4pn5dsgelvvyueucracbevtha . The content can then be retrieved from any IPFS node which has some or all of the content, and since the "content identifier" string contains the hash of the entire content, verification is built in as part of the protocol (links to some relevant specs: CIDs, Multiformats). |
More generally, to what extent is providing a way to verify downloaded content a requirement (in scope) here, or is the goal only to provide a direct link to a resolvable download? It is somewhat covered by Bagit and by the etag attribute in the RDA PID Kernel document. |
We should clarify the scope of this issue, which is what we can provide via DOI metadata and DOI services. We are not planning to go beyond one or more URLs and checksums. The protocols used for file downloads, the verification of downloaded content, and permissions are all out of scope. Once contentURLs as part of DOI metadata have become the norm, or at least have seen significant uptake, we can start that discussion. |
Sounds good & makes sense. Might be worth breaking this issue into user stories as is happening in recent issues. Cheers |
Here's an example of how to get the URL for a file from Zenodo (using jq to show just the first file):
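A rough Python equivalent of that lookup is sketched below. It assumes that the Zenodo record API returns a JSON object with a `files` array whose entries carry a `links.self` URL; that response shape, the `api_base` default and both function names are assumptions for illustration rather than anything confirmed in this thread.

```python
import json
import urllib.request


def first_file_url(record):
    """Return the download URL of the first file in a record, or None if absent."""
    files = record.get("files") or []
    if not files:
        return None
    # Assumed shape: {"files": [{"links": {"self": "<download URL>"}}, ...]}
    return files[0].get("links", {}).get("self")


def fetch_record(record_id, api_base="https://zenodo.org/api/records"):
    """Fetch a record as JSON; the endpoint layout is an assumption."""
    with urllib.request.urlopen(f"{api_base}/{record_id}") as response:
        return json.load(response)
```

Usage would be along the lines of `first_file_url(fetch_record(record_id))`, with `record_id` taken from the Zenodo record in question.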
Thank you @cboettig for linking to this conversation from whole-tale/whole-tale#35 (comment). Hat tip to @jggautier for mentioning the Zenodo record above at IQSS/dataverse#4371 |
Thanks @pdurbin. This is functionality provided by Zenodo, unfortunately the contentURL that they provide is not yet part of the DOI metadata they send. We will work with DataCite repositories to provide that information to DataCite so that we can include the content URL in the DOI metadata. |
This also highlights the question of whether the contentURL in the PID metadata should reference the whole thing (e.g. via the packaging formats referenced above or a simple tar/zip file) or whether an array of contentURLs, one for each file, is allowed. My vote would be for the former (with bonus points if the package contains all relevant PID metadata too), but this could be harder to get adoption/consensus on. |
My goal is a contentURL referencing a single file, ideally a bagit archive that also includes metadata. I have run into a use case where I need to support multiple contentURLs - the same content in multiple cloud locations (AWS, Google Cloud), but that is an edge case. |
For reference, discussion of the "identifier for digital objects" PID schema being adopted by the Software Heritage archive: https://hal.archives-ouvertes.fr/hal-01865790v4 . Includes using hashes for ensuring the integrity of resolved content. |
Also for reference, see MINIDs: http://minid.bd2k.org/. In addition to an array of locations/urls, the json response has fields for "checksum" and "checksum_function" and "content_key". |
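Given a "checksum" and "checksum_function" pair like the one described for MINIDs, a consumer could verify a download roughly as follows. This is a sketch under the assumption that the function name maps onto a `hashlib` algorithm (for example "SHA-256" becoming `sha256`) and that the checksum is a hex digest; `verify_chunks` is a hypothetical helper.

```python
import hashlib


def verify_chunks(chunks, checksum, checksum_function="sha256"):
    """Hash an iterable of byte chunks and compare against the expected hex digest.

    Assumes the function name maps to a hashlib algorithm once lower-cased
    and stripped of hyphens, e.g. "SHA-256" -> "sha256".
    """
    algorithm = checksum_function.lower().replace("-", "")
    hasher = hashlib.new(algorithm)
    for chunk in chunks:
        hasher.update(chunk)
    return hasher.hexdigest() == checksum.strip().lower()
```

Taking an iterable of chunks rather than one bytes object means a large dataset can be checked while it streams from any of the listed locations, without holding it all in memory.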
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
As a user, I want to be able to download the content associated with a DOI without first going to a landing page, so that I can quickly download a large number of datasets.
DEV NOTES