Add DataCite API #28

SebastinSanty · 2018-06-13T20:25:42Z

Integration tests will be added after your first review. Secondly, I am not able to get the URLs. Do you have an idea how to get them? There are some hints regarding resource-type etc.

oxinabox · 2018-06-14T05:30:37Z

I don't think it is actually possible to get a download URL out of datacite.
I kinda knew that going in.
This also of course means integration tests are not possible.
(Since resolving the URL correctly is most of what we are testing with those.)

Take a look at datacite/freya#2
where @mfenner is talking about providing it via content-negotiation for "application/zip"
but it is not done yet
(I believe datacite allows for content negotiation via URL as well as via header which is nice)

Right now I think our go is to provide 95% of the registration block,
i.e. everything apart from the URL and checksum,
then let the user go to the website (the DOI's landing page), find the link manually, and then edit the generated code.

Editing the generated code is already part of our normal usage anyway, as they likely want to change the datadep name and probably edit the message.

oxinabox

This is looking really cool.
Do you think we should check the schema at the top of the file?
I'm not sure.
I think maybe we should check it and issue a warning.
This is for v4,
it would work for v3 (which Figshare used), mostly, but not perfectly.
So maybe issuing a warning if it does not match would be the right thing to do.
Since it will make future debugging easier.

I am excited to get Zenodo support,
and more generally support for anything with a DOI issued by DataCite (since DataCite is one of the largest DOI issuers, particularly for data, that is great).

Though as DataCite can't do URLs this becomes an API of last resort.
Still pretty exciting.

Sadly it doesn't work for Figshare
Try e.g. 10.1371/journal.pone.0047999.t004
Figshare does support a slightly older version of the DataCite schema (e.g. https://figshare.com/articles/225779/1/citations/datacite)
but its DOIs do not hit anything on the DataCite API, so I guess it uses a different issuer.

I think making figshare work (along with or prior to making DataVerse work) is the next stage.
Maybe OAI-PMH will help us for that. Maybe not.

oxinabox · 2018-06-14T05:38:35Z

src/DataCite.jl

+
+function mainpage_url(repo::DataCite, dataname)
+    try
+        identifier = match(r"\b(10[.][0-9]{4,}(?:[.][0-9]+)*\/(?:(?![\"&\'<>])\S)+)\b", dataname).match


I am guessing this is the regex for a DOI?
Do we use that somewhere else too? (DataDryad?)
I think it would be good to make a function match_doi or similar that does this line
for code understandability

Yes, this is a regex for DOI. I included this to give flexibility to the user to put any url/text with DOI as DataCite supports them all. We use a regex in DataDryad, but that is very specific for a DOI for DataDryad only. Should I still make it into one function?

I think so, yes. I suspect infrastructurally DataDryad is not bound to always use DOI numbers with their prefix (I think they probably always will, but maybe not),
and if someone has said it is a DataDryad repo, and given a non-DataDryad DOI, then that is kinda on them.
We should at some point check the "Unhappy Path" when people do that, and give good error messages. Right now I don't think we do so anyway, even with distinguishing Dryad DOIs from DataCite DOIs.
In any case, see my point below that maybe this method can be deleted anyway once you have base_url?
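
For reference, a minimal sketch of what such a helper might look like. The name `match_doi` follows the suggestion in this thread; the constant `DOI_REGEX` and the choice to return `nothing` when no DOI is found are assumptions, not the code that was merged. The pattern is the one from the diff with the redundant escapes dropped.

```julia
# Hypothetical helper sketched from the review suggestion; not the merged code.
# Same pattern as in the diff, minus escapes that the regex does not need.
const DOI_REGEX = r"\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?![\"&'<>])\S)+)\b"

# Return the first DOI found anywhere in `text`, or `nothing` if there is none.
# `match` itself returns `nothing` on failure, so `.match` must not be called blindly.
function match_doi(text::AbstractString)
    m = match(DOI_REGEX, text)
    return m === nothing ? nothing : String(m.match)
end

# Works on a bare DOI, a doi.org link, or an API URL that contains a DOI.
match_doi("https://api.datacite.org/works/10.6084/m9.figshare.5350216.v1")
match_doi("not a doi")
```

Returning `nothing` instead of throwing keeps the "unhappy path" decision (error message or fallback) with the caller.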

oxinabox · 2018-06-14T05:56:44Z

src/DataCite.jl

+    author = format_authors(authors)
+    license = attributes["license"]
+    date = attributes["published"]
+    paper = format_papers(authors, date, attributes["title"] * " [Data set]. " * attributes["container-title"] * ".", mainpage["id"])


break this into two lines

Still need to break this in two
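
One way the split could look, as a sketch only. `format_papers` is copied from the `src/DataDepsGenerators.jl` hunk shown in this PR; the sample metadata, the `title` variable name, and `doi_link` (standing in for `mainpage["id"]`) are made up for illustration.

```julia
# Copied from the src/DataDepsGenerators.jl hunk in this PR.
function format_papers(authors::Vector, year::String, name::String, link::String)
    join(authors, ", ") * " ($year). " * name * " " * link
end

# Made-up sample metadata, only here so the sketch runs on its own.
attributes = Dict(
    "title" => "Some title",
    "container-title" => "Some repository",
    "published" => "2018",
)
authors = ["A. Author", "B. Author"]
doi_link = "https://doi.org/10.1234/example"  # stands in for mainpage["id"]

date = attributes["published"]
# First line: build the citation title. Second line: format the citation.
title = attributes["title"] * " [Data set]. " * attributes["container-title"] * "."
paper = format_papers(authors, date, title, doi_link)
```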

oxinabox · 2018-06-14T05:57:07Z

src/DataCite.jl

+    date = attributes["published"]
+    paper = format_papers(authors, date, attributes["title"] * " [Data set]. " * attributes["container-title"] * ".", mainpage["id"])
+
+    final = escape_multiline_string("""


There is no need to assign this to a variable

oxinabox · 2018-06-14T06:01:06Z

src/DataCite.jl

+end
+
+function website(repo::DataCite, mainpage_url)
+    replace(mainpage_url, "https://api.datacite.org/works/", "https://doi.org/")


I think you want to do

base_url(::DataCite) = "https://api.datacite.org/works/"

And use that both in website and mainpage_url.

Though don't we have a fallback for mainpage_url that basically does what you are doing below,
minus the validation that it is a DOI.
And doesn't that fallback use base_url?
In which case we can define base_url then remove the definition of mainpage_url entirely.

Right, what about the validation then? Should I remove it?

I think so yes.
I'm not sure if it is actually useful right now.
Is it?
What kind of mistakes is it catching I guess is the question

So, I implemented it so that we can provide anything, maybe the search url, or datacite api url, or just the DOI. In any case, it should be able to work.
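
A rough sketch of the `base_url` idea discussed in this thread. The two types are stand-ins for the package's own, and the fallback `mainpage_url` is an assumption about what the generic method does (accept a full API URL or a bare DOI), not a copy of it. The pair form of `replace` is the current Julia syntax; the diff in this PR uses the older three-argument form.

```julia
abstract type DataRepo end        # stand-in for the package's abstract type
struct DataCite <: DataRepo end   # stand-in for the generator type in this PR

# Single place that knows the API prefix.
base_url(::DataCite) = "https://api.datacite.org/works/"

# Landing page: swap the API prefix for the DOI resolver.
website(repo::DataCite, mainpage_url) =
    replace(mainpage_url, base_url(repo) => "https://doi.org/")

# Assumed shape of a generic fallback: pass full URLs through,
# otherwise prefix the given name (here a DOI) with the base URL.
mainpage_url(repo::DataRepo, dataname) =
    startswith(dataname, base_url(repo)) ? dataname : base_url(repo) * dataname

api_url = mainpage_url(DataCite(), "10.6084/m9.figshare.5350216.v1")
landing = website(DataCite(), api_url)
```

Unlike the regex-based method, a fallback of this shape would not accept arbitrary text that merely contains a DOI, which is the trade-off being weighed here.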

oxinabox · 2018-06-14T06:25:18Z

test/references/DataCite Fire Patch.txt

+	Pierre Laurent, Florent Mouillot, Chao Yue, Maria Vanesa Moreno Dominguez, Philippe Ciais, Joana M.P. Nogueira (2018). List of fire patch properties computed and associated NetCDF maps from the MCD64A1 Collection 6 (2000-2016) and the MERIS fire_cci v4.1 (2005-2011) BA products [Data set]. OSU OREME. https://doi.org/10.15148/0e999ffc-e220-41ac-ac85-76e92ecd0320
+	if you use this in your research.
+
+


I think we should add a strip in the top-level generate function, for all datadeps, to remove whitespace from the ends of the description.

oxinabox · 2018-06-14T06:28:54Z

src/DataCite.jl

+    author = format_authors(authors)
+    license = attributes["license"]
+    date = attributes["published"]
+    paper = format_papers(authors, date, attributes["title"] * " [Data set]. " * attributes["container-title"] * ".", mainpage["id"])


Anytime we have a DOI and want a citation we should do it via
the method discussed in #22 (comment)

This is fine for now,
but maybe after this PR make another one that goes through and replaces all paper formatting with something based on that kind of idea?

That's the reason I introduced the format_papers() method. I subsequently can add whatever format is required.

oxinabox · 2018-06-15T03:32:08Z

Hmm
what is actually going on with Figshare.
Looks like they do use DataCite generated DOIs (See https://stats.datacite.org/?fq=allocator_facet%3A%22FIGSHARE+-+figshare%22&#tab-datacentres)

And the following works:
https://api.datacite.org/works/10.6084/m9.figshare.5350216.v1

What does not is:

https://figshare.com/articles/_Comparison_of_SEHC_Trauma_Activation_Patients_and_SEHC_Trauma_Nonactivation_Patients_minimum_alcohol_and_illicit_drug_rates_/225779

Which is associated with the doi: 10.1371/journal.pone.0047999.t004
Which resolves to http://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0047999.t004
which is the same table, but on a different site.

So I am guessing figshare rehosted that existing data, with its existing DOI.
And so it was never issued a datacite DOI number, which means it does not work with their API.

CrossRef issued that DOI:
Their API is not so great for this
https://api.crossref.org/v1/works/10.1371/journal.pone.0047999.t004

I don't think we can content negotiate anything better
See https://citation.crosscite.org/docs.html
I tried a few.

It might be nice to support DOIs in general via the content-negotiation method.
But the things you can get out of any of the providers except DataCite seem less useful.
(Surprising really since we're only getting basic metadata. So maybe it is just this one entry (10.1371/journal.pone.0047999.t004) that has poor metadata)

SebastinSanty · 2018-06-15T17:17:24Z

So writing down whatever I understood and plan to implement based on your points. Please correct me if I am wrong:

In the case we don't get the target site on DataCite ("status": "404"), we'll cross-negotiate.
We'll use application/vnd.citationstyles.csl+json as the Accept: for the GET request we make. This gives a JSON as result for which we already have a provision to parse. There is no XML format combined for all of them (CrossCite, DataCite, mEDRA). We have RDF:XML, but I wouldn't prefer using that because that'll be another pain/overhead.
On getting the results, check for the source attribute. Accordingly, send a request to the source's API and get the final results for creating the register block

I faced an issue though, I tried doing content negotiation as described above. But unfortunately the content-negotiation results which came for DataCite didn't contain the source attribute. For cross-ref it is working properly.
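
For illustration, the request described above might be sketched as follows. The function names are hypothetical, and sending the request to `https://doi.org/` is an assumption based on the crosscite documentation linked earlier in the thread. `Downloads` is a standard library in Julia 1.6 and later; the fetch function needs network access and is only defined here, not called.

```julia
using Downloads  # standard library since Julia 1.6

const CSL_JSON = "application/vnd.citationstyles.csl+json"

# Pure helper: the URL and headers for a content-negotiated DOI request.
negotiation_request(doi::AbstractString) =
    ("https://doi.org/" * doi, ["Accept" => CSL_JSON])

# Performs the request and returns the raw response body as a String.
# Needs network access; parsing the JSON is left to the caller.
function fetch_csl_json(doi::AbstractString)
    url, headers = negotiation_request(doi)
    io = IOBuffer()
    Downloads.download(url, io; headers = headers)
    return String(take!(io))
end

url, headers = negotiation_request("10.1371/journal.pone.0047999.t004")
```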

oxinabox · 2018-06-16T02:26:23Z

> So writing down whatever I understood and plan to implement based on your points. Please correct me if I am wrong:

Good idea checking. I seem to have misled you.
#29 is a separate issue. It would be to create a different generator, call it DOI <: DataRepo.
Separately from what you've made here, DataCite <: DataRepo.
Like how we have many ways to generate for DataDryad (DataCite, DataDryad, DataDryadWeb),
a DOI generator would be an alternative.
If it is a good one (which I think it can be) it could mean that we delete the current DataCite generator just to save on maintenance.

The goal of this PR #28 is to add DataCite support, it has done that successfully (well no URL, I suspected that wasn't going to be possible).
Once you fix up the few small things discussed in the review, then this should be good to merge.

#29 may or may not be the best next issue to pursue after this one.
I'd like to see full support for Figshare and DataVerse.
OAI-PMH is one path that might do it (though I suspect it also won't let us actually get download URLs)
#30 will do figshare (and others) fully but not DataVerse.

BTW: cross-negotiate isn't a term that I am familiar with. I think you mean content negotiate

oxinabox · 2018-06-16T02:26:29Z

For This PR. Something I think I missed in the code-review before:

it should display some kind of info("DataCite based generation can only generate partial registration blocks, as DataCite metadata does not (currently) include the URL to the resource. You will have to edit in the URL after generation.")
And it should probably put, in place of the URL, something like "PUT DOWNLOAD URL HERE".
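
A sketch of how that could look in `get_urls`. The `DataCite` type is a stand-in and the constant name is made up; the message and placeholder text are the ones proposed above. `@info` is the current Julia logging macro, whereas the Julia version this PR targets would use `info(...)`.

```julia
struct DataCite end  # stand-in for the generator type in this PR

const URL_PLACEHOLDER = "PUT DOWNLOAD URL HERE"

# DataCite metadata has no download URL, so tell the user and
# return a placeholder that is easy to spot in the generated block.
function get_urls(repo::DataCite, page)
    @info "DataCite based generation can only generate partial registration blocks, as DataCite metadata does not (currently) include the URL to the resource. You will have to edit in the URL after generation."
    return [URL_PLACEHOLDER]
end

urls = get_urls(DataCite(), nothing)
```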

Looks like the test failures are something to do with the Github generator breaking.

SebastinSanty · 2018-06-16T16:15:05Z

It's good that I asked before implementing it in the PR; it saved some work of removing it.

SebastinSanty · 2018-06-18T16:38:35Z

Need to merge #31 before this.

oxinabox · 2018-06-19T10:43:52Z

src/DataDepsGenerators.jl

@@ -114,6 +114,11 @@ function format_papers(authors::Vector, year::String, name::String, link::String
    join(authors, ", ") * " ($year). " * name * " " * link
 end

+function check_dois(uri::String)


doi singular, as it is only one.

and check isn't right.
Check implies that this returns true if something is a DOI, and false if not.

extract, try_extract or match (since this corresponds to the regex), are better.
Maybe get_match

oxinabox · 2018-06-19T10:47:35Z

src/DataCite.jl

@@ -38,13 +40,13 @@ function data_fullname(::DataCite, mainpage)
 end

 function website(repo::DataCite, mainpage_url)
-    replace(mainpage_url, "https://api.datacite.org/works/", "https://doi.org/")
+    replace(mainpage_url, base_url(repo), "https://doi.org/")
 end

 function mainpage_url(repo::DataCite, dataname)


For my own future reference,
this is in fact different in functionality to
https://github.com/oxinabox/DataDepsGenerators.jl/blob/master/src/DataDepsGenerators.jl#L113-L122

This one takes anything that contains a DOI, e.g. some other URL, and gets the DOI from it.
Whereas that one only works if the correct full URL, or just a DOI, is passed.

SebastinSanty · 2018-06-21T15:28:59Z

@oxinabox Ready to be merged if you don't have any further review comments.

oxinabox

Looks like some things brought up earlier got forgotten in data discussions and in sorting out bugs.
Just small things so should be trivial to fix

oxinabox · 2018-06-21T16:15:28Z

src/DataCite.jl

+    author = format_authors(authors)
+    license = attributes["license"]
+    date = attributes["published"]
+    paper = format_papers(authors, date, attributes["title"] * " [Data set]. " * attributes["container-title"] * ".", mainpage["id"])


Still need to break this in two

oxinabox · 2018-06-21T16:19:40Z

src/DataCite.jl

+end
+
+function get_urls(repo::DataCite, page)
+    urls = []


Produce an info here saying that the DataCite API does not provide the URL, so the URL must be added manually afterwards.

And add a dummy output like "URL HERE", maybe

oxinabox · 2018-06-21T16:23:15Z

src/DataCite.jl

+    Please cite this paper:
+    $(paper)
+    if you use this in your research.
+


Delete blank line

codecov-io · 2018-06-21T16:39:06Z

Codecov Report

Merging #28 into master will increase coverage by 0.15%.
The diff coverage is 95.65%.

@@            Coverage Diff             @@
##           master      #28      +/-   ##
==========================================
+ Coverage   93.93%   94.09%   +0.15%     
==========================================
  Files          13       14       +1     
  Lines         231      254      +23     
==========================================
+ Hits          217      239      +22     
- Misses         14       15       +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/DataDepsGenerators.jl | `94.28% <100%> (+0.53%)` | ⬆️ |
| src/DataCite.jl | `95% <95%> (ø)` | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a004321...b6ca9d8. Read the comment docs.

oxinabox

If tests pass then all good.
Nice work

oxinabox reviewed Jun 14, 2018

oxinabox mentioned this pull request Jun 15, 2018

Support *all* DOIs? (CrossRef and DataCite, at least), via Content Negotiation for RDF? #29

Closed

oxinabox reviewed Jun 19, 2018

SebastinSanty added 3 commits June 21, 2018 12:17

Add DataCite API

3aad3cb

Make suggested changes

bb67a6e

Change to match_doi

17c8e14

SebastinSanty force-pushed the datacite branch from c62c626 to 17c8e14 June 21, 2018 11:33

oxinabox reviewed Jun 21, 2018

Fix suggested changes

b6ca9d8

oxinabox approved these changes Jun 21, 2018

SebastinSanty merged commit e3fdf55 into oxinabox:master Jun 21, 2018
