Add DataCite API #28

Merged
merged 4 commits into from
Jun 21, 2018

Conversation

SebastinSanty
Collaborator

@SebastinSanty SebastinSanty commented Jun 13, 2018

Integration tests to be added after your first review. Secondly, I am not able to get the URLs. Do you have an idea how to get them? There are some hints regarding resource-type etc.

@oxinabox
Owner

oxinabox commented Jun 14, 2018

I don't think it is actually possible to get a download URL out of DataCite.
I kinda knew that going in.
This of course also means integration tests are not possible.
(Since resolving the URL correctly is most of what we are testing with those.)

Take a look at datacite/freya#2,
where @mfenner is talking about providing it via content negotiation for "application/zip",
but it is not done yet.
(I believe DataCite allows for content negotiation via URL as well as via header, which is nice.)

Right now I think our best option is to provide 95% of the registration block,
i.e. everything apart from the URL and checksum,
then let the user go to the website (the DOI's landing page), find the link manually, and then edit the generated code.

Editing the generated code is already part of our normal usage anyway, as they likely want to change the datadep name and probably edit the message.

Owner

@oxinabox oxinabox left a comment

This is looking really cool.
Do you think we should check the schema at the top of the file?
I'm not sure.
I think maybe we should check it and issue a warning.
This is for v4;
it would mostly work for v3 (which Figshare used), but not perfectly.
So maybe issuing a warning when it does not match would be the right thing to do.
Since it will make future debugging easier.
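
A minimal sketch of the warn-on-mismatch idea, not the PR's code. The helper name `check_schema`, the idea that the schema identifier is available as a string, and the `"kernel-4"` marker are all assumptions made for illustration. It is written for Julia 1.x (`@warn`).

```julia
# Hypothetical helper: `schema` is whatever schema identifier the metadata
# carries; the "kernel-4" marker is an assumption for illustration.
function check_schema(schema::AbstractString; expected::AbstractString="kernel-4")
    matches = occursin(expected, schema)
    if !matches
        # Warn but do not throw, so generation can continue on older schemas.
        @warn "Unexpected DataCite schema; generated metadata may be incomplete" schema expected
    end
    return matches
end

check_schema("http://datacite.org/schema/kernel-4")  # no warning
check_schema("http://datacite.org/schema/kernel-3")  # warns, generation continues
```

Returning a `Bool` rather than throwing keeps the v3 case usable while still leaving a trace for later debugging.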

I am excited to get Zenodo support,
and more generally support for anything with a DOI issued by DataCite (since DataCite is one of the largest DOI issuers, particularly for data, that is great).

Though as DataCite can't do URLs, this becomes an API of last resort.
Still pretty exciting.

Sadly it doesn't work for Figshare.
Try e.g. 10.1371/journal.pone.0047999.t004
Figshare does support a slightly older version of the DataCite schema (e.g. https://figshare.com/articles/225779/1/citations/datacite),
but its DOIs do not hit anything on the DataCite API, so I guess it uses a different issuer.

I think making figshare work (along with or prior to making DataVerse work) is the next stage.
Maybe OAI-PMH will help us for that. Maybe not.

src/DataCite.jl Outdated

function mainpage_url(repo::DataCite, dataname)
try
identifier = match(r"\b(10[.][0-9]{4,}(?:[.][0-9]+)*\/(?:(?![\"&\'<>])\S)+)\b", dataname).match
Owner

I am guessing this is the regex for a DOI?
Do we use that somewhere else too? (DataDryad?)
I think it would be good to make a function match_doi or similar that does this line,
for code understandability.

Collaborator Author

Yes, this is a regex for DOI. I included this to give flexibility to the user to put any url/text with DOI as DataCite supports them all. We use a regex in DataDryad, but that is very specific for a DOI for DataDryad only. Should I still make it into one function?

Owner

I think so, yes. I suspect that infrastructurally DataDryad is not bound to always use DOIs with their prefix (I think they probably always will, but maybe not),
and if someone has said it is a DataDryad repo and given a non-DataDryad DOI, then that is kinda on them.
We should at some point check the "Unhappy Path" when people do that, and give good error messages. Right now I don't think we do so anyway, even with distinguishing Dryad DOIs from DataCite DOIs.
In any case, see my point below that maybe this method can be deleted anyway once you have base_url?

src/DataCite.jl Outdated
author = format_authors(authors)
license = attributes["license"]
date = attributes["published"]
paper = format_papers(authors, date, attributes["title"] * " [Data set]. " * attributes["container-title"] * ".", mainpage["id"])
Owner

break this into two lines

Owner

Still need to break this in two
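
One way the long line could be split, as a sketch only. `format_papers` is copied from the diff in this PR; the metadata values and the intermediate variable `name` are placeholders invented for illustration.

```julia
# Copied from the diff in this PR.
function format_papers(authors::Vector, year::String, name::String, link::String)
    join(authors, ", ") * " ($year). " * name * " " * link
end

# Placeholder metadata, invented for illustration only.
attributes = Dict("title" => "Example data", "container-title" => "Example Repository")
mainpage = Dict("id" => "https://doi.org/10.1234/example")
authors = ["A. Author", "B. Author"]
date = "2018"

# First line builds the title portion, second line formats the citation.
name = attributes["title"] * " [Data set]. " * attributes["container-title"] * "."
paper = format_papers(authors, date, name, mainpage["id"])
```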

src/DataCite.jl Outdated
date = attributes["published"]
paper = format_papers(authors, date, attributes["title"] * " [Data set]. " * attributes["container-title"] * ".", mainpage["id"])

final = escape_multiline_string("""
Owner

There is no need to assign this to a variable

src/DataCite.jl Outdated
end

function website(repo::DataCite, mainpage_url)
replace(mainpage_url, "https://api.datacite.org/works/", "https://doi.org/")
Owner

I think you want to do

base_url(::DataCite) = "https://api.datacite.org/works/"

And use that both in website and mainpage_url.

Though don't we have a fallback for mainpage_url that basically does what you are doing below,
minus the validation that it is a DOI?
And doesn't that fallback use base_url?
In which case we can define base_url then remove the definition of mainpage_url entirely.
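
A sketch of how the suggestion could look, written for Julia 1.x (hence `replace` with a pair, where the PR's code uses the older three-argument form). The `DataRepo` stand-in type and the generic `mainpage_url` fallback are assumptions made so the example is self-contained; the package's real fallback may differ.

```julia
abstract type DataRepo end          # stand-in for the package's abstract type
struct DataCite <: DataRepo end

base_url(::DataCite) = "https://api.datacite.org/works/"

# Hypothetical generic fallback: accept either a full API URL or a bare identifier.
function mainpage_url(repo::DataRepo, dataname::AbstractString)
    startswith(dataname, base_url(repo)) ? String(dataname) : base_url(repo) * dataname
end

# The landing page is the API URL with the base swapped for the DOI resolver.
website(repo::DataCite, url::AbstractString) = replace(url, base_url(repo) => "https://doi.org/")
```

With `base_url` defined once, both functions stay in sync if the API endpoint ever changes.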

Collaborator Author

@SebastinSanty SebastinSanty Jun 15, 2018

Right, what about the validation then? Should I remove it?

Owner

@oxinabox oxinabox Jun 16, 2018

I think so yes.
I'm not sure if it is actually useful right now.
Is it?
What kind of mistakes is it catching I guess is the question

Collaborator Author

So, I implemented it so that we can provide anything, maybe the search url, or datacite api url, or just the DOI. In any case, it should be able to work.

Pierre Laurent, Florent Mouillot, Chao Yue, Maria Vanesa Moreno Dominguez, Philippe Ciais, Joana M.P. Nogueira (2018). List of fire patch properties computed and associated NetCDF maps from the MCD64A1 Collection 6 (2000-2016) and the MERIS fire_cci v4.1 (2005-2011) BA products [Data set]. OSU OREME. https://doi.org/10.15148/0e999ffc-e220-41ac-ac85-76e92ecd0320
if you use this in your research.


Owner

I think the top-level generate function should, for all datadeps, call strip to remove whitespace from the ends of the description.
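
A small sketch of that clean-up step. The helper name `clean_description` is hypothetical; where exactly it would be called in the generate function is not shown in this conversation.

```julia
# Hypothetical post-processing step for a generated description:
# strip removes leading and trailing whitespace, including newlines.
clean_description(description::AbstractString) = String(strip(description))

clean_description("Please cite this paper.\n\n\n")
```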

src/DataCite.jl Outdated
author = format_authors(authors)
license = attributes["license"]
date = attributes["published"]
paper = format_papers(authors, date, attributes["title"] * " [Data set]. " * attributes["container-title"] * ".", mainpage["id"])
Owner

Any time we have a DOI and want a citation, we should do it via
the method discussed in #22 (comment).

This is fine for now,
but maybe after this PR make another one that goes through and replaces all paper formatting with something based on that kind of idea?

Collaborator Author

That's the reason I introduced the format_papers() method. I subsequently can add whatever format is required.

@oxinabox
Owner

Hmm,
what is actually going on with Figshare?
Looks like they do use DataCite generated DOIs (See https://stats.datacite.org/?fq=allocator_facet%3A%22FIGSHARE+-+figshare%22&#tab-datacentres)

And the following works:
https://api.datacite.org/works/10.6084/m9.figshare.5350216.v1

What does not is:

https://figshare.com/articles/_Comparison_of_SEHC_Trauma_Activation_Patients_and_SEHC_Trauma_Nonactivation_Patients_minimum_alcohol_and_illicit_drug_rates_/225779

Which is associated with the doi: 10.1371/journal.pone.0047999.t004
Which resolves to http://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0047999.t004
which is the same table, but on a different site.

So I am guessing figshare rehosted that existing data, with its existing DOI.
And so it was never issued a datacite DOI number, which means it does not work with their API.

CrossRef issued that DOI.
Their API is not so great for this:
https://api.crossref.org/v1/works/10.1371/journal.pone.0047999.t004

I don't think we can content negotiate anything better
See https://citation.crosscite.org/docs.html
I tried a few.

It might be nice to support DOIs in general via the content-negotiation method.
But the things you can get out of any of the providers except DataCite seem less useful.
(Surprising really, since we're only getting basic metadata. So maybe it is just this one entry (10.1371/journal.pone.0047999.t004) that has poor metadata.)

@SebastinSanty
Collaborator Author

So writing down whatever I understood and plan to implement based on your points. Please correct me if I am wrong:

  • In the case we don't get the target site on DataCite ("status": "404"), we'll cross-negotiate.
  • We'll use application/vnd.citationstyles.csl+json as the Accept header for the GET request we make. This returns JSON, which we already have a provision to parse. There is no XML format common to all of them (CrossCite, DataCite, mEDRA). We have RDF/XML, but I would prefer not to use that because it would be another pain/overhead.
  • On getting the results, check for the source attribute. Accordingly, send a request to the source's API and get the final results for creating the register block

I faced an issue though: I tried doing content negotiation as described above, but unfortunately the content-negotiation results for DataCite didn't contain the source attribute. For Crossref it is working properly.
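
The content-negotiation request described in the list above could be sketched as follows. This only builds the URL and header and sends nothing; the helper name `csl_request` is hypothetical, and using `https://doi.org/` as the resolver endpoint is an assumption based on how DOI content negotiation is normally done.

```julia
const CSL_JSON = "application/vnd.citationstyles.csl+json"

# Returns the URL and headers for a content-negotiated DOI lookup.
function csl_request(doi::AbstractString)
    url = "https://doi.org/" * doi
    headers = ["Accept" => CSL_JSON]
    return url, headers
end

url, headers = csl_request("10.1371/journal.pone.0047999.t004")
# With HTTP.jl installed (an assumption), the request could be sent as:
#   response = HTTP.get(url, headers)
```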

@oxinabox
Owner

So writing down whatever I understood and plan to implement based on your points. Please correct me if I am wrong:

Good idea checking. I seem to have misled you.
#29 is a separate issue. It would be to create a different generator, call it DOI <: DataRepo,
separate from what you've made here, DataCite <: DataRepo.
Like how we have many ways to generate for DataDryad (DataCite, DataDryad, DataDryadWeb),
a DOI generator would be an alternative.
If it is a good one (which I think it can be), it could mean that we delete the current DataCite generator just to save on maintenance.

The goal of this PR (#28) is to add DataCite support, and it has done that successfully (well, no URL, but I suspected that wasn't going to be possible).
Once you fix up the few small things discussed in the review, this should be good to merge.

#29 may or may not be the best next issue to pursue after this one.
I'd like to see full support for Figshare and DataVerse.
OAI-PMH is one path that might do it (though I suspect it also won't let us actually get download URLs).
#30 will do Figshare (and others) fully but not DataVerse.

BTW: cross-negotiate isn't a term that I am familiar with. I think you mean content negotiate

@oxinabox
Owner

oxinabox commented Jun 16, 2018

For this PR, something I think I missed in the code review before:

it should display some kind of info("DataCite based generation can only generate partial registration blocks, as DataCite metadata does not (currently) include the URL to the resource. You will have to edit in the URL after generation.")
And in place of the URL it should probably stick something like "PUT DOWNLOAD URL HERE".

Looks like the test failures are something to do with the GitHub generator breaking.

@SebastinSanty
Collaborator Author

It's good that I asked before implementing it in the PR; that saved some work of removing it.

@SebastinSanty
Collaborator Author

Need to merge #31 before this.

@@ -114,6 +114,11 @@ function format_papers(authors::Vector, year::String, name::String, link::String
join(authors, ", ") * " ($year). " * name * " " * link
end

function check_dois(uri::String)
Owner

doi should be singular, as it is only one.

And check isn't right.
Check implies that this returns true if something is a DOI, and false if not.

extract, try_extract or match (since this corresponds to the regex) are better.
Maybe get_match
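
A sketch of what such an extraction helper could look like under the suggested naming. The name `match_doi` comes from the earlier review comment; the pattern here is a simplified variant of the regex in this PR, not a copy of it, and returning `nothing` when no DOI is found is an assumption about the desired behaviour.

```julia
# Simplified DOI pattern, loosely based on the regex in this PR.
const DOI_REGEX = r"\b10\.[0-9]{4,}(?:\.[0-9]+)*/[^\s\"&'<>]+\b"

# Extract the first DOI found in `text`, or return `nothing` if there is none.
function match_doi(text::AbstractString)
    m = match(DOI_REGEX, text)
    return m === nothing ? nothing : String(m.match)
end

match_doi("https://api.datacite.org/works/10.6084/m9.figshare.5350216.v1")
match_doi("no identifier here")
```

Returning the extracted value (rather than a `Bool`) is what separates an extract-style function from a check-style one, which is the naming point made above.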

src/DataCite.jl Outdated
@@ -38,13 +40,13 @@ function data_fullname(::DataCite, mainpage)
end

function website(repo::DataCite, mainpage_url)
- replace(mainpage_url, "https://api.datacite.org/works/", "https://doi.org/")
+ replace(mainpage_url, base_url(repo), "https://doi.org/")
end

function mainpage_url(repo::DataCite, dataname)
Owner

For my own future reference,
this is in fact different in functionality to
https://github.com/oxinabox/DataDepsGenerators.jl/blob/master/src/DataDepsGenerators.jl#L113-L122

This one takes anything that contains a DOI, e.g. some other URL, and gets the DOI from it,
whereas that one only works if the correct full URL, or just a DOI, is passed.

@SebastinSanty
Collaborator Author

@oxinabox Ready to be merged if you don't have any further review comments.

Owner

@oxinabox oxinabox left a comment

Looks like some things brought up earlier got forgotten in data discussions and in sorting out bugs.
Just small things so should be trivial to fix

src/DataCite.jl Outdated
author = format_authors(authors)
license = attributes["license"]
date = attributes["published"]
paper = format_papers(authors, date, attributes["title"] * " [Data set]. " * attributes["container-title"] * ".", mainpage["id"])
Owner

Still need to break this in two

src/DataCite.jl Outdated
end

function get_urls(repo::DataCite, page)
urls = []
Owner

Produce an info message here saying that the DataCite API does not support URLs, so the URL must be added manually afterwards.

And add a dummy output like "URL HERE", maybe
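
A sketch of that suggestion, written for Julia 1.x (`@info` rather than the older `info(...)` call). The message text and placeholder string are taken from the earlier comment in this conversation; the `DataRepo` stand-in type and the constant name `URL_PLACEHOLDER` are assumptions added so the example runs on its own.

```julia
abstract type DataRepo end          # stand-in for the package's abstract type
struct DataCite <: DataRepo end

const URL_PLACEHOLDER = "PUT DOWNLOAD URL HERE"

function get_urls(repo::DataCite, page)
    # Tell the user up front that the generated block is incomplete.
    @info "DataCite based generation can only generate partial registration blocks, as DataCite metadata does not (currently) include the URL to the resource. You will have to edit in the URL after generation."
    # Return a visible placeholder so the missing URL is easy to find and edit.
    return [URL_PLACEHOLDER]
end
```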

src/DataCite.jl Outdated
Please cite this paper:
$(paper)
if you use this in your research.

Owner

Delete blank line

@codecov-io

Codecov Report

Merging #28 into master will increase coverage by 0.15%.
The diff coverage is 95.65%.

@@            Coverage Diff             @@
##           master      #28      +/-   ##
==========================================
+ Coverage   93.93%   94.09%   +0.15%     
==========================================
  Files          13       14       +1     
  Lines         231      254      +23     
==========================================
+ Hits          217      239      +22     
- Misses         14       15       +1
Impacted Files Coverage Δ
src/DataDepsGenerators.jl 94.28% <100%> (+0.53%) ⬆️
src/DataCite.jl 95% <95%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a004321...b6ca9d8. Read the comment docs.

Owner

@oxinabox oxinabox left a comment

If tests pass then all good.
Nice work

@SebastinSanty SebastinSanty merged commit e3fdf55 into oxinabox:master Jun 21, 2018