added option to consolidate using doi only. #742

Merged
kermitt2 merged 11 commits into master from option-consolidate-with-doi-only
Oct 21, 2022

Conversation

Aazhar
Collaborator

@Aazhar Aazhar commented Apr 16, 2021

Hello

When consolidation is enabled and no DOI is present in the document, we fall back to fuzzy matching. This is very helpful in most cases, but in some cases it gives results that are hard to understand from the point of view of general users, because the outcome depends first on the quality of the metadata extraction and especially on the consolidation service.

I've been running the CrossRef API benchmark, and it seems that the matching scores have decreased, so I think we should be able to consolidate using the DOI only. Here are some examples:

For this one, the original title is in French; in the resulting TEI, the title is in English, and a DOI and a journal have been added.

exemple1.pdf

For this one, it is obvious that the result has nothing to do with the pdf content:

exemple3.pdf
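
The fallback described at the top of this comment, and the proposed way to switch it off, can be sketched as follows. This is a minimal illustration in Python, not GROBID's actual code: `Strategy`, `choose_strategy` and the `doi_only` flag are hypothetical names introduced here.

```python
from enum import Enum
from typing import Optional


class Strategy(Enum):
    DOI_LOOKUP = "doi_lookup"    # exact lookup by the DOI found in the document
    FUZZY_MATCH = "fuzzy_match"  # metadata-based matching against the service
    SKIP = "skip"                # leave the extracted metadata untouched


def choose_strategy(extracted_doi: Optional[str], doi_only: bool) -> Strategy:
    """Pick a consolidation strategy for one document (hypothetical helper)."""
    if extracted_doi and extracted_doi.strip():
        return Strategy.DOI_LOOKUP
    # No DOI was extracted: fuzzy matching is the current fallback,
    # and the proposed option turns that fallback off.
    return Strategy.SKIP if doi_only else Strategy.FUZZY_MATCH
```

With `doi_only=True`, a document without a DOI simply keeps its extracted metadata, which avoids unrelated records at the cost of fewer consolidated documents.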

@coveralls

coveralls commented Apr 16, 2021

Coverage Status

Coverage decreased (-0.001%) to 39.999% when pulling 4168261 on option-consolidate-with-doi-only into 3440cf8 on master.

@lfoppiano
Collaborator

Hi @Aazhar,
If I have understood correctly, you want something that allows removing obvious mistakes. I'm not sure whether this change is too restrictive for grobid or whether it can be implemented.

In the biblio-glutton lookup (https://github.com/kermitt2/biblio-glutton) we implemented a mechanism that validates the response using a distance function between certain bibliographic fields of the query and of the output, and discards results scoring below 70%.

You can see the implementation at https://github.com/kermitt2/biblio-glutton/blob/26e080bc8e6b1d6118a62aaef0620d67dd034c9b/lookup/src/main/java/com/scienceminer/lookup/storage/LookupEngine.java#L464
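
For readers who do not want to open the Java source, the idea of such a post-validation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the compared fields (`title`, `first_author`), the use of `difflib.SequenceMatcher` as the distance function and the helper names are chosen here for the example and are not taken from the linked `LookupEngine.java`; only the 70% threshold comes from the comment above.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and repeated whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def post_validate(query: dict, candidate: dict, threshold: float = 0.7) -> bool:
    """Accept the candidate only if every field present on both sides is similar enough.

    The field names are hypothetical; the 0.7 default mirrors the 70% cut-off
    mentioned in the comment.
    """
    compared = [f for f in ("title", "first_author") if query.get(f) and candidate.get(f)]
    if not compared:
        return False  # nothing to compare on: reject rather than trust blindly
    return all(similarity(query[f], candidate[f]) >= threshold for f in compared)
```

A check like this would reject a candidate whose title has nothing in common with the extracted one, which is the kind of obvious mistake discussed here.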

Cheers!
Luca

@lfoppiano lfoppiano added the consolidation label (Issue related to consolidation and biblio-glutton/crossref external service) Apr 19, 2021
@Aazhar
Collaborator Author

Aazhar commented Apr 19, 2021

Hello @lfoppiano

Indeed, post-validation is implemented directly in biblio-glutton, but for CrossRef I don't know why it was deactivated at some point; maybe @kermitt2 could give some hints: https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/utilities/Consolidation.java#L658 . I've added a commit to bring it back.

Considering the DOI only is just another value for the consolidateHeader API parameter, so it will not be the default behavior. Still, it is important to be able to have such a choice, because in some use cases unrelated data can confuse users, especially with the current CrossRef data, which contains a lot of duplicates (I identified many after running the benchmark described in the biblio-glutton documentation). Here are some examples:

https://api.crossref.org/works/10.1038/ajg.2008.67 / https://api.crossref.org/works/10.14309/00000434-200903000-00011

https://api.crossref.org/works/10.1001/jama.253.6.805 / https://api.crossref.org/works/10.1001/jama.1985.03350300093027

https://api.crossref.org/works/10.1086/286102 / https://api.crossref.org/works/10.2307/2463293
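
Since the new behavior is selected through the consolidateHeader parameter, a client would only need to change one form field. The sketch below builds such a request without sending it. It rests on assumptions that are not stated in this thread: the `api/processHeaderDocument` endpoint path, the local base URL used in the usage note, and above all the value `"3"` for the DOI-only mode, which should be checked against the GROBID service documentation. `build_header_request` is a hypothetical helper.

```python
from typing import Dict, Tuple
from urllib.parse import urljoin

# Only the "normal" mode (1) is named in this thread; the other values
# are assumptions made for this illustration.
CONSOLIDATE_NONE = "0"
CONSOLIDATE_NORMAL = "1"
CONSOLIDATE_DOI_ONLY = "3"  # assumption, verify against the service docs


def build_header_request(base_url: str, consolidate_header: str) -> Tuple[str, Dict[str, str]]:
    """Return the endpoint URL and form fields for a header-extraction call."""
    url = urljoin(base_url.rstrip("/") + "/", "api/processHeaderDocument")
    return url, {"consolidateHeader": consolidate_header}
```

Calling `build_header_request("http://localhost:8070", CONSOLIDATE_DOI_ONLY)` yields the URL and the form fields; the PDF itself would be attached as a multipart file with whatever HTTP client is in use.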

@kermitt2
Owner

kermitt2 commented Apr 19, 2021

These false positive issues are indeed probably more related (in volume) to CrossRef. My approach so far was to implement consolidation with CrossRef mostly as a convenience for users, but given that the REST API is not reliable enough and that the matching criteria can change at any time, it can't be as reliable over time as biblio-glutton (independently of the scaling aspect). I tried to explain this here:

#616

Regarding the post-validation with CrossRef, it is simple, and if the post-validation on titles has been commented out, it normally means that it decreased the performance over the PMC 1943 dataset at the time (note that this might have changed, and I think the CrossRef team has also introduced some post-validation, which could make ours redundant). So I would not re-introduce it based on intuition only; benchmarks are here to guide this kind of choice.

I can see some interest in the additional consolidation option when we somehow know that the consolidation will not be good, for instance because we are processing preprints which are too recent to be in CrossRef. Consolidation will not help in this scenario, and at large volume there will be more false positives, which look bad to final users. As some preprints can have a DOI at an early stage (the bioRxiv ones), it might still be interesting to consolidate with the DOI only.

@kermitt2
Owner

kermitt2 commented Apr 19, 2021

> it is important to be able to have such a choice, because in some use cases unrelated data can confuse users, especially with the current CrossRef data, which contains a lot of duplicates (I identified many after running the benchmark described in the biblio-glutton documentation). Here are some examples:
>
> https://api.crossref.org/works/10.1038/ajg.2008.67 / https://api.crossref.org/works/10.14309/00000434-200903000-00011
>
> https://api.crossref.org/works/10.1001/jama.253.6.805 / https://api.crossref.org/works/10.1001/jama.1985.03350300093027
>
> https://api.crossref.org/works/10.1086/286102 / https://api.crossref.org/works/10.2307/2463293

OK, I have to say I am a bit confused here. On one hand, we have false positives that we want to avoid for the final users (the "result has nothing to do with the pdf content" case), that's for sure. On the other hand, we know that we have duplicates in CrossRef, i.e. the same bibliographical object with more or less the same metadata but different DOIs. Some are marked as aliases, but many are not; see the blog post "Double trouble with DOIs". These examples are indeed "duplicates", but for consolidation it doesn't matter?

In the consolidation benchmark, we count these cases as false because we are very strict (we would never reach 100% accuracy because of that), but shouldn't they actually be considered good matches?

@Aazhar
Collaborator Author

Aazhar commented Apr 19, 2021

> OK, I have to say I am a bit confused here. On one hand, we have false positives that we want to avoid for the final users (the "result has nothing to do with the pdf content" case), that's for sure. On the other hand, we know that we have duplicates in CrossRef, i.e. the same bibliographical object with more or less the same metadata but different DOIs. Some are marked as aliases, but many are not; see the blog post "Double trouble with DOIs". These examples are indeed "duplicates", but for consolidation it doesn't matter?
>
> In the consolidation benchmark, we count these cases as false because we are very strict (we would never reach 100% accuracy because of that), but shouldn't they actually be considered good?

I think duplicates do, in general, have more or less the same metadata, except those with different containers, for instance; I don't understand what those are supposed to mean.

Regarding the benchmark, maybe consider checking against the alias when the result doesn't match the expected DOI?
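
A minimal sketch of what such a relaxed benchmark check could look like, assuming the equivalences are available as groups of DOIs. The groups below are built by hand from the duplicate pairs listed earlier in this thread; `DUPLICATE_GROUPS`, `normalize_doi` and `is_match` are hypothetical names, and a real benchmark would presumably take alias information from CrossRef instead.

```python
from typing import FrozenSet, Iterable, List


def normalize_doi(doi: str) -> str:
    """DOIs are case-insensitive, so compare them trimmed and lower-cased."""
    return doi.strip().lower()


# Illustrative equivalence groups, taken from the duplicate pairs in this thread.
DUPLICATE_GROUPS: List[FrozenSet[str]] = [
    frozenset({"10.1038/ajg.2008.67", "10.14309/00000434-200903000-00011"}),
    frozenset({"10.1001/jama.253.6.805", "10.1001/jama.1985.03350300093027"}),
    frozenset({"10.1086/286102", "10.2307/2463293"}),
]


def is_match(
    expected: str,
    returned: str,
    groups: Iterable[FrozenSet[str]] = DUPLICATE_GROUPS,
) -> bool:
    """Count a result as correct if it is the expected DOI or a known duplicate of it."""
    exp, ret = normalize_doi(expected), normalize_doi(returned)
    if exp == ret:
        return True
    return any(exp in group and ret in group for group in groups)
```

With this check, `is_match("10.1086/286102", "10.2307/2463293")` counts as a good match, while a DOI from a different group is still counted as false.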

@kermitt2 kermitt2 added this to the 0.7.2 milestone Sep 25, 2022
@kermitt2
Owner

Just as a remark, testing the 2 example documents: both glutton and crossref consolidations are currently working as expected (no false positive consolidation) with the normal consolidation mode (1).

Merging anyway to cover possible cases and scenarios where it is relevant.

@kermitt2 kermitt2 merged commit c910f98 into master Oct 21, 2022