added option to consolidate using doi only. #742

Aazhar · 2021-04-16T19:18:58Z

Hello

When using consolidation and the DOI is not present in the document, we proceed with fuzzy matching. While this is very helpful in most cases, in some cases it gives results that are hard to understand (from the point of view of general users), because it depends first on the quality of the metadata extraction and especially on the consolidation service.

I've been running the crossref API benchmark, and it seems that the matching scores have decreased, so I think we should be able to consolidate using the DOI only. Here are some examples:

For this one, the original title is in French; in the resulting TEI, the title is in English and a DOI and a journal have been added.

exemple1.pdf

For this one, it is obvious that the result has nothing to do with the PDF content:

exemple3.pdf

coveralls · 2021-04-16T19:37:05Z

Coverage decreased (-0.001%) to 39.999% when pulling 4168261 on option-consolidate-with-doi-only into 3440cf8 on master.

lfoppiano · 2021-04-19T02:27:20Z

Hi @Aazhar,
if I have understood correctly, I think you might want something that allows removing obvious mistakes. I'm not sure whether this change is too restrictive for grobid or whether it can be implemented.

In the biblio-glutton lookup (https://github.com/kermitt2/biblio-glutton) we implemented a mechanism that validates the response using a distance function between certain bibliographic data of the query and of the output, and discards results below 70%.

You can see the implementation at https://github.com/kermitt2/biblio-glutton/blob/26e080bc8e6b1d6118a62aaef0620d67dd034c9b/lookup/src/main/java/com/scienceminer/lookup/storage/LookupEngine.java#L464

Cheers!
Luca
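
To illustrate the kind of post-validation described in the comment above, here is a minimal sketch. It is not the biblio-glutton implementation: the field names, the use of `difflib` as a similarity measure, and the helper names are assumptions made for this example; only the 70% threshold comes from the comment.

```python
from difflib import SequenceMatcher

# Threshold taken from the "70%" mentioned in the comment; everything else is hypothetical.
MIN_SIMILARITY = 0.70


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a ratio in [0, 1]; difflib's ratio stands in for the real distance function."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def post_validate(query, candidate, fields=("title", "first_author"),
                  threshold=MIN_SIMILARITY):
    """Accept the candidate only if every field present on both sides is similar enough."""
    compared = 0
    for field in fields:
        q, c = query.get(field), candidate.get(field)
        if not q or not c:
            continue  # nothing to compare for this field
        compared += 1
        if similarity(q, c) < threshold:
            return False
    # Reject when no field could be compared at all.
    return compared > 0
```

A candidate whose title and first author only differ in case or spacing passes, while a record about an unrelated work is discarded instead of being injected into the result.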

Aazhar · 2021-04-19T08:05:02Z

Hello @lfoppiano

Indeed, post-validation is implemented directly in biblio-glutton, but for crossref I don't know why it was deactivated at some point; maybe @kermitt2 could give some hints: https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/utilities/Consolidation.java#L658 . I've added a commit to bring it back.

Being able to consider the DOI only is just another option value for the consolidateHeader API parameter, so it is not going to be the default behavior. It is still important to have such a choice, because in some use cases unrelated data can confuse users, especially with the current crossref data, which contains a lot of duplicates (I identified many of them after running the benchmark described in the biblio-glutton documentation). Here are some examples:

https://api.crossref.org/works/10.1038/ajg.2008.67 / https://api.crossref.org/works/10.14309/00000434-200903000-00011

https://api.crossref.org/works/10.1001/jama.253.6.805 / https://api.crossref.org/works/10.1001/jama.1985.03350300093027

https://api.crossref.org/works/10.1086/286102 / https://api.crossref.org/works/10.2307/2463293

kermitt2 · 2021-04-19T13:20:09Z

These false positive issues are indeed/probably related more (in volume) to CrossRef. My approach so far was to implement a consolidation with CrossRef more for convenience for the users, but given that the REST api is not reliable enough and that the matching criteria can change at any time, it can't be something as reliable over time as biblio-glutton (independently from the scaling aspect). I tried to explain this here:

#616

Regarding the post-validation with CrossRef, it is simple, and if the post-validation with titles has been commented out, it normally means that it decreased the performance over the PMC 1943 dataset at the time (note that this might have changed, and I think the CrossRef team has also introduced some post-validation, which could make it redundant). So I would not re-introduce it based on intuition only; benchmarks are here to guide this kind of choice.

I can see some interest in the additional consolidation option when we somehow know that the consolidation will not be good, for instance because we are processing preprints which are too early to be in CrossRef. Consolidation will not help in this scenario, and in large volumes there will be more false positives that look bad for final users. As some preprints can have a DOI at an early stage (the bioRxiv ones), it might still be interesting to consolidate with DOI only.
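
The behavior discussed here can be sketched as a small decision function. This is only an illustration of the idea, not grobid's code: the function name, the `doi_only` flag, and the dictionary keys are hypothetical, and the DOI in the usage note is one of the examples quoted in this thread.

```python
def plan_lookup(extracted, doi_only):
    """Decide how to query the consolidation service for one extracted header.

    Returns ("doi", doi) for an exact DOI lookup, ("metadata", {...}) for a
    fuzzy lookup, or None when no lookup should be attempted.
    """
    doi = (extracted.get("doi") or "").strip()
    if doi:
        # An extracted DOI is always the preferred key.
        return ("doi", doi)
    if doi_only:
        # DOI-only mode: skip fuzzy matching and keep the extracted metadata as-is.
        return None
    # Hypothetical choice of fields for the fuzzy query.
    query = {k: extracted[k] for k in ("title", "first_author") if extracted.get(k)}
    return ("metadata", query) if query else None
```

With `doi_only=True`, a header with the extracted DOI `10.1038/ajg.2008.67` is still looked up by DOI, while a header without any DOI is left untouched rather than sent to fuzzy matching.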


kermitt2 · 2021-04-19T13:46:11Z

> it is important to be able to have such a choice because in some use cases having unrelated data can make users confused, especially using the actual crossref data which contains a lot of duplicates (I've identified a lot of duplicates after running the benchmark described in the biblio glutton documentation), here some of the examples :
>
> https://api.crossref.org/works/10.1038/ajg.2008.67 / https://api.crossref.org/works/10.14309/00000434-200903000-00011
>
> https://api.crossref.org/works/10.1001/jama.253.6.805 / https://api.crossref.org/works/10.1001/jama.1985.03350300093027
>
> https://api.crossref.org/works/10.1086/286102 / https://api.crossref.org/works/10.2307/2463293

OK I have to say, I am a bit confused here. On one hand, we have false positives that we want to avoid for the final users (the "result has nothing to do with the pdf content" case), that's for sure. On the other hand, we know that we have duplicates in CrossRef: the same bibliographical object, more or less the same metadata, but different DOIs. Some are marked as aliases, but many are not (see the blog post "Double trouble with DOIs"). These examples are indeed "duplicates", but for consolidation it doesn't matter?

In the consolidation benchmark, we count these cases as false because we are very strict (we would never reach 100% accuracy because of that), but shouldn't they actually be considered good matches?

Aazhar · 2021-04-19T14:19:55Z

> OK I have to say, I am a bit confused here. We have on one hand false positives that we want to avoid for the final users (the "result has nothing to do with the pdf content" case), that's for sure. Then, we know that we have duplicates in CrossRef - so same bibliographical object, more or less same metadata but different DOIs. Some are marked as alias, but many are not see the blog post Double trouble with DOIs. These examples are indeed "duplicates", but for consolidation it doesn't matter?
>
> In the consolidation benchmark, we count false these cases because we are very strict (we would never reach 100% accuracy because of that), but shouldn't they be considered actually as good, no?

I think that in general duplicates do have more or less the same metadata, except, for instance, those with different containers; I don't understand what those are supposed to mean.

Regarding the benchmark, maybe consider checking against the alias when the result doesn't match the expected DOI?
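
A minimal sketch of what such an alias-aware check could look like in a benchmark. The function names and the alias table are hypothetical; the DOI pair used in the usage note is one of the duplicate pairs listed earlier in this thread, and how alias information would actually be obtained is left open.

```python
def normalize_doi(doi):
    """DOIs are case-insensitive, so compare them trimmed and lowercased."""
    return doi.strip().lower()


def build_alias_table(pairs):
    """Build a symmetric lookup table from (doi_a, doi_b) duplicate pairs."""
    table = {}
    for a, b in pairs:
        a, b = normalize_doi(a), normalize_doi(b)
        table.setdefault(a, set()).add(b)
        table.setdefault(b, set()).add(a)
    return table


def is_correct_match(expected_doi, returned_doi, aliases):
    """Count a result as correct if it equals the expected DOI or a known alias of it."""
    expected = normalize_doi(expected_doi)
    returned = normalize_doi(returned_doi)
    if expected == returned:
        return True
    return returned in aliases.get(expected, set())
```

For example, with a table built from the pair `10.1038/ajg.2008.67` / `10.14309/00000434-200903000-00011`, a benchmark entry expecting the first DOI would count the second one as a good match, while an unrelated DOI would still be counted as a failure.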

kermitt2 · 2022-10-21T17:37:38Z

Just as a remark, testing the 2 example documents: both glutton and crossref consolidations are currently working as expected (no false positive consolidation) with the normal consolidation mode (1).

Merging anyway to cover possible cases and scenarios where it is relevant.

Aazhar added 2 commits April 16, 2021 21:16

added option to consolidate using doi only.

3d9f8c6

repair unit test.

8de1603

update doc with new consolidateHeader option.

6995385

lfoppiano added the consolidation label (Issue related to consolidation and biblio-glutton/crossref external service) Apr 19, 2021

reactivate postvalidation for crossref consolidation without doi.

2acd1fa

Revert "reactivate postvalidation for crossref consolidation without doi." This reverts commit 2acd1fa.

5db13ba

Aazhar added 3 commits October 5, 2021 17:08

activate correction when consolidation with doi only.

e22fa45

Merge branch 'master' into option-consolidate-with-doi-only

906d16a

Merge branch 'master' into option-consolidate-with-doi-only

87f1137

kermitt2 added this to the 0.7.2 milestone Sep 25, 2022

achrafazharccsd added 2 commits October 20, 2022 13:18

Merge branch 'master' into option-consolidate-with-doi-only

1837b3b

Merge branch 'master' into option-consolidate-with-doi-only

c0b27cf

minor rephrase/typos

4168261

kermitt2 merged commit c910f98 into master Oct 21, 2022
