added option to consolidate using doi only. #742
Conversation
Hi @Aazhar, in the biblio-glutton lookup (https://github.com/kermitt2/biblio-glutton) we implemented a mechanism to validate the response, using a distance function between certain bibliographic fields of the query and of the output, and discarding results scoring below 70%. You can see the implementation at https://github.com/kermitt2/biblio-glutton/blob/26e080bc8e6b1d6118a62aaef0620d67dd034c9b/lookup/src/main/java/com/scienceminer/lookup/storage/LookupEngine.java#L464 Cheers!
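The post-validation idea above can be sketched as follows. This is not the actual biblio-glutton code (the class and method names here are illustrative): it shows the general pattern of a normalized edit-distance similarity between the queried title and the candidate's title, rejecting candidates below the 70% threshold mentioned in the comment.

```java
// Illustrative sketch of distance-based post-validation (hypothetical
// names, not the actual LookupEngine implementation): compute a
// normalized Levenshtein similarity and reject candidates below 0.7.
public class PostValidationSketch {

    // Classic dynamic-programming Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0,1]: 1.0 means identical after lower-casing.
    static double similarity(String a, String b) {
        String x = a.toLowerCase(), y = b.toLowerCase();
        int max = Math.max(x.length(), y.length());
        if (max == 0) return 1.0;
        return 1.0 - (double) levenshtein(x, y) / max;
    }

    // Accept a candidate only above the 70% threshold.
    static boolean postValidate(String queryTitle, String candidateTitle) {
        return similarity(queryTitle, candidateTitle) >= 0.7;
    }

    public static void main(String[] args) {
        System.out.println(postValidate("Double trouble with DOIs",
                                        "Double Trouble with DOIs"));     // true
        System.out.println(postValidate("Double trouble with DOIs",
                                        "A completely unrelated title")); // false
    }
}
```

The actual validation in biblio-glutton combines several bibliographic fields, not just the title; a single-field check like this is only the simplest instance of the pattern.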
Hello @lfoppiano, indeed post-validation is implemented directly in biblio-glutton, but for CrossRef I don't know why it was deactivated at some point; maybe @kermitt2 could give some hints: https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/utilities/Consolidation.java#L658. I've added a commit to bring it back.

Being able to consider the DOI only is just another option value for the consolidateHeader API parameter, so it is not going to be the default behavior. But it is important to have such a choice, because in some use cases unrelated data can confuse users, especially with the actual CrossRef data, which contains a lot of duplicates (I identified many duplicates after running the benchmark described in the biblio-glutton documentation). Here are some examples:

https://api.crossref.org/works/10.1038/ajg.2008.67 / https://api.crossref.org/works/10.14309/00000434-200903000-00011
https://api.crossref.org/works/10.1001/jama.253.6.805 / https://api.crossref.org/works/10.1001/jama.1985.03350300093027
https://api.crossref.org/works/10.1086/286102 / https://api.crossref.org/works/10.2307/2463293
These false positive issues are indeed probably related more (in volume) to CrossRef. My approach so far was to implement consolidation with CrossRef more as a convenience for users, but given that the REST API is not reliable enough and that the matching criteria can change at any time, it can't be as reliable over time as biblio-glutton (independently from the scaling aspect). I tried to explain this here:

Regarding the post-validation with CrossRef, it is simple: if the post-validation on titles has been commented out, it normally means that it decreased the performance on the PMC 1943 dataset at the time (note that this might have changed, and I think the CrossRef team has also introduced some post-validation of their own, which could make it redundant). So I would not re-introduce it based on intuition only; benchmarks are here to guide this kind of choice.

I can see some interest in the additional consolidation option when we can somehow know in advance that the consolidation will not be good, for instance because we are processing preprints that are too recent to be in CrossRef. Consolidation will not help in this scenario, and at large volume there will be more false positives that look bad to final users. Since some preprints can have a DOI at an early stage (the bioRxiv ones), it might still be interesting to consolidate with DOI only.
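The gating behavior discussed above can be sketched as a small decision function. The mode names and the method are illustrative (they are not GROBID's actual Consolidation API): the point is that a DOI-only mode skips fuzzy metadata matching entirely and only attempts a lookup when a DOI was extracted from the document.

```java
// Hypothetical sketch (illustrative names, not GROBID's actual code):
// a consolidation mode that only queries the consolidation service
// when a DOI was extracted, avoiding fuzzy-matching false positives.
public class ConsolidationModeSketch {

    enum Mode { NO_CONSOLIDATION, FULL, DOI_ONLY }

    // Decide whether a lookup should be attempted for a given header.
    static boolean shouldLookup(Mode mode, String extractedDoi) {
        switch (mode) {
            case NO_CONSOLIDATION:
                return false;
            case DOI_ONLY:
                // Only consolidate when a DOI is actually present.
                return extractedDoi != null && !extractedDoi.isEmpty();
            default:
                // FULL: fuzzy metadata matching is allowed as fallback.
                return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldLookup(Mode.DOI_ONLY, "10.1038/ajg.2008.67")); // true
        System.out.println(shouldLookup(Mode.DOI_ONLY, null));                  // false
        System.out.println(shouldLookup(Mode.FULL, null));                      // true
    }
}
```

In the preprint scenario described above, a caller who knows the corpus is unlikely to be in CrossRef would pass the DOI-only mode, so documents without an extracted DOI are simply left unconsolidated instead of being fuzzy-matched to an unrelated record.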
…doi." This reverts commit 2acd1fa.
OK, I have to say I am a bit confused here. On one hand we have false positives that we want to avoid for final users (the "result has nothing to do with the PDF content" case), that's for sure. On the other hand, we know that we have duplicates in CrossRef: the same bibliographical object, with more or less the same metadata but different DOIs. Some are marked as aliases, but many are not (see the blog post Double trouble with DOIs). These examples are indeed "duplicates", but for consolidation does it matter? In the consolidation benchmark we count these cases as false because we are very strict (we would never reach 100% accuracy because of that), but shouldn't they actually be considered good matches?
I think duplicates in general indeed have more or less the same metadata, except those having different containers for instance, which I don't understand the reason for. Regarding the benchmark, maybe consider checking against the alias when the result doesn't match the expected DOI?
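The alias-aware scoring suggested above could look like the following sketch. Everything here is hypothetical: the alias table is a stand-in for whatever duplicate mapping is available (the comment does not say how aliases would be obtained from CrossRef), and the example pair reuses one of the duplicate DOI pairs listed earlier in the thread.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical benchmark-scoring sketch: count a consolidation result
// as correct when the returned DOI matches the expected DOI directly,
// or is a known duplicate/alias of it. The alias map is illustrative;
// this does not assume any particular CrossRef alias lookup API.
public class AliasAwareScoring {

    // DOI matching is case-insensitive: normalize before comparison.
    static String norm(String doi) {
        return doi == null ? "" : doi.trim().toLowerCase();
    }

    static boolean isCorrect(String expected, String returned,
                             Map<String, Set<String>> aliases) {
        String e = norm(expected), r = norm(returned);
        if (e.equals(r)) return true;
        // Fall back to the duplicate/alias table.
        return aliases.getOrDefault(e, Set.of()).contains(r);
    }

    public static void main(String[] args) {
        // One of the duplicate pairs mentioned above, used as example data.
        Map<String, Set<String>> aliases = Map.of(
            "10.1038/ajg.2008.67",
            Set.of("10.14309/00000434-200903000-00011"));

        System.out.println(isCorrect("10.1038/ajg.2008.67",
                "10.14309/00000434-200903000-00011", aliases)); // true
        System.out.println(isCorrect("10.1038/ajg.2008.67",
                "10.1001/jama.253.6.805", aliases));            // false
    }
}
```

With such a table, the strict-counting problem described above goes away for known duplicates, while genuinely wrong matches are still counted as false.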
Just as a remark: testing the two example documents, both glutton and CrossRef consolidations currently work as expected (no false positive consolidation) with the normal consolidation mode (1). Merging anyway to cover possible cases and scenarios where it is relevant.
Hello
When using consolidation and the DOI is not present in the document, we proceed with fuzzy matching. While this should be very helpful in most cases, there are some cases where it gives confusing results (from the point of view of general users), because it depends first on the quality of the metadata extraction and especially on the consolidation service.
I've been running the CrossRef APIs benchmark, and it seems that the matching scores have decreased, so I think we should be able to consolidate using the DOI only. Here are some examples:
For this one, the original title is in French; in the resulting TEI, the title is in English, and a DOI and journal have been added.
exemple1.pdf
For this one, it is obvious that the result has nothing to do with the PDF content:
exemple3.pdf