-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fetcher for IACR eprints #3473
Fetcher for IACR eprints #3473
Conversation
Note: This is WIP, the fetcher kind of worked in a few ad-hoc tests, but it certainly isn't ready for production yet!
return formattedDates.get(0); | ||
} | ||
|
||
private static String getValueBetween(String from, String to, String haystack) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could see if there is already a method in StringUtils
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, there was indeed one. Keeping the functionality in a separate method though to have the error handling (nothing found) in one place.
if (dateMatcher.find()) { | ||
Date date = DATE_FORMAT_WEBSITE.parse(dateMatcher.group(1)); | ||
formattedDates.add(DATE_FORMAT_BIBTEX.format(date)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please always catch the most specific exceptions.
Hi, |
The problem with using libraries to retrieve the information from html is that their html seems to be somewhat buggy (using |
By moving all the "what is the content of field xy?" logic to methods, the setAdditionalFields method gives a good overview over which fields are set.
The problem is that the matching of strings (find abstract, ... in the HTML) is done against strings in the source code - therefore they are encoded with the source code encoding. The downloaded HTML however should be encoded with whatever the user selected for her library. But then, the matching fails. So this needs further investigation.
From my point of view, the fetcher with its core features is finished. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good to me! Thanks for your contribution. I give my ok for merge, but it would be nice if you could migrate the added tests to junit 5 before.
import org.jabref.model.entry.FieldName; | ||
import org.jabref.testutils.category.FetcherTest; | ||
|
||
import org.junit.Before; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just started to use JUnit 5 and it would be nice if you could use the new api (some of the other fetcher tests are already migrated).
import static org.junit.Assert.*; | ||
import static org.mockito.Mockito.mock; | ||
|
||
@Category(FetcherTest.class) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When converting to JUnit 5, please replace the category by the FetcherTest
annotation.
@@ -37,6 +38,7 @@ | |||
private static final Predicate<String> IDENTIFIER_PREDICATE = Pattern.compile("\\d{4}/\\d{3,5}").asPredicate(); | |||
private static final String CITATION_URL_PREFIX = "https://eprint.iacr.org/eprint-bin/cite.pl?entry="; | |||
private static final String DESCRIPTION_URL_PREFIX = "https://eprint.iacr.org/"; | |||
private static final Charset WEBSITE_CHARSET = Charset.forName("iso-8859-1"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very very minor, you can directly use the predefined enum Constant:
https://docs.oracle.com/javase/8/docs/api/java/nio/charset/StandardCharsets.html#ISO_8859_1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing that out - was looking for something like that, but apparently didn't look long enough...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For other cases where Java provides default enum variables, they all start with StandardXXX, for example for file opening there exists: StandardOpenOption, not really obvious if you search for it ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jep, I expected them in something like Charsets or directly as constants in the Charset class...
private static final Log LOGGER = LogFactory.getLog(IacrEprintFetcher.class); | ||
private static final Pattern DATE_FROM_WEBSITE_PATTERN = Pattern.compile("[a-z ]+(\\d{1,2} [A-Za-z][a-z]{2} \\d{4})"); | ||
private static final DateFormat DATE_FORMAT_WEBSITE = new SimpleDateFormat("dd MMM yyyy"); | ||
private static final DateFormat DATE_FORMAT_BIBTEX = new SimpleDateFormat("yyyy-MM-dd"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SimpleDateFormat is outdated, it has been replaced by several other constructs in java8:
http://www.baeldung.com/java-8-date-time-intro
Or see example 18 here for an idea: http://javarevisited.blogspot.de/2015/03/20-examples-of-date-and-time-api-from-Java8.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't know that - I'm working on changing it.
String bibtexCitationHtml = getHtml(CITATION_URL_PREFIX + validIdentifier); | ||
String actualEntry = getValueBetween("<PRE>", "</PRE>", bibtexCitationHtml); | ||
|
||
Optional<BibEntry> entry; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure, but you probably need to initialize it with Optional.empty() or you could get still an NPE if no entry is found
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the javadoc on BibtexParser.singleFromString is correct, it should always return an entry or an Optional.empty().
But I might as well initialize it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also just return the entry directly in the try construct. This is in my opinion the most readable solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
abram2017 = new BibEntry(); | ||
abram2017.setType(BiblatexEntryTypes.MISC); | ||
abram2017.setField("bibtexkey", "cryptoeprint:2017:1118"); | ||
abram2017.setField(FieldName.ABSTRACT, "The decentralized cryptocurrency Bitcoin has experienced great success but also encountered many challenges. One of the challenges has been the long confirmation time. Another challenge is the lack of incentives at certain steps of the protocol, raising concerns for transaction withholding, selfish mining, etc. To address these challenges, we propose Solida, a decentralized blockchain protocol based on reconfigurable Byzantine consensus augmented by proof-of-work. Solida improves on Bitcoin in confirmation time, and provides safety and liveness assuming the adversary control less than (roughly) one-third of the total mining power.\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please no abstracts in tests, as they are usually subject to copyright of the publisher
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look into the license details - not having the abstract there is a bit of a problem as fetching the abstract is part of the functionality; therefore it should be tested.
On first glance, the whole article is published under CC-BY or CC-BY-NC; therefore this shouldn't be a problem as the authors are mentioned right next to the abstract text.
But if you still don't want the abstracts, I can modify the tests accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a published article that can be received via IACR? In this case you hold the copyright and there is no problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet ;) But then I'll just remove the abstracts for now and maybe just check that they are not empty.
String bibtexCitationHtml = getHtml(CITATION_URL_PREFIX + validIdentifier); | ||
String actualEntry = getValueBetween("<PRE>", "</PRE>", bibtexCitationHtml); | ||
|
||
Optional<BibEntry> entry; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also just return the entry directly in the try construct. This is in my opinion the most readable solution.
@FetcherTest | ||
public class IacrEprintFetcherTest { | ||
|
||
private static IacrEprintFetcher fetcher; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please not static and use BeforeEach
instead BeforeAll
(better initialize the fetcher and entries fresh for each test)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, it's my first time using JUnit 5 and I just looked at some examples online which used BeforeAll
...
Should have been suspicious when adding the static
s...
abram2017 = new BibEntry(); | ||
abram2017.setType(BiblatexEntryTypes.MISC); | ||
abram2017.setField("bibtexkey", "cryptoeprint:2017:1118"); | ||
abram2017.setField(FieldName.ABSTRACT, "The decentralized cryptocurrency Bitcoin has experienced great success but also encountered many challenges. One of the challenges has been the long confirmation time. Another challenge is the lack of incentives at certain steps of the protocol, raising concerns for transaction withholding, selfish mining, etc. To address these challenges, we propose Solida, a decentralized blockchain protocol based on reconfigurable Byzantine consensus augmented by proof-of-work. Solida improves on Bitcoin in confirmation time, and provides safety and liveness assuming the adversary control less than (roughly) one-third of the total mining power.\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a published article that can be received via IACR? In this case you hold the copyright and there is no problem.
As pointed out by @Siedlerchr in JabRef#3473, the abstracts might be a copyright problem.
I just ran some "manual" tests for very old eprints and discovered that they indeed changed something in 2000 - up to 2000, the eprints don't have versions and the date format is a bit different. |
The entries before year 2000 use a slightly different format which e.g. doesn't include a version, also the date format is different. With this commit, we also throw an error if the user tries to fetch an entry for a withdrawn paper. This is meant as a warning to the user, she might still add the entry manually to her database. This will be especially useful once a "search by title" or something similar gets implemented.
I implemented support for pre-2000 entries; also added some tests for these as the pre-2000 entries don't seem to really have a "strong" standard format. I will now do some more "manual" testing, but I think everything will be finished in the next few hours. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks really good. I have just one remark concerning the translated strings and another suggestion:
There exists a convenient interface https://github.com/JabRef/jabref/blob/master/src/main/java/org/jabref/logic/importer/IdBasedParserFetcher.java for fetcher that follow the usual scheme: determine url, fetch response, parse. Since your fetcher follows this strategy it might be better to use IdBasedParserFetcher
as a base. There are quite a few implementations already which may serve as a guide.
@@ -1819,6 +1819,11 @@ Copy_BibTeX_key_and_title= | |||
File_rename_failed_for_%0_entries.= | |||
Merged_BibTeX_source_code= | |||
Invalid_DOI\:_'%0'.=Ugyldig_DOI\:_'%0'. | |||
Invalid_IACR_identifier\:_'%0'.= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please try to use a more generic version in these strings. As of now, they are not reusable in other fetchers or situations. E.g. just use invalid identifier
or replace ICAR
by a parameter slot.
Also: Remove some of them completly, replacing them with slightly different existing ones.
I took a look at the |
ok, then leave it like that. It was just a suggestion that came to my mind while browsing the code. |
codacy complained about reassigning a method parameter and about the visibility of a test method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks really good and I have nothing to criticize. It is ready to merge from my point of view.
However, I tried to test the feature in a running JabRef and somehow the IACR fetcher is not showing up in the drop-down menu in the web search side bar. It should be there, right? Am I missing something?
@lenhard I see your point - I haven't thought about that as I usually only use the "New entry" -> "ID based" way to fetch entries. |
@derTimme Thanks for the explanation! What you write makes sense. There's no need to add the fetcher to the web search side bar as well. Unfortunately, there are some merge conflicts in the language files now. Could you please resolve those and then this is ready to go into master. |
I'm really confused right now... I tried to merge in master and apparently it worked - in my local repo, the localization files look fine and I can do
without any conflicts... |
@derTimme You have to first configure the JabRef/jabref repo as remote repository. Your master branch is still on the version when it was first forked. It does not get synced automatically. Gitbub Help on adding a cloned repo with a remote one Add JabRef as remote repo:
The last one merges the changes from the upstream repo (in this case the JabRef/jabref main repo) |
Okay, as all conflicts are resolved and the reviews are okay, I merge it now into master! |
I'll merge this now before you have to cope again with changes on the master branch. I'm sorry for the inconvenience caused by our recent change to the language files. Thanks for your contribution! Edit: ok @Siedlerchr was quicker :-) |
Thank you :) |
* upstream/master: (108 commits) Fetcher for IACR eprints (#3473) Update internal state of DatabaseChangeMonitor when external changes … (#3503) Fixes #3505: Another try to fix the NPE in the search bar (#3512) Replace ' with ' so that our HTML preview can handle it correctly Added a "Clear text" button in right click menu within the text boxes. (#3475) Add reset to English language after a test New translations JabRef_en.properties (German) Remove ampersand in non-menu localizations New translations JabRef_en.properties (German) New translations Menu_en.properties (German) New translations Menu_en.properties (German) New translations JabRef_en.properties (Vietnamese) New translations JabRef_en.properties (Italian) New translations Menu_en.properties (Italian) New translations JabRef_en.properties (Indonesian) New translations Menu_en.properties (Indonesian) New translations JabRef_en.properties (Greek) New translations Menu_en.properties (Greek) New translations Menu_en.properties (Japanese) New translations JabRef_en.properties (German) ... # Conflicts: # build.gradle
I'm working on a new fetcher for IACR eprints, as I need a lot of these for my current work.
IACR does not provide a real API, so the information needs to be parsed out of their (really simple) HTML.
@devs: Are there any special requirements for fetchers? (I couldn't find any, but the whole HTML parsing thing might be a problem concerning maintainability - on the other hand, I use IACR for quite a while now and the web interface hasn't changed (visibly at least)).
The idea is to have the following features:
2017/1556
)misc
and contains:\url{https://eprint.iacr.org/<identifier>}
(recommended by IACR)Cryptology ePrint Archive, Report <identifier>
(recommended by IACR)2017/1095
)The following could also make sense:
gradle localizationUpdate
?