Fetcher for IACR eprints #3473

derTimme · 2017-11-30T12:31:02Z

I'm working on a new fetcher for IACR eprints, as I need a lot of these for my current work.

IACR does not provide a real API, so the information needs to be parsed out of their (really simple) HTML.

@devs: Are there any special requirements for fetchers? I couldn't find any, but the HTML parsing might be a maintainability problem. On the other hand, I have been using IACR for quite a while now and the web interface hasn't changed (visibly, at least).

The idea is to have the following features:

The following could also make sense:

Get fulltext (if that means to download the pdf - I haven't used this feature with other fetchers yet, so I don't really know what exactly it does)
Get the "also published in" information (see e.g. this paper) and DOI, if present (e.g. this paper)
Get keywords
Search in IACR preprints (e.g. for title) - but I suspect this would require us to download the complete archive or somehow "fill out" their search form programmatically (by crafting the corresponding POST request)

Change in CHANGELOG.md described
Tests created for changes
Manually tested changed features in running JabRef
Check documentation status (Issue created for outdated help page at help.jabref.org?)
If you changed the localization: Did you run gradle localizationUpdate?

Note: This is WIP, the fetcher kind of worked in a few ad-hoc tests, but it certainly isn't ready for production yet!

Siedlerchr · 2017-11-30T13:37:26Z

src/main/java/org/jabref/logic/importer/fetcher/IacrEprintFetcher.java

+        return formattedDates.get(0);
+    }
+
+    private static String getValueBetween(String from, String to, String haystack) {


You could see if there is already a method in StringUtils

Thanks, there was indeed one. Keeping the functionality in a separate method though to have the error handling (nothing found) in one place.
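
To illustrate the idea of keeping the "nothing found" handling in one place, here is a minimal sketch of such a helper in plain Java. It is not the code from this PR: the class name is made up for illustration, and returning an Optional (instead of throwing a fetcher-specific exception) is an assumption.

```java
import java.util.Optional;

final class ValueBetweenSketch {

    // Returns the text between the first occurrence of `from` and the next
    // occurrence of `to`, or an empty Optional if either delimiter is missing.
    static Optional<String> getValueBetween(String from, String to, String haystack) {
        int start = haystack.indexOf(from);
        if (start < 0) {
            return Optional.empty();
        }
        start += from.length();
        int end = haystack.indexOf(to, start);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(haystack.substring(start, end));
    }

    public static void main(String[] args) {
        String html = "<html><PRE>@misc{key, title={T}}</PRE></html>";
        System.out.println(getValueBetween("<PRE>", "</PRE>", html).orElse("nothing found"));
    }
}
```

Callers then decide once how to react to an empty result, rather than checking for null after every lookup.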

Siedlerchr · 2017-11-30T13:37:54Z

src/main/java/org/jabref/logic/importer/fetcher/IacrEprintFetcher.java

+                if (dateMatcher.find()) {
+                    Date date = DATE_FORMAT_WEBSITE.parse(dateMatcher.group(1));
+                    formattedDates.add(DATE_FORMAT_BIBTEX.format(date));
+                }


Please always catch the most specific exceptions.
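
As a rough illustration of this review comment, the sketch below catches only ParseException, which is the checked exception that DateFormat.parse declares. It still uses the legacy SimpleDateFormat API because that is what the hunk above uses; the class and method names are hypothetical, and the explicit locales are an assumption.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.Optional;

final class SpecificCatchSketch {

    // Converts a date such as "30 Nov 2017" to "2017-11-30".
    static Optional<String> toBibtexDate(String websiteDate) {
        // Fresh instances per call, since SimpleDateFormat is not thread-safe.
        DateFormat websiteFormat = new SimpleDateFormat("dd MMM yyyy", Locale.ENGLISH);
        DateFormat bibtexFormat = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
        try {
            Date date = websiteFormat.parse(websiteDate);
            return Optional.of(bibtexFormat.format(date));
        } catch (ParseException e) {
            // Only the expected failure is handled here. Anything else
            // (for example a NullPointerException) still surfaces as a bug.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toBibtexDate("30 Nov 2017").orElse("unparseable"));
    }
}
```

Catching a broad Exception instead would also swallow programming errors that have nothing to do with date parsing.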

Siedlerchr · 2017-11-30T13:39:06Z

Hi,
I think the HTML parsing is not really a problem. We already do that in other cases as well. You could also take a look at jsoup for parsing HTML content and finding data between elements (it should already be included in our deps).

derTimme · 2017-11-30T13:42:01Z

The problem with using libraries to retrieve the information from html is that their html seems to be somewhat buggy (using <p /> for example).

By moving all the "what is the content of field xy?" logic to methods, the setAdditionalFields method gives a good overview over which fields are set.

The problem is that the matching of strings (find abstract, ... in the HTML) is done against strings in the source code - therefore they are encoded with the source code encoding. The downloaded HTML however should be encoded with whatever the user selected for her library. But then, the matching fails. So this needs further investigation.
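
The encoding problem described above can be reproduced in a few lines. The sketch below is only an illustration under assumptions: it presumes the website serves ISO-8859-1 (which matches the charset constant discussed later in this review), and the class name, method name, and sample word are made up. Non-ASCII characters are written as Unicode escapes so that the example does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetDecodingSketch {

    // Decodes raw bytes from a download with an explicitly chosen charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        String expected = "na\u00efve"; // contains one non-ASCII character
        byte[] raw = expected.getBytes(StandardCharsets.ISO_8859_1); // what the server would send

        String right = decode(raw, StandardCharsets.ISO_8859_1);
        String wrong = decode(raw, StandardCharsets.UTF_8);

        System.out.println(right.equals(expected)); // decoded with the website's charset
        System.out.println(wrong.equals(expected)); // decoded with some other charset
    }
}
```

Decoding with the website's charset, independent of the encoding the user selected for the library, keeps string matching against the HTML stable.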

derTimme · 2017-12-01T12:59:28Z

From my point of view, the fetcher with its core features is finished.
The additional features mentioned in the initial comment are something I might tackle later on.

tobiasdiez

The code looks good to me! Thanks for your contribution. I give my ok for merge, but it would be nice if you could migrate the added tests to junit 5 before.

tobiasdiez · 2017-12-01T13:33:12Z

src/test/java/org/jabref/logic/importer/fetcher/IacrEprintFetcherTest.java

+import org.jabref.model.entry.FieldName;
+import org.jabref.testutils.category.FetcherTest;
+
+import org.junit.Before;


We just started to use JUnit 5 and it would be nice if you could use the new api (some of the other fetcher tests are already migrated).

tobiasdiez · 2017-12-01T13:33:46Z

src/test/java/org/jabref/logic/importer/fetcher/IacrEprintFetcherTest.java

+import static org.junit.Assert.*;
+import static org.mockito.Mockito.mock;
+
+@Category(FetcherTest.class)


When converting to JUnit 5, please replace the category by the FetcherTest annotation.

Siedlerchr · 2017-12-01T13:47:44Z

src/main/java/org/jabref/logic/importer/fetcher/IacrEprintFetcher.java

@@ -37,6 +38,7 @@
    private static final Predicate<String> IDENTIFIER_PREDICATE = Pattern.compile("\\d{4}/\\d{3,5}").asPredicate();
    private static final String CITATION_URL_PREFIX = "https://eprint.iacr.org/eprint-bin/cite.pl?entry=";
    private static final String DESCRIPTION_URL_PREFIX = "https://eprint.iacr.org/";
+    private static final Charset WEBSITE_CHARSET = Charset.forName("iso-8859-1");


Very minor: you can directly use the predefined constant:
https://docs.oracle.com/javase/8/docs/api/java/nio/charset/StandardCharsets.html#ISO_8859_1

Thanks for pointing that out - was looking for something like that, but apparently didn't look long enough...

In other cases where Java provides predefined constants, the class names also start with StandardXXX. For example, for file opening there is StandardOpenOption. Not really obvious if you search for it ;)

Yep, I expected them in something like Charsets or directly as constants in the Charset class...
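
For reference, a small sketch comparing the two ways of obtaining the charset. The class and field names are hypothetical; the calls themselves are standard java.nio.charset API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StandardCharsetsSketch {

    // Lookup by name: resolved at runtime, a typo only fails when this line runs.
    static final Charset BY_NAME = Charset.forName("iso-8859-1");

    // Predefined constant: checked by the compiler, no string lookup needed.
    static final Charset BY_CONSTANT = StandardCharsets.ISO_8859_1;

    public static void main(String[] args) {
        System.out.println(BY_NAME.equals(BY_CONSTANT));
        System.out.println(BY_CONSTANT.name());
    }
}
```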

Siedlerchr · 2017-12-02T11:48:41Z

src/main/java/org/jabref/logic/importer/fetcher/IacrEprintFetcher.java

+    private static final Log LOGGER = LogFactory.getLog(IacrEprintFetcher.class);
+    private static final Pattern DATE_FROM_WEBSITE_PATTERN = Pattern.compile("[a-z ]+(\\d{1,2} [A-Za-z][a-z]{2} \\d{4})");
+    private static final DateFormat DATE_FORMAT_WEBSITE = new SimpleDateFormat("dd MMM yyyy");
+    private static final DateFormat DATE_FORMAT_BIBTEX = new SimpleDateFormat("yyyy-MM-dd");


SimpleDateFormat is outdated, it has been replaced by several other constructs in java8:
http://www.baeldung.com/java-8-date-time-intro
Or see example 18 here for an idea: http://javarevisited.blogspot.de/2015/03/20-examples-of-date-and-time-api-from-Java8.html

Didn't know that - I'm working on changing it.
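
A minimal sketch of what such a migration could look like with java.time. This is not the code that ended up in the PR: the class and method names are hypothetical, and the pattern "d MMM yyyy" with an English locale is an assumption based on the regular expression in the hunk above, which allows one or two day digits.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;

final class JavaTimeDateSketch {

    // DateTimeFormatter is immutable and thread-safe, so a shared static instance is fine.
    private static final DateTimeFormatter WEBSITE_FORMAT =
            DateTimeFormatter.ofPattern("d MMM yyyy", Locale.ENGLISH);

    // Converts a date such as "30 Nov 2017" to the ISO form "2017-11-30".
    static Optional<String> toBibtexDate(String websiteDate) {
        try {
            LocalDate date = LocalDate.parse(websiteDate, WEBSITE_FORMAT);
            return Optional.of(date.format(DateTimeFormatter.ISO_LOCAL_DATE));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toBibtexDate("30 Nov 2017").orElse("unparseable"));
    }
}
```

Unlike SimpleDateFormat, the formatter can safely live in a static field, and LocalDate carries no time zone that could shift the day.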

Siedlerchr · 2017-12-02T11:49:39Z

src/main/java/org/jabref/logic/importer/fetcher/IacrEprintFetcher.java

+        String bibtexCitationHtml = getHtml(CITATION_URL_PREFIX + validIdentifier);
+        String actualEntry = getValueBetween("<PRE>", "</PRE>", bibtexCitationHtml);
+
+        Optional<BibEntry> entry;


I am not sure, but you probably need to initialize it with Optional.empty(), or you could still get an NPE if no entry is found.

If the javadoc on BibtexParser.singleFromString is correct, it should always return an entry or an Optional.empty().
But I might as well initialize it...

you can also just return the entry directly in the try construct. This is in my opinion the most readable solution.
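
The suggestion to return directly from the try block can be sketched as follows. Everything here is a stand-in: parseSingle and HypotheticalParseException are hypothetical placeholders for the real parser call and its exception, and entries are modelled as plain strings.

```java
import java.util.Optional;

final class ReturnFromTrySketch {

    /** Hypothetical checked exception standing in for a parser failure. */
    static class HypotheticalParseException extends Exception {
        HypotheticalParseException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for the real parser: accepts only text starting with '@'. */
    static Optional<String> parseSingle(String bibtex) throws HypotheticalParseException {
        if (bibtex == null) {
            throw new HypotheticalParseException("no input");
        }
        String trimmed = bibtex.trim();
        return trimmed.startsWith("@") ? Optional.of(trimmed) : Optional.empty();
    }

    // No local variable to declare or initialize: every path returns a value.
    static Optional<String> fetchEntry(String bibtex) {
        try {
            return parseSingle(bibtex);
        } catch (HypotheticalParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchEntry("@misc{key}").isPresent());
    }
}
```

With this shape the compiler guarantees that every path yields an Optional, so the question of how to initialize the variable does not arise.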

Siedlerchr · 2017-12-02T11:50:36Z

src/test/java/org/jabref/logic/importer/fetcher/IacrEprintFetcherTest.java

+        abram2017 = new BibEntry();
+        abram2017.setType(BiblatexEntryTypes.MISC);
+        abram2017.setField("bibtexkey", "cryptoeprint:2017:1118");
+        abram2017.setField(FieldName.ABSTRACT, "The decentralized cryptocurrency Bitcoin has experienced great success but also encountered many challenges. One of the challenges has been the long confirmation time. Another challenge is the lack of incentives at certain steps of the protocol, raising concerns for transaction withholding, selfish mining, etc. To address these challenges, we propose Solida, a decentralized blockchain protocol based on reconfigurable Byzantine consensus augmented by proof-of-work. Solida improves on Bitcoin in confirmation time,  and provides safety  and liveness assuming the adversary control less than (roughly) one-third of the total mining power.\n");


Please no abstracts in tests, as they are usually subject to copyright of the publisher

I'll look into the license details - not having the abstract there is a bit of a problem as fetching the abstract is part of the functionality; therefore it should be tested.
On first glance, the whole article is published under CC-BY or CC-BY-NC; therefore this shouldn't be a problem as the authors are mentioned right next to the abstract text.
But if you still don't want the abstracts, I can modify the tests accordingly.

Do you have a published article that can be received via IACR? In this case you hold the copyright and there is no problem.

Not yet ;) But then I'll just remove the abstracts for now and maybe just check that they are not empty.

tobiasdiez · 2017-12-02T12:15:36Z

src/main/java/org/jabref/logic/importer/fetcher/IacrEprintFetcher.java

+        String bibtexCitationHtml = getHtml(CITATION_URL_PREFIX + validIdentifier);
+        String actualEntry = getValueBetween("<PRE>", "</PRE>", bibtexCitationHtml);
+
+        Optional<BibEntry> entry;


you can also just return the entry directly in the try construct. This is in my opinion the most readable solution.

tobiasdiez · 2017-12-02T12:17:18Z

src/test/java/org/jabref/logic/importer/fetcher/IacrEprintFetcherTest.java

+@FetcherTest
+public class IacrEprintFetcherTest {
+
+    private static IacrEprintFetcher fetcher;


Please make this non-static and use BeforeEach instead of BeforeAll (it is better to initialize the fetcher and entries fresh for each test).

Thanks, it's my first time using JUnit 5 and I just looked at some examples online which used BeforeAll...
Should have been suspicious when adding the statics...

tobiasdiez · 2017-12-02T12:18:26Z

src/test/java/org/jabref/logic/importer/fetcher/IacrEprintFetcherTest.java

+        abram2017 = new BibEntry();
+        abram2017.setType(BiblatexEntryTypes.MISC);
+        abram2017.setField("bibtexkey", "cryptoeprint:2017:1118");
+        abram2017.setField(FieldName.ABSTRACT, "The decentralized cryptocurrency Bitcoin has experienced great success but also encountered many challenges. One of the challenges has been the long confirmation time. Another challenge is the lack of incentives at certain steps of the protocol, raising concerns for transaction withholding, selfish mining, etc. To address these challenges, we propose Solida, a decentralized blockchain protocol based on reconfigurable Byzantine consensus augmented by proof-of-work. Solida improves on Bitcoin in confirmation time,  and provides safety  and liveness assuming the adversary control less than (roughly) one-third of the total mining power.\n");


Do you have a published article that can be received via IACR? In this case you hold the copyright and there is no problem.

@Siedlerchr

As pointed out by @Siedlerchr in JabRef#3473, the abstracts might be a copyright problem.

derTimme · 2017-12-02T16:36:33Z

I just ran some "manual" tests for very old eprints and discovered that they indeed changed something in 2000 - up to 2000, the eprints don't have versions and the date format is a bit different.
I made this "WIP" again and am working on also supporting these old entries.

The entries before the year 2000 use a slightly different format, which e.g. doesn't include a version; the date format is also different. With this commit, we also throw an error if the user tries to fetch an entry for a withdrawn paper. This is meant as a warning to the user; she can still add the entry manually to her database. This will be especially useful once a "search by title" or something similar gets implemented.

derTimme · 2017-12-07T12:02:30Z

I implemented support for pre-2000 entries; also added some tests for these as the pre-2000 entries don't seem to really have a "strong" standard format.
Luckily there aren't too many of these (about 65), so the tests just cover all of them.

I will now do some more "manual" testing, but I think everything will be finished in the next few hours.

tobiasdiez

The code looks really good. I have just one remark concerning the translated strings and another suggestion:
There exists a convenient interface https://github.com/JabRef/jabref/blob/master/src/main/java/org/jabref/logic/importer/IdBasedParserFetcher.java for fetchers that follow the usual scheme: determine URL, fetch response, parse. Since your fetcher follows this strategy, it might be better to use IdBasedParserFetcher as a base. There are quite a few implementations already which may serve as a guide.

tobiasdiez · 2017-12-07T12:14:01Z

src/main/resources/l10n/JabRef_da.properties

@@ -1819,6 +1819,11 @@ Copy_BibTeX_key_and_title=
 File_rename_failed_for_%0_entries.=
 Merged_BibTeX_source_code=
 Invalid_DOI\:_'%0'.=Ugyldig_DOI\:_'%0'.
+Invalid_IACR_identifier\:_'%0'.=


Please try to use a more generic version of these strings. As of now, they are not reusable in other fetchers or situations. E.g. just use "invalid identifier" or replace IACR by a parameter slot.

Also: Remove some of them completely, replacing them with slightly different existing ones.
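
To make the "parameter slot" idea concrete, here is a small sketch of filling %0-style placeholders. It is not JabRef's localization code; the class name, the fill helper, and the example template are hypothetical and only mimic the %0 convention visible in the properties file above.

```java
final class ParameterSlotSketch {

    // Replaces %0, %1, ... in the template with the given parameters.
    static String fill(String template, String... params) {
        String result = template;
        // Go from the highest index down so that "%1" does not clobber "%10".
        for (int i = params.length - 1; i >= 0; i--) {
            result = result.replace("%" + i, params[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // One generic string serves every fetcher instead of one string per fetcher.
        System.out.println(fill("Invalid %0 identifier: '%1'.", "IACR", "2017/11x"));
        System.out.println(fill("Invalid %0 identifier: '%1'.", "DOI", "abc"));
    }
}
```

Translators then only have to handle one key, and new fetchers can reuse it without touching the language files.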

derTimme · 2017-12-07T16:45:07Z

I took a look at the IdBasedParserFetcher and if you insist, I could use it - but this fetcher works slightly differently in that it requests two different URLs and builds the resulting entry with information from both of them.

tobiasdiez · 2017-12-07T16:55:50Z

ok, then leave it like that. It was just a suggestion that came to my mind while browsing the code.

codacy complained about reassigning a method parameter and about the visibility of a test method.

lenhard

The code looks really good and I have nothing to criticize. It is ready to merge from my point of view.

However, I tried to test the feature in a running JabRef and somehow the IACR fetcher is not showing up in the drop-down menu in the web search side bar. It should be there, right? Am I missing something?

derTimme · 2017-12-10T20:56:23Z

@lenhard I see your point - I haven't thought about that as I usually only use the "New entry" -> "ID based" way to fetch entries.
After looking into it, I don't think it belongs in the web search side bar as it only searches based on the ID of a paper; the DOI fetcher isn't in the side bar either. This makes sense from my point of view, as the web search side bar doesn't give any clues that it expects an ID. But if you think it makes sense to add it there, I can do that.

lenhard · 2017-12-11T08:38:09Z

@derTimme Thanks for the explanation! What you write makes sense. There's no need to add the fetcher to the web search side bar as well.

Unfortunately, there are some merge conflicts in the language files now. Could you please resolve those and then this is ready to go into master.

derTimme · 2017-12-11T18:24:49Z

I'm really confused right now... I tried to merge in master and apparently it worked - in my local repo, the localization files look fine and I can do

git checkout master
git merge iacr_eprint_fetcher

without any conflicts...

Siedlerchr · 2017-12-11T19:35:45Z

@derTimme You first have to configure the JabRef/jabref repo as a remote repository. Your master branch is still at the version from when it was first forked. It does not get synced automatically.

GitHub Help on syncing a fork with its upstream repository

Add JabRef as remote repo:

git remote add upstream https://github.com/JabRef/jabref.git
git checkout <your branch>  
git merge upstream/master

The last one merges the changes from the upstream repo (in this case the JabRef/jabref main repo)

Siedlerchr · 2017-12-13T14:41:26Z

Okay, as all conflicts are resolved and the reviews are okay, I'll merge it into master now!
Thank you very much for your contribution! We hope you continue contributing ;)

tobiasdiez · 2017-12-13T14:42:47Z

I'll merge this now before you have to cope again with changes on the master branch. I'm sorry for the inconvenience caused by our recent change to the language files. Thanks for your contribution!

Edit: ok @Siedlerchr was quicker :-)

derTimme · 2017-12-13T14:58:38Z

Thank you :)

* upstream/master: (108 commits) Fetcher for IACR eprints (#3473) Update internal state of DatabaseChangeMonitor when external changes … (#3503) Fixes #3505: Another try to fix the NPE in the search bar (#3512) Replace ' with ' so that our HTML preview can handle it correctly Added a "Clear text" button in right click menu within the text boxes. (#3475) Add reset to English language after a test New translations JabRef_en.properties (German) Remove ampersand in non-menu localizations New translations JabRef_en.properties (German) New translations Menu_en.properties (German) New translations Menu_en.properties (German) New translations JabRef_en.properties (Vietnamese) New translations JabRef_en.properties (Italian) New translations Menu_en.properties (Italian) New translations JabRef_en.properties (Indonesian) New translations Menu_en.properties (Indonesian) New translations JabRef_en.properties (Greek) New translations Menu_en.properties (Greek) New translations Menu_en.properties (Japanese) New translations JabRef_en.properties (German) ... # Conflicts: # build.gradle

derTimme added 3 commits November 30, 2017 12:59

Add an initial WIP version of an IACR eprint fetcher

e23fee2

Refactor IACR fetcher and improve error handling

096d857

Localize error messages in IACR fetcher

20ad210

Siedlerchr reviewed Nov 30, 2017

derTimme added 5 commits November 30, 2017 15:00

Refactoring of IACR fetcher

83ac917

More refactoring of IACR fetcher

95e431c

Add tests for IACR fetcher

1076c1d

Make checkstyle happy

89e237b

derTimme mentioned this pull request Dec 1, 2017

Add help for the IACR eprint fetcher JabRef/user-documentation#176

Closed

derTimme added 2 commits December 1, 2017 13:29

Add IACR fetcher to changelog

fad3203

Fix the encoding of entries retrieved via IACR fetcher

025eb88

derTimme changed the title ~~[WIP] Fetcher for IACR eprints~~ Fetcher for IACR eprints Dec 1, 2017

tobiasdiez requested changes Dec 1, 2017

tobiasdiez added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Dec 1, 2017

Siedlerchr reviewed Dec 1, 2017

derTimme added 3 commits December 1, 2017 14:49

Migrate IACR fetcher tests to junit5

4054e16

Use enum constant for IACR fetcher import encoding

53be6f7

Fix a bug in the IACR fetcher tests

58034e5

Siedlerchr reviewed Dec 2, 2017

tobiasdiez reviewed Dec 2, 2017

derTimme added 2 commits December 2, 2017 13:45

Migrate IACR fetcher to java8 date classes

4fa1534

Remove abstracts from IACR fetcher tests

8a091b0

derTimme changed the title ~~Fetcher for IACR eprints~~ [WIP] Fetcher for IACR eprints Dec 2, 2017

tobiasdiez removed the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Dec 2, 2017

tobiasdiez requested changes Dec 7, 2017

derTimme added 2 commits December 7, 2017 16:39

Make the localization strings in the IACR fetcher reusable

2ac6451

Disable a long running test for the IACR fetcher

7f05741

derTimme changed the title ~~[WIP] Fetcher for IACR eprints~~ Fetcher for IACR eprints Dec 7, 2017

tobiasdiez added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Dec 7, 2017

Fix codacy issues in the IACR eprint fetcher

ff6bfa7

tobiasdiez approved these changes Dec 10, 2017

lenhard approved these changes Dec 10, 2017

derTimme and others added 2 commits December 11, 2017 19:10

Merge localization files with master

f8b671d

Merge branch 'master' into iacr_eprint_fetcher

108b8a9

derTimme and others added 4 commits December 12, 2017 08:24

Merge remote-tracking branch 'upstream/master' into iacr_eprint_fetcher

4ebb7db

Fix duplicate keys in localization files

031aa87

Merge branch 'master' into iacr_eprint_fetcher

116de21

Merge branch 'master' into iacr_eprint_fetcher

079ba27

Siedlerchr merged commit c2d0070 into JabRef:master Dec 13, 2017

koppor mentioned this pull request May 31, 2022

Rework ICAR Fetcher #8876

Closed
