[WIP] Dublin Core #3710

johannes-manner · 2018-02-07T19:34:41Z

Fixes #938 : Only supporting Dublin Core

Removed JabRef namespace for XMP 'xmlns:bibtex'.
Removed some unused methods in gui/importer/EntryFromPDFCreator.java (discussed with @koppor and @stefan-kolb)
Deleted tests for XMPUtil. (New ones are in progress).

Change in CHANGELOG.md described
Tests created for changes
Screenshots added (for bigger UI changes)
Manually tested changed features in running JabRef
Check documentation status (Issue created for outdated help page at help.jabref.org?)

Fixes JabRef#938

PdfXmpImporterTest:testImportEntries() PdfXmpImporterTest:testIsRecognizedFormat() Removed too much code when refactoring XMPUtil. Non-XMP metadata is also relevant when retrieving org.apache.pdfbox.pdmodel.PDDocumentInformation.

…-core

Update pdfbox and fontbox from 1.8.13 to 2.0.8 and migrate from jempbox to xmpbox. See pull JabRef#1096. Next step: Writing test cases for XMPUtil (DublinCore).

koppor · 2018-02-13T16:40:04Z

build.gradle

- compile 'org.apache.pdfbox:pdfbox:1.8.13'
- compile 'org.apache.pdfbox:fontbox:1.8.13'
- compile 'org.apache.pdfbox:jempbox:1.8.13'
+ // update possible because custom metadata format was dropped, https://github.com/JabRef/jabref/pull/3710 - https://github.com/JabRef/jabref/pull/1096


Just remove that line

Travis build failed because of an exceeded NPath complexity of a single method. Method was refactored and cleaned up.

koppor · 2018-02-15T12:19:28Z

src/main/java/org/jabref/logic/xmp/XMPUtil.java

+
+ BibEntry entry = new BibEntry();
+
+ // Contributor -> Editor


I would not comment these functions as the implementation might change... The comments could go at setEditor etc. as method comments.

Ok, I will change this with the next commit.

…-core Merge

Siedlerchr · 2018-02-15T14:21:41Z

src/main/java/org/jabref/cli/XMPUtilMain.java

+ if (entries.isEmpty()) {
+ System.err.println("Could not find BibEntry in " + args[0]);
+ } else {
+ XMPUtil.writeXMP(new File(args[1]), entries, result.getDatabase(), false, xmpPreferences);


Shouldn't this also be a Path?

No. The parameter is a File; in some places the file path can also be passed as a String.

Siedlerchr · 2018-02-15T14:24:09Z

src/main/java/org/jabref/cli/XMPUtilMain.java

 }

 if (args[0].endsWith(".bib") && args[1].endsWith(".pdf")) {
- ParserResult result = new BibtexParser(importFormatPreferences, Globals.getFileUpdateMonitor()).parse(new FileReader(args[0]));
+ try (FileReader reader = new FileReader(args[0])) {
+ ParserResult result = new BibtexParser(importFormatPreferences, Globals.getFileUpdateMonitor()).parse(reader);


I am not sure, but does the parser accept an InputStream? Then you could use Files.newInputStream...
That would actually be preferable to a FileReader.

No, there is no overloaded parse method that accepts an InputStream.
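Since the parser takes a Reader, one middle ground would be Files.newBufferedReader, which stays with Path and makes the charset explicit. A minimal sketch: the line-counting `countEntries(Reader)` is only a hypothetical stand-in for the real parse call, and UTF-8 is an assumed encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class ReaderFromPathSketch {

    // Hypothetical stand-in for a parser that only accepts a Reader:
    // it counts the lines that start a BibTeX entry.
    static long countEntries(Reader reader) {
        return new BufferedReader(reader).lines()
                .filter(line -> line.trim().startsWith("@"))
                .count();
    }

    // Path-based entry point: the reader is opened from the Path and
    // closed automatically by try-with-resources.
    static long countEntries(Path bibFile) throws IOException {
        try (Reader reader = Files.newBufferedReader(bibFile, StandardCharsets.UTF_8)) {
            return countEntries(reader);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sketch", ".bib");
        try {
            Files.write(tmp, Arrays.asList("@Book{a,", "}", "@Article{b,", "}"), StandardCharsets.UTF_8);
            System.out.println(countEntries(tmp) + " entries");
        } finally {
            Files.delete(tmp);
        }
    }
}
```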

The tests cover the most important use cases, which include reading and writing metadata from PDF files. Both formats, DublinCore and PDMetadata (which is not XMP metadata), are tested.

johannes-manner · 2018-02-15T16:40:35Z

This pull request is ready for the final review and for merging into master :)

tobiasdiez

Thanks for your work. Finally, something is moving concerning pdf support 🥇 .

The code looks pretty good and I only have a bunch of minor remarks.

tobiasdiez · 2018-02-15T19:44:32Z

src/main/java/org/jabref/logic/xmp/XMPUtil.java

- throw new EncryptedPdfsNotSupportedException();
- }
- }
+ public static PDDocument loadWithAutomaticDecryption(File file) throws IOException {


Please use Path as long as possible (i.e. until we use an interface that we cannot control and which expects a traditional File). This remark applies to this method and a few other places.
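A minimal sketch of what "Path as long as possible" could look like. `legacyOpen(File)` is a hypothetical stand-in for a library call we cannot change; it is not the PDFBox API.

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

final class PathBoundarySketch {

    // Hypothetical third-party style method that still expects a java.io.File.
    static String legacyOpen(File file) {
        return "opened " + file.getName();
    }

    // Our own code accepts and passes around Path ...
    static String load(Path path) {
        // ... and converts to File only at the boundary to the legacy API.
        return legacyOpen(path.toFile());
    }

    // String-based callers are converted to Path as early as possible.
    static String load(String fileName) {
        return load(Paths.get(fileName));
    }

    public static void main(String[] args) {
        System.out.println(load("docs/paper.pdf"));
    }
}
```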

tobiasdiez · 2018-02-15T19:48:18Z

src/main/java/org/jabref/logic/xmp/XMPUtil.java

-  BibDatabase database, XMPPreferences xmpPreferences) throws IOException, TransformerException {
- XMPUtil.writeXMP(new File(fileName), entry, database, xmpPreferences);
+ BibDatabase database, XMPPreferences xmpPreferences) throws IOException, TransformerException {
+ XMPUtil.writeXMP(Paths.get(fileName), entry, database, xmpPreferences);


Since you are already touching most of the code in this class, it would be nice if you could rework the whole class from a collection of static helper methods to a "normal" class with instance methods. It appears that the file is passed to every single method so this is a natural candidate for a constructor argument. I'm not sure how much code is shared between writing and reading of xmp, but maybe it makes sense to split the class in an XmpReader and XmpWriter.

I tried to separate the logic into two utility classes and a shared utility class.
Currently I don't see the benefit of having a "normal" reader and writer. I introduced two Extractors for the DocumentInformation and the DublinCore format. Maybe that's what you intended.

I pushed a WIP state for another review. Maybe you have further suggestions, how to structure the package.

tobiasdiez · 2018-02-15T19:52:50Z

src/main/java/org/jabref/logic/xmp/XMPUtil.java

+
+ if (entry.isPresent()) {
+ if (entry.get().getType() == null) {
+ entry.get().setType(BibEntry.DEFAULT_TYPE);


Can you please move the part, where the type is set to a default, to the getBibtexEntry method.

tobiasdiez · 2018-02-15T20:05:23Z

src/test/java/org/jabref/logic/xmp/XMPUtilTest.java

- }
+ public void testReadEmtpyMetadata() throws IOException, URISyntaxException {
+ List<BibEntry> entries = XMPUtil.readXMP(XMPUtil.class.getResource("/org/jabref/logic/xmp/empty_metadata.pdf").toURI().getPath(), xmpPreferences);
+ Assert.assertEquals(0, entries.size());


Use assertEquals(Collections.emptyList(), entries) to get a better error message in case the test fails.

tobiasdiez · 2018-02-15T20:09:50Z

src/test/java/org/jabref/logic/xmp/XMPUtilTest.java

- "Patterson, David and Arvind and Asanov\\'\\i{}c, Krste and Chiou, Derek and Hoe, James and Kozyrakis, Christos and Lu, S{hih-Lien} and Oskin, Mark and Rabaey, Jan and Wawrzynek, John");
+ // read a bib entry from the tests before
+ List<BibEntry> entries = XMPUtil.readXMP(XMPUtil.class.getResource("/org/jabref/logic/xmp/PD_metadata.pdf").toURI().getPath(), xmpPreferences);
+ BibEntry entry = entries.get(0);


I think it is easier if you just create the entry by hand (and then write it, read and compare).

If there is no further reason (besides simplicity), I would stay with this.

tobiasdiez · 2018-02-15T20:12:12Z

src/test/java/org/jabref/logic/xmp/XMPUtilTest.java

- * Make sure that the privacy filter works.
+ * The month attribute in DublinCore is the complete name of the month, e.g. March.
+ * In JabRef, the format is #mar# instead. To get a working unit test, the JabRef's
+ * bib-entry is altered from #mar# to {March}.


Is there a reason why the month shouldn't be post-processed to get a proper bibtex value? We even have the Month class which handles parsing and converting to the correct output.
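A minimal sketch of such a post-processing step. It uses java.time.Month as a stand-in because JabRef's own Month class is not shown in this thread; the method name `toBibtexMonth` is hypothetical.

```java
import java.time.Month;
import java.util.Locale;
import java.util.Optional;

final class MonthConversionSketch {

    // Converts a full English month name such as "March" into the
    // BibTeX string form "#mar#". Returns empty for unknown input.
    static Optional<String> toBibtexMonth(String fullName) {
        if (fullName == null) {
            return Optional.empty();
        }
        try {
            Month month = Month.valueOf(fullName.trim().toUpperCase(Locale.ROOT));
            // BibTeX month strings are the first three letters, lower case.
            String shortName = month.name().substring(0, 3).toLowerCase(Locale.ROOT);
            return Optional.of("#" + shortName + "#");
        } catch (IllegalArgumentException e) {
            // Not an English month name.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toBibtexMonth("March").orElse("unknown"));
    }
}
```

With a conversion like this on the reading side, the test would not need to alter the entry from #mar# to {March}.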

tobiasdiez · 2018-02-15T20:16:30Z

build.gradle

- compile 'org.apache.pdfbox:fontbox:1.8.13'
- compile 'org.apache.pdfbox:jempbox:1.8.13'
+ compile 'org.apache.pdfbox:pdfbox:2.0.8'
+ compile 'org.apache.pdfbox:fontbox:2.0.8'


Further below (starting at line 218), we added exceptions for the update dependencies task. These are now invalid and should be removed.

Siedlerchr · 2018-02-16T14:26:26Z

src/main/java/org/jabref/logic/xmp/DocumentInformationExtractor.java

+
+ private void extractSubject() {
+ String s = documentInformation.getSubject();
+ if (s != null) {


We once agreed on avoiding single-character variable names. Please use a meaningful name here and for the others.

Siedlerchr · 2018-02-16T14:30:31Z

src/main/java/org/jabref/logic/xmp/DocumentInformationExtractor.java

+
+ private void extractOtherFields() {
+ COSDictionary dict = documentInformation.getCOSObject();
+ for (Map.Entry<COSName, COSBase> o : dict.entrySet()) {


I think you could encapsulate part of this as a stream, at least the filtering and mapping to the key.
https://www.mkyong.com/java8/java-8-filter-a-map-examples/

I would stay with the implemented version. Currently it works and I'm not a stream fanatic 😋
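For reference, the stream variant suggested above could look roughly like this. It is a sketch on a plain Map<String, String> rather than a COSDictionary, and the "bibtex/" key prefix is an assumption made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class FieldFilterSketch {

    // Assumed prefix that marks custom fields; not taken from the PR.
    static final String PREFIX = "bibtex/";

    // Keeps only the prefixed entries and maps each key to its bare field name.
    static Map<String, String> extractFields(Map<String, String> raw) {
        return raw.entrySet().stream()
                .filter(entry -> entry.getKey().startsWith(PREFIX))
                .filter(entry -> entry.getValue() != null) // toMap rejects null values
                .collect(Collectors.toMap(
                        entry -> entry.getKey().substring(PREFIX.length()),
                        Map.Entry::getValue,
                        (first, second) -> second, // last value wins on duplicates
                        LinkedHashMap::new));
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("bibtex/title", "Some Title");
        raw.put("Producer", "Some Tool");
        System.out.println(extractFields(raw));
    }
}
```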

Siedlerchr · 2018-02-16T14:31:30Z

src/main/java/org/jabref/logic/xmp/DublinCoreExtractor.java

+ */
+ private void extractAuthor() {
+ List<String> creators = dcSchema.getCreators();
+ if ((creators != null) && !creators.isEmpty()) {


I think we have a method in our own StringUtil class which combines both checks: isNullOrEmpty

Done with 366aeed. Please check the conditions again. It's not that natural to negate the isNullOrEmpty... Maybe another function would help in the StringUtils.

johannes-manner · 2018-02-16T14:32:58Z

For my general understanding and my newbie state:

 public static void writeXMP(Path path,
            Collection<BibEntry> bibtexEntries, BibDatabase database,
            boolean writePDFInfo, XMPPreferences xmpPreferences) throws IOException, TransformerException {

        Collection<BibEntry> resolvedEntries;
        if (database == null) {
            resolvedEntries = bibtexEntries;
        } else {
            resolvedEntries = database.resolveForStrings(bibtexEntries, false);
        }

        try (PDDocument document = PDDocument.load(path.toFile())) {
            if (document.isEncrypted()) {
                throw new EncryptedPdfsNotSupportedException();
            }

            if (writePDFInfo && (resolvedEntries.size() == 1)) {                // 1
                XMPUtilWriter.writeDocumentInformation(document, resolvedEntries
                        .iterator()
                        .next(), null, xmpPreferences);
                XMPUtilWriter.writeDublinCore(document, resolvedEntries, null, xmpPreferences);
            }

            PDDocumentCatalog catalog = document.getDocumentCatalog();
            PDMetadata metaRaw = catalog.getMetadata();

For me, it makes no sense to write more than one BibEntry to the metadata of a PDF file. Currently the implementation in XMPUtilWriter is also limited to a single element, but implemented as a list. (See line 260 and 110).

Can I drop the list?

@Siedlerchr
@koppor
@tobiasdiez

Siedlerchr · 2018-02-16T14:34:36Z

src/main/java/org/jabref/logic/xmp/DublinCoreExtractor.java

+ List<String> dates = dcSchema.getUnqualifiedSequenceValueList("date");
+ if ((dates != null) && !dates.isEmpty()) {
+ String date = dates.get(0).trim();
+ Calendar c = null;


Please stick to the Java 8 date and time API: use a LocalDateTime together with a DateTimeFormatter, e.g.
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm");
LocalDateTime dateTime = LocalDateTime.parse(str, formatter);
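A slightly fuller sketch of this suggestion, with the parse failure handled. The pattern is the one from the comment above; dates in real Dublin Core metadata may use other formats, so treat the pattern as an assumption.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

final class DateParsingSketch {

    // Pattern from the review comment; assumed, not verified against real PDFs.
    private static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm");

    // Returns the parsed date, or empty if the text does not match the pattern.
    static Optional<LocalDateTime> parse(String text) {
        if (text == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(LocalDateTime.parse(text.trim(), FORMATTER));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        parse("2018-02-16 14:34").ifPresent(date ->
                System.out.println(date.getYear() + " / " + date.getMonth()));
    }
}
```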

Siedlerchr · 2018-02-16T14:36:25Z

src/main/java/org/jabref/logic/xmp/DublinCoreExtractor.java

+ */
+ private void extractBibTexFields() {
+ List<String> relationships = dcSchema.getRelations();
+ if (relationships != null) {


Why not use a stream with filter and map?

Siedlerchr · 2018-02-16T14:36:49Z

src/main/java/org/jabref/logic/xmp/DublinCoreExtractor.java

+ */
+ private void extractType() {
+ List<String> l = dcSchema.getTypes();
+ if ((l != null) && !l.isEmpty()) {


Please avoid single-character variable names.

Done with 366aeed

Siedlerchr · 2018-02-16T14:38:40Z

src/main/java/org/jabref/logic/xmp/DublinCoreExtractor.java

+ continue;
+ }
+
+ if (FieldName.EDITOR.equals(field.getKey())) {


I would prefer a switch/case here, but that's just my style. The others don't like them that much ;) So you can change it or leave it.

I leave it :)

…lOrEmpty

Siedlerchr · 2018-02-16T14:43:01Z

src/main/java/org/jabref/logic/xmp/XMPUtilReader.java

+ try (PDDocument document = loadWithAutomaticDecryption(path)) {
+ Optional<XMPMetadata> meta = XMPUtilReader.getXMPMetadata(document);
+
+ if (meta.isPresent()) {


You can use chained maps on the optionals for each step.
This should work:
meta.map(m -> m.getDublinCore()).map(dc -> new DublinCore...).map(entry -> extractBibEntry).ifPresent(result::add)

I find the current state more readable than the chained map operations... Do you have another suggestion?

The chaining has the advantage that you don't have to care about nulls or empty optionals inside. The code at the end is only executed if none of the values is null or empty.
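A small sketch of that behavior. `Metadata` and `Schema` are hypothetical stand-ins for the XMP metadata object and its Dublin Core schema; only java.util.Optional is real API here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class OptionalChainSketch {

    // Hypothetical stand-in for the metadata object; the schema may be null.
    static final class Metadata {
        private final Schema schema;
        Metadata(Schema schema) { this.schema = schema; }
        Schema getSchema() { return schema; }
    }

    // Hypothetical stand-in for the schema; the title may be null.
    static final class Schema {
        private final String title;
        Schema(String title) { this.title = title; }
        String getTitle() { return title; }
    }

    static List<String> collectTitles(Optional<Metadata> meta) {
        List<String> result = new ArrayList<>();
        meta.map(Metadata::getSchema)   // becomes empty if the schema is null
            .map(Schema::getTitle)      // becomes empty if the title is null
            .map(String::trim)
            .ifPresent(result::add);    // runs only if every step produced a value
        return result;
    }

    public static void main(String[] args) {
        System.out.println(collectTitles(Optional.of(new Metadata(new Schema(" A Title ")))));
        System.out.println(collectTitles(Optional.of(new Metadata(null))));
    }
}
```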

Did not see your question. Theoretically it could be that someone has crossref-linked bib entries (e.g. a bib entry for a book and another one for a chapter referring to the book),
but that's really an edge case. I think for most use cases one entry is enough. And if the code doesn't support it, then drop it. Let's see what @koppor has to say.

Siedlerchr · 2018-02-16T14:45:08Z

src/main/java/org/jabref/logic/xmp/XMPUtilShared.java

+ LOGGER.info("Encryption not supported by XMPUtil");
+ return false;
+ } catch (IOException e) {
+ // happens if no metadata is found, no reason to log the exception


Please always log exceptions, at least at debug level, just to make sure that no other underlying exception is swallowed.

Done. Thanks for the comment 👍
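For illustration, a minimal sketch of the pattern. It uses java.util.logging as a stand-in for the project's logger (FINE is its debug-like level), and `MetadataSource` is a hypothetical interface.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

final class LogSwallowedExceptionSketch {

    private static final Logger LOGGER =
            Logger.getLogger(LogSwallowedExceptionSketch.class.getName());

    // Hypothetical source of metadata that may fail with an IOException.
    @FunctionalInterface
    interface MetadataSource {
        String read() throws IOException;
    }

    static boolean hasMetadata(MetadataSource source) {
        try {
            return source.read() != null;
        } catch (IOException e) {
            // Expected when no metadata is found, but still logged with the
            // exception attached so that other causes are not hidden.
            LOGGER.log(Level.FINE, "No metadata was found", e);
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasMetadata(() -> "<x:xmpmeta/>"));
        System.out.println(hasMetadata(() -> { throw new IOException("no metadata"); }));
    }
}
```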

Siedlerchr · 2018-02-16T14:45:37Z

src/main/java/org/jabref/logic/xmp/XMPUtilWriter.java

+ */
+ public static void writeXMP(Path file, BibEntry entry,
+ BibDatabase database, XMPPreferences xmpPreferences) throws IOException, TransformerException {
+ List<BibEntry> l = new LinkedList<>();


Reason for a linked list? In most cases ArrayList is sufficient

That's the question I posted above. This is code from a previous author and I want to drop the List altogether. Please comment on my question above.

Siedlerchr

In general the code looks good, just some code style improvements.

tobiasdiez · 2018-02-16T15:08:48Z

I have no real idea concerning your question about the list. I would say you could safely remove it, but I'm not sure. What happens if multiple entries have the same file linked and write their metadata into it (e.g. the pdf is a book and I have a bunch of bookchapters as separate entries)?

johannes-manner · 2018-02-16T15:15:42Z

XMPUtilWriter, line 147,148

Before new metadata is written to the PDF, all DublinCoreSchemas are removed. So I think, that there is currently no option to write more than one metaschema.
In your case, I would assume that the last written data is visible (others lost).

Siedlerchr · 2018-02-16T19:10:43Z

Well, does the Dublin Core schema specify multiple entries?

koppor · 2018-02-17T13:30:42Z

Specifications are available at http://dublincore.org/specifications/. The abstract model is there: http://dublincore.org/documents/abstract-model/

If I interpret that correctly, one record contains one description, which may contain multiple record sets.

This is the edge case, where one wants to write XMP to a proceedings.

Example:

One chapter: https://link.springer.com/chapter/10.1007/978-3-540-79230-7_4
Full proceedings: https://link.springer.com/book/10.1007/978-3-540-79230-7

XMPUtilWriter supports multiple metadata entries in DublinCore and a single entry in the PDDocumentInformation.

tobiasdiez · 2018-02-19T10:16:38Z

src/test/java/org/jabref/logic/xmp/XMPUtilWriterTest.java

+ " urldate = {2017-05-31},\r\n" +
+ "}";
+
+ private static final String vapnik2000 = "@Book{Vapnik2000,\r\n" +


Please create the BibEntries by hand (using new BibEntry(), setField) and not based on the string representation. The XMP test should be as autonomous as possible, especially they shouldn't fail if the BibParser is changed.

Reading multiple BibEntries in DublinCore format now also works. If you want to test the reading of multiple entries, the PDF file JabRef_multipleMetaEntries.pdf contains three metadata entries in DublinCore for testing locally.

tobiasdiez · 2018-02-20T10:46:37Z

src/main/java/org/jabref/logic/xmp/XMPUtilReader.java

@@ -30,7 +35,7 @@ private XMPUtilReader() {
 * @param path The path to read the XMPMetadata from.
 * @return The XMPMetadata object found in the file
 */
- public static Optional<XMPMetadata> readRawXMP(Path path) throws IOException {
+ public static Optional<List<XMPMetadata>> readRawXMP(Path path) throws IOException {


A List should always be non-null and thus it does not make sense to wrap an Optional around it. The not-present case corresponds to an empty list, which you can check using isEmpty().

Thanks for this comment. Done 👍
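A minimal sketch of the resulting shape. `readPackets` is a hypothetical stand-in for readRawXMP that treats each non-blank input line as one metadata packet.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class EmptyListSketch {

    // Returns all packets found; never null, and empty if there are none.
    static List<String> readPackets(String rawInput) {
        if (rawInput == null || rawInput.trim().isEmpty()) {
            return Collections.emptyList();
        }
        List<String> packets = new ArrayList<>();
        for (String line : rawInput.split("\n")) {
            if (!line.trim().isEmpty()) {
                packets.add(line.trim());
            }
        }
        return packets;
    }

    public static void main(String[] args) {
        // Callers check isEmpty() instead of unwrapping an Optional.
        List<String> packets = readPackets("first\n\nsecond");
        System.out.println(packets.isEmpty() ? "no metadata" : packets.size() + " packets");
    }
}
```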

Imports all metadata entries of a PDF file.

johannes-manner · 2018-02-20T11:29:39Z

I considered all comments to the source code and refactored my code.

The last commit alters the behavior of the XMP import:
The previous implementation imported only the first entry and dropped the others. The current implementation imports all metadata entries. My reasoning is as follows: it is easier to import additional entries and delete the ones that are not needed than to import single entries by hand when the needed entry is not the first one.

koppor

Micro comments at the test cases.

koppor · 2018-02-20T11:57:20Z