Update pdfbox to 2.0.0 and migrate from jempbox to xmpbox #1096

lenhard · 2016-04-04T17:16:01Z

This PR addresses #1004

There are significant API differences between jempbox and xmpbox. The current state of this PR is just a plain translation from jempbox to xmpbox to get the code to compile. The tests are not working yet, so there are probably some errors in the translation that need to be fixed. Also, Travis seems to have problems with xmpbox.

Comments from anyone who is familiar with XMP handling are very welcome.

Change in CHANGELOG.md described

Siedlerchr · 2016-04-04T18:20:52Z

src/main/java/net/sf/jabref/logic/xmp/XMPUtil.java

@@ -147,7 +136,7 @@ public static void writeXMP(String filename, BibEntry entry,

 if (meta.isPresent()) {

- List<XMPSchema> schemas = meta.get().getSchemasByNamespaceURI(XMPSchemaBibtex.NAMESPACE);
+ List<XMPSchema> schemas = meta.get().getAllSchemas();


I also experimented with pdfbox and arrived at this solution:
XMPSchemaBibtex bib = (XMPSchemaBibtex) meta.get().getSchema(XMPSchemaBibtex.class);
Edit: From my understanding, the code was simply looking for the BibTeX schema, and the pdfbox internal method already does that traversal.
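
To illustrate what "looking up by class instead of traversing the list yourself" amounts to, here is a minimal, library-free sketch. The types `Schema`, `DublinCore`, `BibtexSchema` and the helper `findSchema` are hypothetical stand-ins invented for this example; they are not the jempbox or xmpbox classes, and the real lookup may well behave differently.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-ins for schema types; NOT the jempbox/xmpbox classes.
class Schema {
    final String namespace;

    Schema(String namespace) {
        this.namespace = namespace;
    }
}

class DublinCore extends Schema {
    DublinCore() {
        super("http://purl.org/dc/elements/1.1/");
    }
}

class BibtexSchema extends Schema {
    BibtexSchema() {
        super("http://jabref.sourceforge.net/bibteXMP/");
    }
}

final class SchemaLookup {
    // Returns the first schema that is an instance of the requested type, already cast.
    static <T extends Schema> Optional<T> findSchema(List<? extends Schema> schemas, Class<T> type) {
        return schemas.stream()
                .filter(type::isInstance)
                .map(type::cast)
                .findFirst();
    }

    public static void main(String[] args) {
        List<Schema> schemas = List.of(new DublinCore(), new BibtexSchema());
        Optional<BibtexSchema> bib = findSchema(schemas, BibtexSchema.class);
        System.out.println(bib.map(s -> s.namespace).orElse("no BibTeX schema"));
    }
}
```

Returning an `Optional` keeps the "schema is missing" case explicit at the call site, which is roughly what the surrounding `meta.isPresent()` check is guarding against.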

Siedlerchr · 2016-04-04T18:47:35Z

In general, it could be helpful to have a look at the DublinCoreSchema implementation.
I think that could ease the creation of the BibTeX schema.

lenhard · 2016-04-05T20:01:32Z

Thanks for your comments! I am integrating them and am starting to get the tests working.

One problem I am facing is that xmpbox seems to leave out all rdf: tags. Are those important for us?

Edit: The rdf information seems to be inserted only upon serialization.

koppor · 2016-04-06T05:51:44Z

Regarding rdf:: This refs #938.

lenhard · 2016-04-06T08:31:48Z

@koppor: Thanks, this provides some context. In this PR, I'll only do the migration to the new pdf library though and not to a new format.

lenhard · 2016-04-07T10:47:04Z

@JabRef/developers I think I have run into a show-stopper when it comes to replacing jempbox with xmpbox.

The problem is that DomXmpParser, the parser that ships with xmpbox, is very strict about namespaces and cannot parse any XMP metadata that contains non-standard namespaces. Needless to say, our JabRef namespaces are not part of the PDF standards... The parser relies on the standard DOM facilities in Java, but it is so well encapsulated that it is impossible to inject additional namespaces in any fashion. The following test illustrates this in a nutshell:

    @Test
    public void testParsing() throws XmpParsingException {
        String testData = "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?><x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n" +
                "  <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n" +
                "    <rdf:Description xmlns:dc=\"http://purl.org/dc/elements/1.1/\" rdf:about=\"\">\n" +
                "      <dc:description>\n" +
                "        <rdf:Alt>\n" +
                "          <rdf:li xml:lang=\"x-default\">The success of the Linux operating system has demonstrated the viability of an alternative form of software development - open source software - that challenges traditional assumptions about software markets. Understanding what drives open source developers to participate in open source projects is crucial for assessing the impact of open source software. This article identifies two broad types of motivations that account for their participation in open source projects. The first category includes internal factors such as intrinsic motivation and altruism, and the second category focuses on external rewards such as expected future returns and personal needs. This article also reports the results of a survey administered to open source programmers.</rdf:li>\n" +
                "        </rdf:Alt>\n" +
                "      </dc:description>\n" +
                "      <dc:creator>\n" +
                "        <rdf:Seq>\n" +
                "          <rdf:li>Kelly Clarkson</rdf:li>\n" +
                "          <rdf:li>Ozzy Osbourne</rdf:li>\n" +
                "        </rdf:Seq>\n" +
                "      </dc:creator>\n" +
                "      <dc:relation>\n" +
                "        <rdf:Bag>\n" +
                "          <rdf:li>bibtex/bibtexkey/Clarkson06</rdf:li>\n" +
                "          <rdf:li>bibtex/booktitle/Catch-22</rdf:li>\n" +
                "          <rdf:li>bibtex/journal/International Journal of High Fidelity</rdf:li>\n" +
                "          <rdf:li>bibtex/pdf/YeKis03 - Towards.pdf</rdf:li>\n" +
                "        </rdf:Bag>\n" +
                "      </dc:relation>\n" +
                "      <dc:contributor>\n" +
                "        <rdf:Bag>\n" +
                "          <rdf:li>Huey Duck</rdf:li>\n" +
                "          <rdf:li>Dewey Duck</rdf:li>\n" +
                "          <rdf:li>Louie Duck</rdf:li>\n" +
                "        </rdf:Bag>\n" +
                "      </dc:contributor>\n" +
                "      <dc:subject>\n" +
                "        <rdf:Bag>\n" +
                "          <rdf:li>peanut</rdf:li>\n" +
                "          <rdf:li>butter</rdf:li>\n" +
                "          <rdf:li>jelly</rdf:li>\n" +
                "        </rdf:Bag>\n" +
                "      </dc:subject>\n" +
                "      <dc:title>\n" +
                "        <rdf:Alt>\n" +
                "          <rdf:li xml:lang=\"x-default\">Hypersonic ultra-sound</rdf:li>\n" +
                "        </rdf:Alt>\n" +
                "      </dc:title>\n" +
                "      <dc:date>\n" +
                "        <rdf:Seq>\n" +
                "          <rdf:li>1982-07</rdf:li>\n" +
                "        </rdf:Seq>\n" +
                "      </dc:date>\n" +
                "      <dc:format>application/pdf</dc:format>\n" +
                "      <dc:type>\n" +
                "        <rdf:Bag>\n" +
                "          <rdf:li>InProceedings</rdf:li>\n" +
                "        </rdf:Bag>\n" +
                "      </dc:type>\n" +
                "    </rdf:Description>\n" +
                "    <rdf:Description xmlns:bibtex=\"http://jabref.sourceforge.net/bibteXMP/\" rdf:about=\"\">\n" +
                "      <bibtex:abstract>The success of the Linux operating system has demonstrated the viability of an alternative form of software development - open source software - that challenges traditional assumptions about software markets. Understanding what drives open source developers to participate in open source projects is crucial for assessing the impact of open source software. This article identifies two broad types of motivations that account for their participation in open source projects. The first category includes internal factors such as intrinsic motivation and altruism, and the second category focuses on external rewards such as expected future returns and personal needs. This article also reports the results of a survey administered to open source programmers.</bibtex:abstract>\n" +
                "      <bibtex:author>\n" +
                "        <rdf:Seq>\n" +
                "          <rdf:li>Kelly Clarkson</rdf:li>\n" +
                "          <rdf:li>Ozzy Osbourne</rdf:li>\n" +
                "        </rdf:Seq>\n" +
                "      </bibtex:author>\n" +
                "      <bibtex:bibtexkey>Clarkson06</bibtex:bibtexkey>\n" +
                "      <bibtex:booktitle>Catch-22</bibtex:booktitle>\n" +
                "      <bibtex:editor>\n" +
                "        <rdf:Seq>\n" +
                "          <rdf:li>Huey Duck</rdf:li>\n" +
                "          <rdf:li>Dewey Duck</rdf:li>\n" +
                "          <rdf:li>Louie Duck</rdf:li>\n" +
                "        </rdf:Seq>\n" +
                "      </bibtex:editor>\n" +
                "      <bibtex:journal>International Journal of High Fidelity</bibtex:journal>\n" +
                "      <bibtex:keywords>peanut, butter, jelly</bibtex:keywords>\n" +
                "      <bibtex:month>#jul#</bibtex:month>\n" +
                "      <bibtex:pdf>YeKis03 - Towards.pdf</bibtex:pdf>\n" +
                "      <bibtex:title>Hypersonic ultra-sound</bibtex:title>\n" +
                "      <bibtex:year>1982</bibtex:year>\n" +
                "      <bibtex:entrytype>inproceedings</bibtex:entrytype>\n" +
                "    </rdf:Description>\n" +
                "  </rdf:RDF>\n" +
                "</x:xmpmeta><?xpacket end=\"w\"?>";
        InputStream is = new ByteArrayInputStream(testData.getBytes(StandardCharsets.UTF_8));
        DomXmpParser parser = new DomXmpParser();
        XMPMetadata meta = parser.parse(is);
    }

The result is:

   org.apache.xmpbox.xml.XmpParsingException: Cannot find a definition for the namespace http://jabref.sourceforge.net/bibteXMP/
        at org.apache.xmpbox.xml.DomXmpParser.checkPropertyDefinition(DomXmpParser.java:853)
        at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:290)
        at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:234)
        at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:198)
        at net.sf.jabref.logic.xmp.XMPUtilTest.testParsing(XMPUtilTest.java:1444)

So unless there is something I have missed, the question is how to proceed. I do not think we should write our own custom XMP parser as long as jempbox still exists. We might be able to update to pdfbox-2.0.0 and keep jempbox, but that needs to be evaluated separately.
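
To make the failure mode concrete without depending on xmpbox, the following JDK-only sketch parses a packet with the standard namespace-aware DOM parser and reports every element namespace that is not on an allow-list. This is only an illustration of the kind of strictness described above: the class `XmpNamespaceCheck`, the method `unknownNamespaces`, and the three-entry `KNOWN` set are assumptions made up for this example, and none of this is DomXmpParser's actual implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

final class XmpNamespaceCheck {
    // Illustrative allow-list only; a real strict parser knows many more standard schemas.
    static final Set<String> KNOWN = Set.of(
            "adobe:ns:meta/",
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
            "http://purl.org/dc/elements/1.1/");

    // Collects the namespaces of all elements that are not on the allow-list.
    static Set<String> unknownNamespaces(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Set<String> unknown = new TreeSet<>();
            NodeList elements = doc.getElementsByTagName("*");
            for (int i = 0; i < elements.getLength(); i++) {
                String ns = elements.item(i).getNamespaceURI();
                if (ns != null && !KNOWN.contains(ns)) {
                    unknown.add(ns);
                }
            }
            return unknown;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
                + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
                + "<rdf:Description rdf:about=\"\" xmlns:bibtex=\"http://jabref.sourceforge.net/bibteXMP/\">"
                + "<bibtex:bibtexkey>Clarkson06</bibtex:bibtexkey>"
                + "</rdf:Description></rdf:RDF></x:xmpmeta>";
        System.out.println(unknownNamespaces(xml));
    }
}
```

Run against a cut-down version of the packet from the test, this should report only the bibteXMP namespace. A lenient parser would carry such namespaces through as opaque properties, whereas a strict one treats a non-empty result as an error.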

Siedlerchr · 2016-04-07T11:02:55Z

From what I see, we are not the only ones who have problems with the xmpbox DomXmpParser.
Maybe you could ask on the pdfbox mailing list whether there is a way to get it done.

koppor · 2016-04-08T05:38:56Z

+1 for asking on the mailing list. Or report an issue at https://issues.apache.org/jira/browse/PDFBOX/. Others seem to have had issues too: https://issues.apache.org/jira/browse/PDFBOX-2416.

Are we sure that old JabRef versions wrote the correct XMP data? 😇

Do we really need that XMP thing? Shouldn't we replace it in the long term with something else? See #938 (comment)

I cannot really judge now, because I have too little knowledge about this metadata thing in PDFs.

lenhard · 2016-04-08T11:05:49Z

Ok, I will ask on the mailing list, but I get the feeling that the developers of pdfbox switched to xmpbox because they want strict parsing (i.e., rejecting non-standard extensions to XMP metadata).

Regarding the relevance of the XMP feature, I really have no clue. I am not using it and do not know anyone who does. If we do not need it, I would be very happy to throw it away. Is there any chance of finding someone who knows and uses the feature and can shed some light on this?

We could disable it for v3.3 and wait until someone complains ;-)

lenhard · 2016-04-12T11:33:13Z

And here is the reply from the pdfbox mailing list:

This is a known problem, yes xmpbox does not support custom namespaces,
this was noticed too late (xmpbox is closely related to preflight, which
checks for PDF/A). It is on the list of things to discuss for 2.1

"- discussion/decision on XMP (shall we enhance XMPBox, restore Jempbox,
base on Adobe's XMP library, join forces with the FOP project …)"

Until then, the workaround is to keep using jempbox.

So that pretty much says it. For now, we cannot switch to xmpbox. I'd suggest leaving this PR open until there is a new release of xmpbox.

koppor · 2016-04-12T21:10:02Z

I am just closing the issue. We will find it again when querying for on-hold issues.

tobiasdiez · 2016-05-19T21:41:51Z

What is the status here? I couldn't find any related bug on https://issues.apache.org/jira/browse/PDFBOX/fixforversion/12328837/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel.

lenhard · 2016-05-20T06:38:22Z

There is no change, really. We cannot use the most recent version of pdfbox, so our options are:

1. Completely re-write XMP-handling with a different library
2. Encode everything into the dublin core schema instead of a BibTeX schema
3. Wait and see if custom schemas are reenabled with pdfbox 2.1

Currently, we are going for option 3, i.e., waiting for pdfbox 2.1. However, that might be a long wait. "Long" as in "years".

koppor · 2016-05-21T10:23:00Z

👍 for dublin core. Seems to be the best option.
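
For context on what "encoding everything into Dublin Core" could look like: the test data earlier in this thread already stores BibTeX fields in `dc:relation` as strings of the form `bibtex/<field>/<value>`. The sketch below round-trips a field map through that string form. The class `DublinCoreRelationCodec` and its methods are hypothetical names for this example, not JabRef code, and the sketch assumes that field names never contain a `/` (values may).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DublinCoreRelationCodec {
    private static final String PREFIX = "bibtex/";

    // Encodes each field as "bibtex/<field>/<value>", the form seen in dc:relation above.
    static List<String> encode(Map<String, String> fields) {
        List<String> relations = new ArrayList<>();
        fields.forEach((field, value) -> relations.add(PREFIX + field + "/" + value));
        return relations;
    }

    // Decodes entries written by encode; entries without the "bibtex/" prefix are skipped.
    static Map<String, String> decode(List<String> relations) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String relation : relations) {
            if (!relation.startsWith(PREFIX)) {
                continue;
            }
            String rest = relation.substring(PREFIX.length());
            int slash = rest.indexOf('/'); // assumption: field names contain no '/'
            if (slash <= 0) {
                continue; // missing field name or missing separator
            }
            fields.put(rest.substring(0, slash), rest.substring(slash + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("bibtexkey", "Clarkson06");
        fields.put("journal", "International Journal of High Fidelity");
        List<String> relations = encode(fields);
        relations.forEach(System.out::println);
        System.out.println(decode(relations).equals(fields));
    }
}
```

Splitting on the first `/` after the prefix means values such as URLs or file paths survive the round trip. Fields that have a natural Dublin Core home (title, creator, date) would presumably go into the corresponding `dc:` properties instead.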

Update pdfbox and fontbox from 1.8.13 to 2.0.8 and migrate from jempbox to xmpbox. See pull JabRef#1096. Next step: Writing test cases for XMPUtil (DublinCore).

This fixes #938
- Reading and writing multiple DublinCore entries works: XMPUtilWriter supports multiple metadata entries in DublinCore and a single entry in the PDDocumentInformation. If you want to test the reading of multiple entries, the PDF file JabRef_multipleMetaEntries.pdf contains three metadata entries in DublinCore for testing locally.
- Removed too much code when refactoring the XMPUtil. Non-XMP metadata are also relevant when retrieving org.apache.pdfbox.pdmodel.PDDocumentInformation
- Update pdfbox and fontbox from 1.8.13 to 2.0.8 and migrate from jempbox to xmpbox. See pull #1096.
- Refactor extraction from DublinCoreSchema
- The tests cover the most important use cases, which include reading and writing metadata from PDF files. Both formats, DublinCore and PDMetadata (which is not XMP metadata), are tested.
- Separated XMPUtils into a reader and a writer utility class.
- Add meaningful names in DublinCoreExtractor and use StringUtils.isNullOrEmpty
- Log exception in XMPUtilShared

koppor · 2018-05-11T05:49:11Z

This was the basis for #3710, so this is integrated and not a freeze anymore.

lenhard · 2018-05-12T20:59:04Z

I am glad to hear that my work was of some use in the end :)

lenhard added the type: enhancement label Apr 4, 2016

lenhard added 2 commits April 4, 2016 20:08

Update pdfbox to 2.0.0 and migrate from jempbox to xmpbox

9f33b94

Add missing updates to pdfbox

ba1dc69

Siedlerchr reviewed Apr 4, 2016

lenhard added 6 commits April 5, 2016 18:48

Replace unnecessary getters with usages of superclass methods

95217dc

Fix wrong sorting of constructor parameters

642e14d

Remove unused methods and imports

e6a902b

Retrieve schema by class

4506785

Fix compile errors in XMPSchemaBibtexTest

e49b221

Fix compile errors in XMPUtilTest

1cd9f9f

lenhard added 2 commits April 5, 2016 22:58

Fix testGetSetPersonList and testGetAllProperties

faa683a

Check for existence of rdf data in serialized version

9cf8907

lenhard added 2 commits April 6, 2016 10:53

Make all tests in XMPSchemaBibtexTest pass

0b34add

Add failing parsing test

38f1137

lenhard mentioned this pull request Apr 8, 2016

BibTeXML vs. bibteXMP #938

Closed

Fix creation of DC schema

055aa42

lenhard added the on-hold label Apr 12, 2016

lenhard changed the title ~~[WIP] Update pdfbox to 2.0.0 and migrate from jempbox to xmpbox~~ Update pdfbox to 2.0.0 and migrate from jempbox to xmpbox Apr 12, 2016

koppor closed this Apr 12, 2016

koppor mentioned this pull request Apr 12, 2016

Update Apache PDFBox from 1.8.11 to 2.0.0 #1004

Closed

stefan-kolb added the on-hold label Aug 9, 2016

lenhard mentioned this pull request Oct 6, 2016

[WIP] Show PDF-Comments in JabRef #1883

Closed

14 tasks

koppor mentioned this pull request Dec 12, 2016

Reenable more tests #2371

Merged

7 tasks

tobiasdiez mentioned this pull request Mar 30, 2017

Add PDF Viewer #2692

Merged

6 tasks

koppor mentioned this pull request Sep 18, 2017

Exceptions regarding XMP #3226

Closed

stefan-kolb deleted the update-pdfbox branch February 17, 2018 10:12

koppor removed the freeze label May 11, 2018
