Update pdfbox to 2.0.0 and migrate from jempbox to xmpbox #1096

Closed
lenhard wants to merge 13 commits


lenhard
Member

@lenhard lenhard commented Apr 4, 2016

This PR addresses #1004

There are significant API changes between jempbox and xmpbox. The current state of this PR is just a plain translation from jempbox to xmpbox to get the code to compile. The tests are not working yet, so there are probably some errors in the translation that need to be fixed. Also, Travis seems to have problems with xmpbox.

Comments from anyone who is familiar with XMP handling are very welcome.

  • Change in CHANGELOG.md described

@@ -147,7 +136,7 @@ public static void writeXMP(String filename, BibEntry entry,

 if (meta.isPresent()) {
-    List<XMPSchema> schemas = meta.get().getSchemasByNamespaceURI(XMPSchemaBibtex.NAMESPACE);
+    List<XMPSchema> schemas = meta.get().getAllSchemas();

I also experimented with pdfbox and arrived at this solution:
XMPSchemaBibtex bib = (XMPSchemaBibtex) meta.get().getSchema(XMPSchemaBibtex.class);
Edit: As far as I understand, the code was simply looking for the BibTeX schema, and the pdfbox method already does that traversal internally.

@Siedlerchr
Member

In general, it could be helpful to have a look at the DublinCoreSchema implementation.
I think that could ease the creation of the BibTeX schema.

@lenhard
Member Author

lenhard commented Apr 5, 2016

Thanks for your comments! I am integrating them and am starting to get the tests working.

One problem I am facing is that xmpbox seems to leave out all rdf: tags. Are those important for us?

Edit: The rdf information seems to be inserted only upon serialization.

@koppor
Member

koppor commented Apr 6, 2016

Regarding rdf:: This refs #938.

@lenhard
Member Author

lenhard commented Apr 6, 2016

@koppor: Thanks, this provides some context. In this PR, I'll only do the migration to the new pdf library though and not to a new format.

@lenhard
Member Author

lenhard commented Apr 7, 2016

@JabRef/developers I think I have run into a show-stopper when it comes to replacing jempbox with xmpbox.

The problem is that the parser that ships with xmpbox, DomXmpParser, is very strict with namespaces and cannot parse any XMP metadata that contains non-standard namespaces. Needless to say, our JabRef namespaces are not part of the PDF standards... The parser relies on the standard DOM facilities in Java, but it is so well encapsulated that it is impossible to inject additional namespaces in any fashion. The following test illustrates this in a nutshell:

    @Test
    public void testParsing() throws XmpParsingException {
        String testData = "<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?><x:xmpmeta xmlns:x=\"adobe:ns:meta/\">\n" +
                "  <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n" +
                "    <rdf:Description xmlns:dc=\"http://purl.org/dc/elements/1.1/\" rdf:about=\"\">\n" +
                "      <dc:description>\n" +
                "        <rdf:Alt>\n" +
                "          <rdf:li xml:lang=\"x-default\">The success of the Linux operating system has demonstrated the viability of an alternative form of software development � open source software � that challenges traditional assumptions about software markets. Understanding what drives open source developers to participate in open source projects is crucial for assessing the impact of open source software. This article identifies two broad types of motivations that account for their participation in open source projects. The first category includes internal factors such as intrinsic motivation and altruism, and the second category focuses on external rewards such as expected future returns and personal needs. This article also reports the results of a survey administered to open source programmers.</rdf:li>\n" +
                "        </rdf:Alt>\n" +
                "      </dc:description>\n" +
                "      <dc:creator>\n" +
                "        <rdf:Seq>\n" +
                "          <rdf:li>Kelly Clarkson</rdf:li>\n" +
                "          <rdf:li>Ozzy Osbourne</rdf:li>\n" +
                "        </rdf:Seq>\n" +
                "      </dc:creator>\n" +
                "      <dc:relation>\n" +
                "        <rdf:Bag>\n" +
                "          <rdf:li>bibtex/bibtexkey/Clarkson06</rdf:li>\n" +
                "          <rdf:li>bibtex/booktitle/Catch-22</rdf:li>\n" +
                "          <rdf:li>bibtex/journal/International Journal of High Fidelity</rdf:li>\n" +
                "          <rdf:li>bibtex/pdf/YeKis03 - Towards.pdf</rdf:li>\n" +
                "        </rdf:Bag>\n" +
                "      </dc:relation>\n" +
                "      <dc:contributor>\n" +
                "        <rdf:Bag>\n" +
                "          <rdf:li>Huey Duck</rdf:li>\n" +
                "          <rdf:li>Dewey Duck</rdf:li>\n" +
                "          <rdf:li>Louie Duck</rdf:li>\n" +
                "        </rdf:Bag>\n" +
                "      </dc:contributor>\n" +
                "      <dc:subject>\n" +
                "        <rdf:Bag>\n" +
                "          <rdf:li>peanut</rdf:li>\n" +
                "          <rdf:li>butter</rdf:li>\n" +
                "          <rdf:li>jelly</rdf:li>\n" +
                "        </rdf:Bag>\n" +
                "      </dc:subject>\n" +
                "      <dc:title>\n" +
                "        <rdf:Alt>\n" +
                "          <rdf:li xml:lang=\"x-default\">Hypersonic ultra-sound</rdf:li>\n" +
                "        </rdf:Alt>\n" +
                "      </dc:title>\n" +
                "      <dc:date>\n" +
                "        <rdf:Seq>\n" +
                "          <rdf:li>1982-07</rdf:li>\n" +
                "        </rdf:Seq>\n" +
                "      </dc:date>\n" +
                "      <dc:format>application/pdf</dc:format>\n" +
                "      <dc:type>\n" +
                "        <rdf:Bag>\n" +
                "          <rdf:li>InProceedings</rdf:li>\n" +
                "        </rdf:Bag>\n" +
                "      </dc:type>\n" +
                "    </rdf:Description>\n" +
                "    <rdf:Description xmlns:bibtex=\"http://jabref.sourceforge.net/bibteXMP/\" rdf:about=\"\">\n" +
                "      <bibtex:abstract>The success of the Linux operating system has demonstrated the viability of an alternative form of software development � open source software � that challenges traditional assumptions about software markets. Understanding what drives open source developers to participate in open source projects is crucial for assessing the impact of open source software. This article identifies two broad types of motivations that account for their participation in open source projects. The first category includes internal factors such as intrinsic motivation and altruism, and the second category focuses on external rewards such as expected future returns and personal needs. This article also reports the results of a survey administered to open source programmers.</bibtex:abstract>\n" +
                "      <bibtex:author>\n" +
                "        <rdf:Seq>\n" +
                "          <rdf:li>Kelly Clarkson</rdf:li>\n" +
                "          <rdf:li>Ozzy Osbourne</rdf:li>\n" +
                "        </rdf:Seq>\n" +
                "      </bibtex:author>\n" +
                "      <bibtex:bibtexkey>Clarkson06</bibtex:bibtexkey>\n" +
                "      <bibtex:booktitle>Catch-22</bibtex:booktitle>\n" +
                "      <bibtex:editor>\n" +
                "        <rdf:Seq>\n" +
                "          <rdf:li>Huey Duck</rdf:li>\n" +
                "          <rdf:li>Dewey Duck</rdf:li>\n" +
                "          <rdf:li>Louie Duck</rdf:li>\n" +
                "        </rdf:Seq>\n" +
                "      </bibtex:editor>\n" +
                "      <bibtex:journal>International Journal of High Fidelity</bibtex:journal>\n" +
                "      <bibtex:keywords>peanut, butter, jelly</bibtex:keywords>\n" +
                "      <bibtex:month>#jul#</bibtex:month>\n" +
                "      <bibtex:pdf>YeKis03 - Towards.pdf</bibtex:pdf>\n" +
                "      <bibtex:title>Hypersonic ultra-sound</bibtex:title>\n" +
                "      <bibtex:year>1982</bibtex:year>\n" +
                "      <bibtex:entrytype>inproceedings</bibtex:entrytype>\n" +
                "    </rdf:Description>\n" +
                "  </rdf:RDF>\n" +
                "</x:xmpmeta><?xpacket end=\"w\"?>";
        InputStream is = new ByteArrayInputStream(testData.getBytes(StandardCharsets.UTF_8));
        DomXmpParser parser = new DomXmpParser();
        // Fails here: the bibtex namespace is not known to xmpbox (see the exception below)
        XMPMetadata meta = parser.parse(is);
    }

The result is:

   org.apache.xmpbox.xml.XmpParsingException: Cannot find a definition for the namespace http://jabref.sourceforge.net/bibteXMP/
        at org.apache.xmpbox.xml.DomXmpParser.checkPropertyDefinition(DomXmpParser.java:853)
        at org.apache.xmpbox.xml.DomXmpParser.parseChildrenAsProperties(DomXmpParser.java:290)
        at org.apache.xmpbox.xml.DomXmpParser.parseDescriptionRoot(DomXmpParser.java:234)
        at org.apache.xmpbox.xml.DomXmpParser.parse(DomXmpParser.java:198)
        at net.sf.jabref.logic.xmp.XMPUtilTest.testParsing(XMPUtilTest.java:1444)

So unless there is something I did not see, the question is how to proceed. I do not think we should write our own custom XMP parser as long as jempbox still exists. We might be able to update to pdfbox 2.0.0 and keep jempbox, but that needs to be evaluated separately.
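
For illustration only (not part of the PR): judging by the stack trace, the failure appears to come from xmpbox's property-definition check rather than from XML parsing as such. The following is a minimal sketch, assuming nothing beyond the JDK's namespace-aware DOM API, that reads the simple bibtex: properties from a shortened version of the packet above. The names BibtexXmpDomSketch, readSimpleBibtexFields and SAMPLE are hypothetical, and structured values such as rdf:Seq are simply skipped, so this is not a replacement for jempbox.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch; none of these names exist in JabRef or pdfbox.
class BibtexXmpDomSketch {

    // Namespace URI copied from the test data in this thread.
    static final String BIBTEX_NS = "http://jabref.sourceforge.net/bibteXMP/";

    // Shortened stand-in for the XMP packet used in the failing test.
    static final String SAMPLE =
            "<?xpacket begin=\"\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>"
            + "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
            + "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
            + "<rdf:Description xmlns:bibtex=\"" + BIBTEX_NS + "\" rdf:about=\"\">"
            + "<bibtex:bibtexkey>Clarkson06</bibtex:bibtexkey>"
            + "<bibtex:year>1982</bibtex:year>"
            + "<bibtex:author><rdf:Seq><rdf:li>Kelly Clarkson</rdf:li></rdf:Seq></bibtex:author>"
            + "</rdf:Description></rdf:RDF></x:xmpmeta><?xpacket end=\"w\"?>";

    /** Collects the text-only properties in the bibtex namespace, in document order. */
    static Map<String, String> readSimpleBibtexFields(String xmp) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for namespace-based lookups
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmp.getBytes(StandardCharsets.UTF_8)));

            Map<String, String> fields = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagNameNS(BIBTEX_NS, "*");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                // Structured values (rdf:Seq, rdf:Bag, ...) are skipped in this sketch.
                if (!hasChildElements(element)) {
                    fields.put(element.getLocalName(), element.getTextContent().trim());
                }
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XMP packet", e);
        }
    }

    private static boolean hasChildElements(Element element) {
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(readSimpleBibtexFields(SAMPLE));
    }
}
```

The point of the sketch is only that a custom namespace is no obstacle at the DOM level. Mapping the result onto typed XMP properties, and writing packets back, is the part that jempbox currently does for us.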

@Siedlerchr
Member

From what I see, we are not the only ones who have problems with the xmpbox DomXmpParser.
Maybe you could ask on the pdfbox mailing list whether there is a way to get it done.

@koppor
Member

koppor commented Apr 8, 2016

+1 for asking at the mailing list. Or report an issue at https://issues.apache.org/jira/browse/PDFBOX/. Others seemed to have had issues too: https://issues.apache.org/jira/browse/PDFBOX-2416.

Are we sure that old JabRef versions wrote the correct XMP data? 😇

Do we really need that XMP thing? Shouldn't we replace it with something else in the long term? See #938 (comment)

I cannot really judge now, because I have too little knowledge about this metadata thing in PDFs.

@lenhard
Member Author

lenhard commented Apr 8, 2016

Ok, I will ask on the mailing list, but I get the feeling that the developers of pdfbox switched to xmpbox because they want strict parsing (i.e., rejecting non-standard extensions to XMP metadata).

Regarding the relevance of the XMP feature, I really have no clue. I am not using it and do not know anyone who does. If we do not need it, I would be very happy to throw it away. Is there any chance to find someone who knows and uses the feature and can shed some light on this?

We could disable it for v3.3 and wait until someone complains ;-)

@lenhard lenhard mentioned this pull request Apr 8, 2016
@lenhard
Member Author

lenhard commented Apr 12, 2016

And here is the reply from the pdfbox mailing list:

This is a known problem, yes xmpbox does not support custom namespaces,
this was noticed too late (xmpbox is closely related to preflight, which
checks for PDF/A). It is on the list of things to discuss for 2.1

"- discussion/decision on XMP (shall we enhance XMPBox, restore Jempbox,
base on Adobe's XMP library, join forces with the FOP project …)"

Until then, the workaround is to keep using jempbox.

So that pretty much says it. For now, we cannot switch to xmpbox. I'd suggest leaving this PR open until there is a new release of xmpbox.

@lenhard lenhard changed the title from "[WIP] Update pdfbox to 2.0.0 and migrate from jempbox to xmpbox" to "Update pdfbox to 2.0.0 and migrate from jempbox to xmpbox" Apr 12, 2016
@koppor
Member

koppor commented Apr 12, 2016

I am just closing this PR. We will find it again when querying for on-hold issues.

@tobiasdiez
Member

@lenhard
Member Author

lenhard commented May 20, 2016

There is no change, really. We cannot use the most recent version of pdfbox, so our options are:

  1. Completely rewrite XMP handling with a different library
  2. Encode everything into the Dublin Core schema instead of a BibTeX schema
  3. Wait and see if custom schemas are re-enabled with pdfbox 2.1

Currently, we are going for option 3. However, that might be a long wait. "Long" as in "years".
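
Option 2 is already hinted at by the test data earlier in this thread, where dc:relation carries entries such as bibtex/bibtexkey/Clarkson06. Below is a minimal sketch of that convention, assuming a flat field-to-value map and the bibtex/<field>/<value> layout seen there. The class and method names are hypothetical and no pdfbox or xmpbox API is involved; it only covers the string encoding, not writing the values into a Dublin Core schema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "bibtex/<field>/<value>" convention from the test data.
class DublinCoreRelationSketch {

    static final String PREFIX = "bibtex/";

    /** Turns BibTeX fields into strings suitable for a dc:relation bag. */
    static List<String> encode(Map<String, String> fields) {
        List<String> relations = new ArrayList<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            relations.add(PREFIX + field.getKey() + "/" + field.getValue());
        }
        return relations;
    }

    /** Recovers BibTeX fields from dc:relation strings, ignoring unrelated entries. */
    static Map<String, String> decode(List<String> relations) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String relation : relations) {
            if (!relation.startsWith(PREFIX)) {
                continue; // not written by this convention
            }
            String rest = relation.substring(PREFIX.length());
            // Split at the first slash only, so values may contain slashes themselves.
            int slash = rest.indexOf('/');
            if (slash <= 0) {
                continue; // malformed: missing field name or separator
            }
            fields.put(rest.substring(0, slash), rest.substring(slash + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("bibtexkey", "Clarkson06");
        fields.put("journal", "International Journal of High Fidelity");
        List<String> relations = encode(fields);
        System.out.println(relations);
        System.out.println(decode(relations));
    }
}
```

Because the split happens at the first slash after the prefix, field names must not contain a slash, while values (file paths, URLs) may. Multi-valued fields such as author would more naturally map to dc:creator, as in the test data, rather than to dc:relation.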

@koppor
Member

koppor commented May 21, 2016

👍 for Dublin Core. Seems to be the best option.

@lenhard lenhard mentioned this pull request Oct 6, 2016
14 tasks
@koppor koppor mentioned this pull request Dec 12, 2016
7 tasks
@tobiasdiez tobiasdiez mentioned this pull request Mar 30, 2017
6 tasks
@koppor koppor mentioned this pull request Sep 18, 2017
johannes-manner added a commit to johannes-manner/jabref that referenced this pull request Feb 12, 2018
Update pdfbox and fontbox from 1.8.13 to 2.0.8 and migrate from jempbox to xmpbox.  See pull JabRef#1096.

Next step: Writing test cases for XMPUtil (DublinCore).
@stefan-kolb stefan-kolb deleted the update-pdfbox branch February 17, 2018 10:12
koppor pushed a commit that referenced this pull request Feb 20, 2018
This fixes #938

- Reading and writing multiple DublinCore entries works: XMPUtilWriter supports multiple metadata entries in DublinCore and a single entry in the PDDocumentInformation. If you want to test the reading of multiple entries, the PDF file JabRef_multipleMetaEntries.pdf contains three metadata entries in DublinCore for testing locally.
- Removed too much code when refactoring the XMPUtil. Non-XMP metadata is also relevant when retrieving org.apache.pdfbox.pdmodel.PDDocumentInformation
- Update pdfbox and fontbox from 1.8.13 to 2.0.8 and migrate from jempbox to xmpbox.  See pull #1096.
- Refactor extraction from DublinCoreSchema
- The tests cover the most important use cases, which include reading and writing metadata from PDF files. Both formats, DublinCore and PDMetadata (which is not XMP metadata), are tested.
- Separated XMPUtils into a reader and a writer utility class.
- Add meaningful names in DublinCoreExtractor and use StringUtils.isNullOrEmpty
- Log exception in XMPUtilShared
@koppor koppor removed the freeze label May 11, 2018
@koppor
Member

koppor commented May 11, 2018

This was the basis for #3710, so this is integrated and not a freeze anymore.

@lenhard
Member Author

lenhard commented May 12, 2018

I am glad to hear that my work was of some use in the end :)
