Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Dublin Core #3710

Merged
merged 17 commits into from
Feb 20, 2018
Merged

[WIP] Dublin Core #3710

merged 17 commits into from
Feb 20, 2018

Conversation

johannes-manner
Copy link
Collaborator

@johannes-manner johannes-manner commented Feb 7, 2018

Fixes #938 : Only supporting Dublin Core

Removed JabRef namespace for XMP 'xmlns:bibtex'.
Removed some unused methods in gui/importer/EntryFromPDFCreator.java (discussed with @koppor and @stefan-kolb)
Deleted tests for XMPUtil. (New ones are in progress).


  • Change in CHANGELOG.md described
  • Tests created for changes
  • Screenshots added (for bigger UI changes)
  • Manually tested changed features in running JabRef
  • Check documentation status (Issue created for outdated help page at help.jabref.org?)

PdfXmpImporterTest:testImportEntries()
PdfXmpImporterTest:testIsRecognizedFormat()

Removed to much code when refactoring the XMPUtil. Non XMP metadata are also relevent, when retrieving org.apache.pdfbox.pdmodel.PDDocumentInformation
Update pdfbox and fontbox from 1.8.13 to 2.0.8 and migritate from jempbox to xmpbox.  See pull JabRef#1096.

Next step: Writing test cases for XMPUtil (DublinCore).
build.gradle Outdated
compile 'org.apache.pdfbox:pdfbox:1.8.13'
compile 'org.apache.pdfbox:fontbox:1.8.13'
compile 'org.apache.pdfbox:jempbox:1.8.13'
// update possible because custom metadata format was dropped, https://github.com/JabRef/jabref/pull/3710 - https://github.com/JabRef/jabref/pull/1096
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just remove that line

Travis build failed because of an exceeded NPath complexity of a single method. Method was refactored and cleaned up.

BibEntry entry = new BibEntry();

// Contributor -> Editor
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not comment these functions as the implementation might change... The comments could co at setEditor etc. as method comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will change this with the next commit.

if (entries.isEmpty()) {
System.err.println("Could not find BibEntry in " + args[0]);
} else {
XMPUtil.writeXMP(new File(args[1]), entries, result.getDatabase(), false, xmpPreferences);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this also be either a Path?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The parameter is a file or sometimes also the filePath as a string is possible.

}

if (args[0].endsWith(".bib") && args[1].endsWith(".pdf")) {
ParserResult result = new BibtexParser(importFormatPreferences, Globals.getFileUpdateMonitor()).parse(new FileReader(args[0]));
try (FileReader reader = new FileReader(args[0])) {
ParserResult result = new BibtexParser(importFormatPreferences, Globals.getFileUpdateMonitor()).parse(reader);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, but does the parser accepts an inputStream? Then you could use Files.newInputStream...
would be actually preferable over a fileReader

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No overloaded parse method, which would accept an InputStream.

The tests cover the most important use cases, which include reading and writing metadata from pdf files. Both formats, DublinCore and PDMetadata (which are no XMP metadata) are tested.
@johannes-manner johannes-manner changed the title [WIP] Dublin Core Dublin Core Feb 15, 2018
@johannes-manner
Copy link
Collaborator Author

This pull request is ready for the final review and for merging into master :)

@tobiasdiez tobiasdiez added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Feb 15, 2018
Copy link
Member

@tobiasdiez tobiasdiez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work. Finally, something is moving concerning pdf support 🥇 .

The code looks pretty good and I only have a bunch of minor remarks.

throw new EncryptedPdfsNotSupportedException();
}
}
public static PDDocument loadWithAutomaticDecryption(File file) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use Path as long as possible (i.e. until we use an interface that we cannot control and which expects a traditional File). This remark applies to this method and a few other places.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

BibDatabase database, XMPPreferences xmpPreferences) throws IOException, TransformerException {
XMPUtil.writeXMP(new File(fileName), entry, database, xmpPreferences);
BibDatabase database, XMPPreferences xmpPreferences) throws IOException, TransformerException {
XMPUtil.writeXMP(Paths.get(fileName), entry, database, xmpPreferences);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you are already touching most of the code in this class, it would be nice if you could rework the whole class from a collection of static helper methods to a "normal" class with instance methods. It appears that the file is passed to every single method so this is a natural candidate for a constructor argument. I'm not sure how much code is shared between writing and reading of xmp, but maybe it makes sense to split the class in an XmpReader and XmpWriter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to separete the logic in two utility classes and a shared utitlity class.
Currently I don't see the benefit of getting "normal" reader and writer. I introduced two Extractors for the DocumentInformation and the DublinCore format. Maybe that's what you have intended.

I pushed a WIP state for another review. Maybe you have further suggestions, how to structure the package.


if (entry.isPresent()) {
if (entry.get().getType() == null) {
entry.get().setType(BibEntry.DEFAULT_TYPE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please move the part, where the type is set to a default, to the getBibtexEntry method.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}
public void testReadEmtpyMetadata() throws IOException, URISyntaxException {
List<BibEntry> entries = XMPUtil.readXMP(XMPUtil.class.getResource("/org/jabref/logic/xmp/empty_metadata.pdf").toURI().getPath(), xmpPreferences);
Assert.assertEquals(0, entries.size());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertEquals(Collections.emptyList, entries) to get a better error message in case the test fails.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"Patterson, David and Arvind and Asanov\\'\\i{}c, Krste and Chiou, Derek and Hoe, James and Kozyrakis, Christos and Lu, S{hih-Lien} and Oskin, Mark and Rabaey, Jan and Wawrzynek, John");
// read a bib entry from the tests before
List<BibEntry> entries = XMPUtil.readXMP(XMPUtil.class.getResource("/org/jabref/logic/xmp/PD_metadata.pdf").toURI().getPath(), xmpPreferences);
BibEntry entry = entries.get(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is easier if you just create the entry by hand (and then write it, read and compare).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no further reason (besides simplicity), I would stay with this.

* Make sure that the privacy filter works.
* The month attribute in DublinCore is the complete name of the month, e.g. March.
* In JabRef, the format is #mar# instead. To get a working unit test, the JabRef's
* bib-entry is altered from #mar# to {March}.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why the month shouldn't be post-processed to get a proper bibtex value? We even have the Month class which handles parsing and converting to the correct output.

compile 'org.apache.pdfbox:fontbox:1.8.13'
compile 'org.apache.pdfbox:jempbox:1.8.13'
compile 'org.apache.pdfbox:pdfbox:2.0.8'
compile 'org.apache.pdfbox:fontbox:2.0.8'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further below (starting at line 218), we added exceptions for the update dependencies task. These are now invalid and should be removed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE


private void extractSubject() {
String s = documentInformation.getSubject();
if (s != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We once agreed on avoiding single char variables. Please use a meaningful name here and for the others

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


private void extractOtherFields() {
COSDictionary dict = documentInformation.getCOSObject();
for (Map.Entry<COSName, COSBase> o : dict.entrySet()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could encapsulate part of this as a stream. at least the filtering and mapping to the key.
https://www.mkyong.com/java8/java-8-filter-a-map-examples/

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would stay with the implemented version. Currently it works and I'm not a stream fanatic 😋

*/
private void extractAuthor() {
List<String> creators = dcSchema.getCreators();
if ((creators != null) && !creators.isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have a method in our own StringUtil class which combines both checks: IsNullOREmpty

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with 366aeed. Please check the conditions again. It's not that natural to negate the isNullOrEmpty... Maybe another function would help in the StringUtils.

@johannes-manner
Copy link
Collaborator Author

johannes-manner commented Feb 16, 2018

For my general understanding and my newbie state:

 public static void writeXMP(Path path,
            Collection<BibEntry> bibtexEntries, BibDatabase database,
            boolean writePDFInfo, XMPPreferences xmpPreferences) throws IOException, TransformerException {

        Collection<BibEntry> resolvedEntries;
        if (database == null) {
            resolvedEntries = bibtexEntries;
        } else {
            resolvedEntries = database.resolveForStrings(bibtexEntries, false);
        }

        try (PDDocument document = PDDocument.load(path.toFile())) {
            if (document.isEncrypted()) {
                throw new EncryptedPdfsNotSupportedException();
            }

            if (writePDFInfo && (resolvedEntries.size() == 1)) {                // 1
                XMPUtilWriter.writeDocumentInformation(document, resolvedEntries
                        .iterator()
                        .next(), null, xmpPreferences);
                XMPUtilWriter.writeDublinCore(document, resolvedEntries, null, xmpPreferences);
            }

            PDDocumentCatalog catalog = document.getDocumentCatalog();
            PDMetadata metaRaw = catalog.getMetadata();`

For me, it makes no sense to write more than one BibEntry to the metadata of a PDF file. Currently the implementation in XMPUtilWriter is also limited to a single element, but implemented as a list. (See line 260 and 110).

Can I drop the list?

@Siedlerchr
@koppor
@tobiasdiez

List<String> dates = dcSchema.getUnqualifiedSequenceValueList("date");
if ((dates != null) && !dates.isEmpty()) {
String date = dates.get(0).trim();
Calendar c = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please stick to the java 8 new date and time api: Use a DateTime or just a DateFormatter
DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm");
LocalDateTime dateTime = LocalDateTime.parse(str, formatter);

*/
private void extractBibTexFields() {
List<String> relationships = dcSchema.getRelations();
if (relationships != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use a stream with filter and map

*/
private void extractType() {
List<String> l = dcSchema.getTypes();
if ((l != null) && !l.isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid single char vairbales

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with 366aeed

continue;
}

if (FieldName.EDITOR.equals(field.getKey())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer a switch/case here, but that's just my style. The others don't like them that much ;) So you can change it or leave it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I leave it :)

try (PDDocument document = loadWithAutomaticDecryption(path)) {
Optional<XMPMetadata> meta = XMPUtilReader.getXMPMetadata(document);

if (meta.isPresent()) {
Copy link
Member

@Siedlerchr Siedlerchr Feb 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use chained maps on the optionals for each step:
This should work
meta.map(m->m.getDublinCore().map(m->new DublinCore...).map(entry->extractBibEntry).ifPresent(result::add)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the current state more readable than the chained map operations... Do you have another suggestion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chanining has the improvement that you don't have to care about nulls or empty optionals inside. The code at the end is only executed if none of the values is null or empty.

Did not see your question. Theoretically it could be that someone has CrossRef linked bibentries, (e.g. a bib entry with a book and another one with a chapter referring the book).
but that's really and edge case. I think for most uses cases one entry is enough. And if the code doesn't support it, then drop it. let's see what @koppor has to say

LOGGER.info("Encryption not supported by XMPUtil");
return false;
} catch (IOException e) {
// happens if no metadata is found, no reason to log the exception
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please always log exceptions, at least add a debug. Just to make sure that not any other underlying exception is swallowed.

Copy link
Collaborator Author

@johannes-manner johannes-manner Feb 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks for the comment 👍

*/
public static void writeXMP(Path file, BibEntry entry,
BibDatabase database, XMPPreferences xmpPreferences) throws IOException, TransformerException {
List<BibEntry> l = new LinkedList<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reason for a linked list? In most cases ArrayList is sufficient

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a question, I posted above. This is code from a previous author and I want to drop the List at all. Please comment my question above.

Copy link
Member

@Siedlerchr Siedlerchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general the code looks good, just some code style improvements.

@johannes-manner johannes-manner changed the title Dublin Core [WIP] Dublin Core Feb 16, 2018
@tobiasdiez
Copy link
Member

I have no real idea concerning your question about the list. I would say you could safely remove it, but I'm not sure. What happens if multiple entries have the same file linked and write their metadata into it (e.g. the pdf is a book and I have a bunch of bookchapters as separate entries)?

@johannes-manner
Copy link
Collaborator Author

johannes-manner commented Feb 16, 2018

XMPUtilWriter, line 147,148

Before new metadata is written to the PDF, all DublinCoreSchemas are removed. So I think, that there is currently no option to write more than one metaschema.
In your case, I would assume that the last written data is visible (others lost).

@Siedlerchr
Copy link
Member

Well, does the Dublin core schema specify. Multiple entries?

@koppor
Copy link
Member

koppor commented Feb 17, 2018

Specifications are available at. http://dublincore.org/specifications/. The abstract model is there: http://dublincore.org/documents/abstract-model/

grafik

In case, I interpret that correctly, one record contains one description, which may contain multiple record sets.

This is the edge case, where one wants to write XMP to a proceedings.

Example:

XMPUtilWriter supports mutliple metadata entries in dublinCore and a single entry in the PDDocumentInformation.
" urldate = {2017-05-31},\r\n" +
"}";

private static final String vapnik2000 = "@Book{Vapnik2000,\r\n" +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create the BibEntries by hand (using new BibEntry(), setField) and not based on the string representation. The XMP test should be as autonomous as possible, especially they shouldn't fail if the BibParser is changed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE

Reading mulitple BibEntries in DublinCore format now also works. If you want to test the reading of multiple entries, the PDF file JabRef_multipleMetaEntries.pdf contains three metadata entries in DublinCore for testing locally.
@@ -30,7 +35,7 @@ private XMPUtilReader() {
* @param path The path to read the XMPMetadata from.
* @return The XMPMetadata object found in the file
*/
public static Optional<XMPMetadata> readRawXMP(Path path) throws IOException {
public static Optional<List<XMPMetadata>> readRawXMP(Path path) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A List should always be non-null and thus it does not makes sense to wrap an Optional around it. The not-present case corresponds to an empty list, which you can check using isEmpty().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this comment. Done 👍

@johannes-manner
Copy link
Collaborator Author

I considered all comments to the source code and refactored my code.

The last commit alters the behavior of the XMP import:
The previous implementation only imports the first entry and drops the others. The current implementation imports all metadata entries. My thoughts are as follows: It is easier to import additional entries and delete not needed ones compared to importing single entries by hand, if the needed entry is not the first one.

Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Micro comments at the test cases.

@Test
public void testReadArticleDublinCoreReadXMP() throws IOException, URISyntaxException, ParseException {

Path path = Paths.get(XMPUtilShared.class.getResource("/org/jabref/logic/xmp/article_dublinCore.pdf").toURI());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does one really need the absolute path here? I thought, just the filename is enough, because it looks up in the current directory (where the resources are mirrored to). If this is the package org.jabref.logic.xmp, it should "just work".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I do not need the absolute path 👍

*/
@Test
public void testReadArticleDublinCoreReadXMP() throws IOException, URISyntaxException, ParseException {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove empty line

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* The month attribute in DublinCore is the complete name of the month, e.g. March.
* In JabRef, the format is #mar# instead. To get a working unit test, the JabRef's
* bib-entry is altered from #mar# to {March}.
* <p/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this and the next line. The next line should be included in the method name. Maybe, the method has to renamed to testReadArticleDublinCoreReadRawXMP?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course 👍

* <p/>
* Tests the readRawXMP - method
*
* @throws IOException
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove empty @throws comments. It is already clear by the method signature which exceptions are thrown.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -0,0 +1,137 @@
package org.jabref.logic.xmp;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one empty line (?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -29,7 +29,7 @@
import org.jabref.logic.l10n.Localization;
import org.jabref.logic.util.UpdateField;
import org.jabref.logic.util.io.FileUtil;
import org.jabref.logic.xmp.XMPUtil;
import org.jabref.logic.xmp.XMPUtilShared;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename the class to XmpUtilShared to follow Google's Casing rules? (Similar for method names, ...)

Renamed (hopefully) all occurences of XMP to Xmp to stay with the  Google's Casing rules.
Copy link
Member

@tobiasdiez tobiasdiez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks a lot for attacking such a huge task right as one of your first projects. Looking forward to see more PRs from you in the future.

@koppor koppor merged commit ea8ccb3 into JabRef:master Feb 20, 2018
@koppor koppor removed the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Feb 20, 2018
Siedlerchr added a commit that referenced this pull request Feb 24, 2018
* upstream/master: (94 commits)
  Add missing localization for Any file
  Refactor dublin core utility (#3756)
  Add Localization
  Update Architecture Tests to catch static imports (#3766)
  Added <any file type> to the Import File Filter Dialog.
  Don't trim when migrating review field (#3761)
  Reorder again
  Rename confirmation into "Merge fields"
  Fix logic
  Reorder checklist in PR template and add "good commit message"
  Replace x11 by unity7
  Include desktop, desktop-legacy, wayland in snapcraft.yaml
  Improve Dublin Core (#3710)
  Incorporate suggestions by @Siedlerchr
  Update JUnit from 5.1.0-M2 -> 5.1.0
  Update Mockito from 2.13.0 -> 2.15.0
  Update wiremock from 2.14.0 -> 2.15.0
  Fix exceptions for jacoco
  update gradle
  Add link to contribute.jabref.org (#3748)
  ...

# Conflicts:
#	src/main/java/org/jabref/gui/JabRefFrame.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants