Added option to import CFF files #7946

AidanM11 · 2021-07-29T22:22:46Z

Added ability to import CFF files as requested in #7945. As per CFF specifications, there is only one entry per CFF file, and CFF entries are always of the type "Software". JabRef does not support all the fields that CFF does, so a fair amount of information could be lost on import (specifically details about the authors beyond their names). I still need to test this more thoroughly and add automated tests, so I've marked this as a draft. I also plan on adding export support.

fixes #7945

Images:

Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
Tests created for changes (if applicable)
Manually tested changed features in running JabRef (always required)
Screenshots added in PR description (for UI changes)
Checked documentation: Is the information available and up to date? If not created an issue at https://github.com/JabRef/user-documentation/issues or, even better, submitted a pull request to the documentation repository.

Siedlerchr · 2021-07-30T05:44:55Z

Thanks for tackling this issue. For the Authors take a look at Jabrefs Author List and Authors parser. Maybe you can use that.
For fields that do not map to standard fields, you could import/add them with a prefix

Siedlerchr · 2021-07-30T07:53:00Z

For the java classes from the yaml, you can generate them from https://www.jsonschema2pojo.org/

Siedlerchr · 2021-08-16T13:24:48Z

@AidanM11 What is the status here? It would be really cool if you could continue working on this

AidanM11 · 2021-08-16T23:01:48Z

Sorry about the delay, had a busy couple weeks. I've added tests and made some minor changes to the code. As of now, the importer can parse and map most of the fields CFF supports. The main exception is the person/entity system from CFF. In CFF, the authors and contact fields are lists of Person or Entity objects, which each have their own extensive lists of fields for all kinds of information. It seems like it would be very difficult to fit that system into the way jabref handles fields, so instead I set it up so that the importer just extracts the names of each object in the 'authors' list and concatenates them into an author string. In this case, the import operation seems unavoidably lossy.

Similarly, CFF supports a list of 'identifiers' of different types (url, swh, other). JabRef does have support for these fields, but the CFF specs appear allow the user to add as many of these identifiers as they like (I'm not entirely sure about this, but it seems like it). So a single entry could have multiple SWH identifiers, which seems tricky to work into the JabRef system.

However, there are also some 'static' fields in CFF that don't have JabRef counterparts (specifically: commit, license-url, repository-code, repository-artifact). You mentioned earlier that if CFF fields don't map to JabRef standard fields, I could "import/add them with a prefix". I wasn't sure what that meant, and was unable to find any examples of it in the codebase, so I was wondering if I could get a little more clarification on that.

So, overall, the importer is nearly finished, I just need to work out what to do with identifiers (maybe just use the first one, or do a comma-separated list?) and also what to do about the 'static' fields without mappings.

Siedlerchr · 2021-08-17T13:38:11Z

src/main/java/org/jabref/logic/importer/fileformat/CffImporter.java

+        hm.put("abstract", StandardField.ABSTRACT);
+        hm.put("message", StandardField.COMMENT);
+        hm.put("date-released", StandardField.DATE);
+


For fields that can't be mapped, you can simply create a new UnkownField: (example is taken from the MSBibImporter)
And set CFF_PREFIX to sth ilke cff-
new UnknownField(CFF_PREFIX + "medium"), "Medium");

JabRef can show non-standard fields in the entry editor as well

I think there should be a better solution than prefixing those fields, since none of the citation styles would pick them up - but you might want to have the repository printed in the reference list. I've started a discussion at the biblatex repo how to handle those fields: plk/biblatex#1169

Siedlerchr · 2021-08-17T13:44:15Z

src/main/java/org/jabref/logic/importer/fileformat/CffImporter.java

+            String aAlias = auth.vals.get("alias");
+
+            if (aName != null) {
+                authors.add(aName);


JabRef has a class called Author which will handle the construction of authors, see org.jabref.model.entry
public Author(String first, String firstabbr, String von, String last, String jr)
Afterwards, you put in an AuthorLIst and return the and separated value
See for example the Medra-Fetcher (org.jabref.logic.importer.fetcher)

Apart from String firstabbr, CFF has the same fields for person type objects (e.g., authors). The main difference between these and the entity type objects (for this use case at least) is that entities only have a single field name instead, which I guess could be imported as is?

first -> given-names

von -> name-particle

last -> family-names

jr -> name-suffix

Yep, that seems to be correct, for reference:

jabref/src/main/java/org/jabref/model/entry/Author.java

Lines 22 to 31 in cef4151

/**

* Creates the Author object. If any part of the name is absent, <CODE>null</CODE> must be passed; otherwise other methods may return erroneous results.

*

* @param first the first name of the author (may consist of several tokens, like "Charles Louis Xavier Joseph" in "Charles Louis Xavier Joseph de la Vall{\'e}e Poussin")

* @param firstabbr the abbreviated first name of the author (may consist of several tokens, like "C. L. X. J." in "Charles Louis Xavier Joseph de la Vall{\'e}e Poussin"). It is a responsibility of the caller to create a reasonable abbreviation of the first name.

* @param von the von part of the author's name (may consist of several tokens, like "de la" in "Charles Louis Xavier Joseph de la Vall{\'e}e Poussin")

* @param last the last name of the author (may consist of several tokens, like "Vall{\'e}e Poussin" in "Charles Louis Xavier Joseph de la Vall{\'e}e Poussin")

* @param jr the junior part of the author's name (may consist of several tokens, like "Jr. III" in "Smith, Jr. III, John")

*/

public Author(String first, String firstabbr, String von, String last, String jr) {

sdruskat · 2021-08-20T07:35:19Z

Hi 👋, CFF co-lead here. It's really great to see this work happening for JabRef!

As per CFF specifications, there is only one entry per CFF file, and CFF entries are always of the type "Software".

FYI: as of version 1.2.0, CFF also supports datasets experimentally (type: dataset).

sdruskat · 2021-08-20T07:44:33Z

Similarly, CFF supports a list of 'identifiers' of different types (url, swh, other). JabRef does have support for these fields, but the CFF specs appear allow the user to add as many of these identifiers as they like (I'm not entirely sure about this, but it seems like it). So a single entry could have multiple SWH identifiers, which seems tricky to work into the JabRef system.

Yes, people can supply several identifiers for a software (which reflects reality).
Perhaps you could take a hierarchical approach, checking for some things first, then others, etc., and pick one entry based on a heuristic? E.g.,

CFF has root level doi field or exactly one doi entry in the identifiers section: take that
CFF has no DOIs, but has exactly one swh entry in the identifiers: take that
CFF has more than one swh-type identifier, including one with the rel type: take that
etc.

Happy to help with such a heuristic which I think should ideally be based on the software citation principles as much as possible.

However, there are also some 'static' fields in CFF that don't have JabRef counterparts (specifically: commit, license-url, repository-code, repository-artifact). You mentioned earlier that if CFF fields don't map to JabRef standard fields, I could "import/add them with a prefix". I wasn't sure what that meant, and was unable to find any examples of it in the codebase, so I was wondering if I could get a little more clarification on that.

In terms of what the software citation principles suggest, repository-code will contain a value that's potentially more important to keep than the others, and may even be used to override a generic url field. But that's semantics which I'm not sure you want to have in JabRef?

AidanM11 · 2021-08-21T00:22:41Z

Alright, so I've changed the author handling to incorporate the JabRef Author system, as @Siedlerchr suggested, and it seems to be working well.
For the identifiers, I've implemented something similar to the heuristic @sdruskat suggested. As JabRef has separate fields for DOIs and SWHIDs, they have separate heuristics. Currently, they are:

For DOI:

If there is a root-level DOI, use that
If there is a single DOI in the identifiers list, use that
Otherwise, do not use any DOI.

For SWHID

If there is a single SWHID, use that.
If there are multiple SWHIDs, but only one with the "rel" tag, use that one.
Otherwise, do not use any SWHID.

As you can see, when there are multiple identifiers of the same type, the current system just discards them. This doesn't seem ideal, but as the CFF file does not contain any other information that could differentiate those identifiers, it seems like the only alternative is to choose a random one, and that seems like a bad option. I guess it could also put them all into the field in a comma-separated list, but that seems like it goes against the intended use of those fields.

Anyway, besides that, I've still got a small list of things to do before this pull request is finished (updated tests, handling fields without mappings, dataset type support, etc) so I'll keep working on it. Just wanted to make sure this identifier handling is satisfactory.

Siedlerchr · 2021-08-21T20:31:28Z

I think this sounds good for the heuristic, DOI is always a good/valid option. Provides the easiest way to get any addtional metadata if needed.
I also added a devcall label, so that we @JabRef/developers can discuss this in detail in the next devcall if we come up with another idea or addtional considerations. But your approach so far sounds good

jspaaks · 2021-08-24T11:40:23Z

Thanks @AidanM11 for doing this work! I like your heuristics. Perhaps one thing to consider is that it's relatively common to have two DOIs in identifiers: one for the version of the software, and one for the concept / collection / all versions. For such cases, neither would make it through if there isn't a root doi, right? I guess that is by design, since it is a bit of an undecidable problem. Anyway, an alternative could be to pick the first one and use that, something like identifiers.filter(type=doi)[0]. I understand though if you don't like to make this arbitrary choice. I've made a suggestion over at Citation File Format to introduce some kind of qualifier like datacite's isVersionOf, isCitedBy, etc in order to help tooling decide which element of identifiers is most appropriate for their purposes.

src/test/java/org/jabref/logic/importer/fileformat/CffImporterTest.java

Siedlerchr · 2021-08-28T11:02:07Z

Thanks for the work, can you please merge in the latest upstream main branch of JabRef and resolve the conflicts?

src/main/java/org/jabref/logic/importer/fileformat/CffImporter.java

Siedlerchr · 2021-08-30T12:38:29Z

src/main/java/org/jabref/logic/importer/fileformat/CffImporter.java

+                List<String> relSwhIds = swhIds.stream()
+                                               .filter(id -> id.split(":").length > 3) // quick filter for invalid swhids
+                                               .filter(id -> id.split(":")[2].equals("rel"))
+                                               .collect(Collectors.toList());


and here you can also use findFirst again as above

src/main/java/org/jabref/logic/importer/fileformat/CffImporter.java

Siedlerchr

Overall looks good to me and thanks for the tests! I have just some minor improvements regarding the code style.

AidanM11 · 2021-09-01T04:39:12Z

@Siedlerchr I've made the requested changes. Thanks for the review!

Siedlerchr

Thanks, lgtm!

calixtus

Fixed some minor style issues. Otherwise LGTM, thanks for making JabRef better.
As soon as tests are green, we're merging.

* upstream/main: Enable copy&paste for IntelliJ run configuration GitBook: [main] one page modified GitBook: [main] one page modified GitBook: [main] 19 pages and 6 assets modified Refine documentation Oobranch f : add frontend (#7791) Observable Preferences C (General) (#8047) Welcome btut 🎉 Observable preferences B (ProtectedTerms, EntryEditor and MrDlib) (#8046) Refine howto (#8049) Decision on handling of localized preferences. Added option to import CFF files (#7946) Observable preferences A (Appearance and Importer) (#8041) Squashed 'buildres/csl/csl-locales/' changes from ec6d62a9e7..7a507fc008 Squashed 'buildres/csl/csl-styles/' changes from ec4a4c0..2b37392

Added option to import CFF files

32f7297

Added tests, also minor fixes

4ddc414

Siedlerchr reviewed Aug 17, 2021

View reviewed changes

tobiasdiez mentioned this pull request Aug 19, 2021

Add fields for @software based on CFF plk/biblatex#1169

Open

Siedlerchr added the status: devcall label Aug 20, 2021

Improved author handling, added identifier support

119c943

jspaaks mentioned this pull request Aug 24, 2021

Introduce parent-doi citation-file-format/citation-file-format#69

Open

Support for unmapped fields and dataset type, also more tests

b9d3da4

Siedlerchr reviewed Aug 25, 2021

View reviewed changes

src/test/java/org/jabref/logic/importer/fileformat/CffImporterTest.java Outdated Show resolved Hide resolved

sdruskat mentioned this pull request Aug 26, 2021

Define heuristics for identifiers and the way that downstream tools should handle them citation-file-format/citation-file-format#330

Open

Simplified tests, minor fixes

44dd028

AidanM11 marked this pull request as ready for review August 28, 2021 07:16

CHANGELOG.md update

78eb210

Merge branch 'main' into addCffImporter

f4bb7e4

Siedlerchr reviewed Aug 30, 2021

View reviewed changes

src/main/java/org/jabref/logic/importer/fileformat/CffImporter.java Outdated Show resolved Hide resolved

Siedlerchr reviewed Aug 30, 2021

View reviewed changes

src/main/java/org/jabref/logic/importer/fileformat/CffImporter.java Outdated Show resolved Hide resolved

Siedlerchr reviewed Aug 30, 2021

View reviewed changes

src/main/java/org/jabref/logic/importer/fileformat/CffImporter.java Outdated Show resolved Hide resolved

Siedlerchr reviewed Aug 30, 2021

View reviewed changes

src/main/java/org/jabref/logic/importer/fileformat/CffImporter.java Outdated Show resolved Hide resolved

Siedlerchr requested changes Aug 30, 2021

View reviewed changes

Siedlerchr added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Aug 30, 2021

code style fixes

33329c8

Siedlerchr approved these changes Sep 1, 2021

View reviewed changes

Fixed minor style issues

4a9d13d

calixtus approved these changes Sep 1, 2021

View reviewed changes

calixtus merged commit 9f0430d into JabRef:main Sep 1, 2021

calixtus removed the status: devcall label Sep 13, 2021

koppor mentioned this pull request Mar 3, 2024

Crafting round-trip test for Cff importer/exporter #10957

Closed

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added option to import CFF files #7946

Added option to import CFF files #7946

AidanM11 commented Jul 29, 2021 •

edited by Siedlerchr

Loading

Siedlerchr commented Jul 30, 2021

Siedlerchr commented Jul 30, 2021

Siedlerchr commented Aug 16, 2021

AidanM11 commented Aug 16, 2021

Siedlerchr Aug 17, 2021

tobiasdiez Aug 19, 2021

Siedlerchr Aug 17, 2021

sdruskat Aug 20, 2021

jspaaks Aug 24, 2021

Siedlerchr Aug 24, 2021

sdruskat commented Aug 20, 2021

sdruskat commented Aug 20, 2021

AidanM11 commented Aug 21, 2021

Siedlerchr commented Aug 21, 2021

jspaaks commented Aug 24, 2021

Siedlerchr commented Aug 28, 2021

Siedlerchr Aug 30, 2021

Siedlerchr left a comment

AidanM11 commented Sep 1, 2021

Siedlerchr left a comment

calixtus left a comment

	/**
	* Creates the Author object. If any part of the name is absent, <CODE>null</CODE> must be passed; otherwise other methods may return erroneous results.
	*
	* @param first the first name of the author (may consist of several tokens, like "Charles Louis Xavier Joseph" in "Charles Louis Xavier Joseph de la Vall{\'e}e Poussin")
	* @param firstabbr the abbreviated first name of the author (may consist of several tokens, like "C. L. X. J." in "Charles Louis Xavier Joseph de la Vall{\'e}e Poussin"). It is a responsibility of the caller to create a reasonable abbreviation of the first name.
	* @param von the von part of the author's name (may consist of several tokens, like "de la" in "Charles Louis Xavier Joseph de la Vall{\'e}e Poussin")
	* @param last the last name of the author (may consist of several tokens, like "Vall{\'e}e Poussin" in "Charles Louis Xavier Joseph de la Vall{\'e}e Poussin")
	* @param jr the junior part of the author's name (may consist of several tokens, like "Jr. III" in "Smith, Jr. III, John")
	*/
	public Author(String first, String firstabbr, String von, String last, String jr) {

Added option to import CFF files #7946

Added option to import CFF files #7946

Conversation

AidanM11 commented Jul 29, 2021 • edited by Siedlerchr Loading

Siedlerchr commented Jul 30, 2021

Siedlerchr commented Jul 30, 2021

Siedlerchr commented Aug 16, 2021

AidanM11 commented Aug 16, 2021

Siedlerchr Aug 17, 2021

Choose a reason for hiding this comment

tobiasdiez Aug 19, 2021

Choose a reason for hiding this comment

Siedlerchr Aug 17, 2021

Choose a reason for hiding this comment

sdruskat Aug 20, 2021

Choose a reason for hiding this comment

jspaaks Aug 24, 2021

Choose a reason for hiding this comment

Siedlerchr Aug 24, 2021

Choose a reason for hiding this comment

sdruskat commented Aug 20, 2021

sdruskat commented Aug 20, 2021

AidanM11 commented Aug 21, 2021

Siedlerchr commented Aug 21, 2021

jspaaks commented Aug 24, 2021

Siedlerchr commented Aug 28, 2021

Siedlerchr Aug 30, 2021

Choose a reason for hiding this comment

Siedlerchr left a comment

Choose a reason for hiding this comment

AidanM11 commented Sep 1, 2021

Siedlerchr left a comment

Choose a reason for hiding this comment

calixtus left a comment

Choose a reason for hiding this comment

AidanM11 commented Jul 29, 2021 •

edited by Siedlerchr

Loading