Error importing groupstree version 3 generated by Better BibTeX for Zotero #2477

Closed
retorquere opened this issue Jan 22, 2017 · 26 comments
Labels: bug, groups, type: documentation

Comments

@retorquere

I'm the author of Zotero Better BibTeX; my extension generates BibTeX including a JabRef groupstree that is intended to import cleanly into JabRef. Recently I've had reports that this import fails; this means my implementation of the groupstree is faulty, but as I cannot find any documentation on the format, I don't know in what way it is faulty.

I have a sample at https://drive.google.com/open?id=0BxFpK0V-elKWSVE1ejdLVkdiNXc, if someone could have a look at what I did wrong it'd be hugely appreciated. The BBT-generated groupstree is very simple and always only has ExplicitGroup entries.

@retorquere changed the title from "Importing groupstree version 3 generated by Better BibTeX for Zotero" to "Error importing groupstree version 3 generated by Better BibTeX for Zotero" Jan 22, 2017
@lenhard
Member

lenhard commented Jan 23, 2017

Hi @retorquere

Thanks for your report and thanks for supporting an integration with JabRef. Recently, there have been considerable changes in the way information about groups is serialized. This is probably why an import of your generated BibTeX no longer works. Unfortunately, the current serialization is not documented either.

I am far from being an expert on groups and hopefully our group expert @tobiasdiez can find the time to look at this. The most notable change is that the group information is serialized together with an entry in a field and the meta data only contain the structure of the groups tree. Here is an example of one entry with a group:

@Article{somekey,
  author    = {Author Author},
  title     = {Title Title},
  groups    = {group},
}

@Comment{jabref-meta: groupstree:
0 AllEntriesGroup:;
1 ExplicitGroup:group\;0\;;
}

Maybe we should take this issue as an opportunity to document the group syntax of JabRef.

@retorquere
Author

retorquere commented Jan 23, 2017

Oh wow that's a very significant change. This would mean groups get some kind of unique ID then. It'd be good to know what kind of restrictions are in place on group IDs as I would have to generate them, and the Zotero collection names from which I'd have to do so will almost certainly contain illegal characters. And the names are not by necessity unique even on a single parent.

Is this format already out in the wild? Or is there still opportunity to weigh in on the format?

@retorquere
Author

Unless #1495 is still at play? Would that mean that an entry belongs to any group which happens to have that exact name?

@lenhard
Member

lenhard commented Jan 23, 2017

As far as I am aware, special characters in groups should not be a problem. However, somebody else should confirm this.

The format is indeed out in the wild. Nevertheless, if you have a suggestion for improving it, please go ahead and let us know. The format is not set in stone and even if we do not follow a suggestion, discussing the format here helps to clarify it. Regarding #1495: Duplicate group names are still causing us massive headaches and, unfortunately, there is no solution in sight for this issue.

@retorquere
Author

My go-to solution for such things would have been JSON, either inside the comment or by just having the outer braces already present on the comment block be the outer braces of a JSON construct. I know this might run into unbalanced braces if any of the values or keys has braces in them, but there are several ways to tackle that:

  1. URL-encode braces and percent signs. That would leave the groups very readable but still safe.
  2. Base64 the whole lot. Simple, but non-human-readable.
  3. Skip the @comment markers and do something like what's below:
{ "jabref-groupstree": 7,
  <any other stuff>
} 

The "magic" here would be to always format the first line that way, and to have the closing brace be the first following line that starts with a non-space character. Since bib(la)tex ignores anything that's not a valid reference, it should be safe to put the block there, and with that special formatting it would be both valid JSON (super easy to parse) and easy to pick out while scanning line by line.
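
To illustrate the line-by-line scanning idea, here is a minimal Python sketch. It assumes the block layout proposed in this comment; the function names and the "groups" key in the usage note are hypothetical, and nothing here is an existing JabRef or BBT format.

```python
import json
from urllib.parse import unquote

# Assumed fixed form of the first line of the proposed block.
OPENING = '{ "jabref-groupstree":'


def encode_braces(text):
    """Option 1: percent-encode '%', '{' and '}' so braces stay balanced."""
    return text.replace("%", "%25").replace("{", "%7B").replace("}", "%7D")


def decode_braces(text):
    """Undo encode_braces (decodes every percent escape in one pass)."""
    return unquote(text)


def extract_groups_block(bib_text):
    """Return the parsed JSON block, or None if the file has none."""
    block = []
    for line in bib_text.splitlines():
        if not block:
            # Still searching for the specially formatted first line.
            if line.startswith(OPENING):
                block.append(line)
        else:
            block.append(line)
            # The closing brace is the first line at column 0.
            if line.startswith("}"):
                return json.loads("\n".join(block))
    return None
```

With a file that contains an ordinary entry followed by `{ "jabref-groupstree": 7,`, an indented `"groups": ["a", "b"]` line and a closing `}` at column 0, `extract_groups_block` returns the parsed dictionary and skips the entry.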

@tobiasdiez
Member

tobiasdiez commented Jan 23, 2017

One of the aims of the syntax change was to let the user edit the group membership by hand in the BibTeX code. This would not be possible if the identifier were a magic hash, since nobody can remember such a string (this is also one of the reasons why we have problems with duplicate group names).

In any case, JabRef should still understand the old groups tree format and convert it to the new one automatically. Thus, it should be possible to open the above bib file without any problem. For now I have no idea where the issue lies, and sadly I also don't have the time right now to debug it further. Sorry.

Special characters in the group name shouldn't be a problem (in principle, I'm never sure about anything with JabRef's code 😸 ).

@tobiasdiez added the bug label Jan 23, 2017
@retorquere
Author

But the json would then serve the dual purpose of being easier to parse and to edit for the user - even just keeping the old groups tree idea with the keys in the groups.

@retorquere
Author

I've only had reports that JabRef can't open that sample file without discarding the groups (although it does warn it will). On my macbook, I can't open the file at all. JabRef says that it's importing the file but doesn't seem to be doing anything, even if I just let it sit for 10 minutes.

@lenhard
Member

lenhard commented Jan 24, 2017

JSON is theoretically easy to parse, but the problem is that we do not start from scratch. JabRef has a rather complex parser and supporting a new style of format would essentially result in a re-write of this parser. That doesn't mean that you do not have a point, and we should keep the suggestion in mind in case we go for that option one day.

Regarding your bib file: When opening it, the error console shows an exception in the groups parser. This might be fixable:

11:23:00.921 [AWT-EventQueue-0] INFO  net.sf.jabref.logic.importer.OpenDatabase - Opening: \\nas-a1\redirected$\jorglenh\Desktop\test.bib
11:23:29.232 [JabRef CachedThreadPool] ERROR net.sf.jabref.FallbackExceptionHandler - Uncaught exception Occurred in Thread[JabRef CachedThreadPool,6,main]
java.lang.NumberFormatException: For input string: ""
	at java.lang.NumberFormatException.forInputString(Unknown Source) ~[?:1.8.0_111]
	at java.lang.Integer.parseInt(Unknown Source) ~[?:1.8.0_111]
	at java.lang.Integer.parseInt(Unknown Source) ~[?:1.8.0_111]
	at net.sf.jabref.logic.importer.util.GroupsParser.explicitGroupFromString(GroupsParser.java:124) ~[JabRef-3.8.jar:?]
	at net.sf.jabref.logic.importer.util.GroupsParser.fromString(GroupsParser.java:85) ~[JabRef-3.8.jar:?]
	at net.sf.jabref.logic.importer.util.GroupsParser.importGroups(GroupsParser.java:42) ~[JabRef-3.8.jar:?]
	at net.sf.jabref.logic.importer.util.MetaDataParser.parse(MetaDataParser.java:62) ~[JabRef-3.8.jar:?]
	at net.sf.jabref.logic.importer.util.MetaDataParser.parse(MetaDataParser.java:32) ~[JabRef-3.8.jar:?]
	at net.sf.jabref.logic.importer.fileformat.BibtexParser.parseFileContent(BibtexParser.java:237) ~[JabRef-3.8.jar:?]
	at net.sf.jabref.logic.importer.fileformat.BibtexParser.parse(BibtexParser.java:169) ~[JabRef-3.8.jar:?]
	at net.sf.jabref.logic.importer.fileformat.BibtexParser.parse(BibtexParser.java:89) ~[JabRef-3.8.jar:?]
	at net.sf.jabref.logic.importer.fileformat.BibtexImporter.importDatabase(BibtexImporter.java:70) ~[JabRef-3.8.jar:?]
	at net.sf.jabref.logic.importer.Importer.importDatabase(Importer.java:74) ~[JabRef-3.8.jar:?]
	at net.sf.jabref.logic.importer.fileformat.BibtexImporter.importDatabase(BibtexImporter.java:64) ~[JabRef-3.8.jar:?]
	at net.sf.jabref.logic.importer.OpenDatabase.loadDatabase(OpenDatabase.java:66) ~[JabRef-3.8.jar:?]
	at net.sf.jabref.gui.importer.actions.OpenDatabaseAction.openTheFile(OpenDatabaseAction.java:212) ~[JabRef-3.8.jar:?]
	at net.sf.jabref.gui.importer.actions.OpenDatabaseAction.lambda$openFiles$0(OpenDatabaseAction.java:152) ~[JabRef-3.8.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:1.8.0_111]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:1.8.0_111]
	at java.lang.Thread.run(Unknown Source) [?:1.8.0_111]

@retorquere
Author

I don't know the JabRef parser of course, but I figured it would need to read the whole @comment block in any case before decoding it. Does the JabRef parser do streaming decoding of that block? If so, streaming parsing would still be possible, but I can see why it'd be low priority for JabRef.

How do I get to the error console, and what does this error tell me? Does it give me a position where it finds the offending input? I think it expects a number at the failure position, but from my reverse engineering of the groups format (and my limited use of it) there should be only two places a number is expected:

  1. The level in the hierarchy at the start of the group
  2. The intersection model of the group (which is always 0 for BBT-generated group hierarchies)

but the sample file has those two on all of the lines.

Is the line-wrapping required for jabref BTW or is it a convenience feature only? And should I move to the new groups format or wait until #1495 is resolved?

@lenhard
Member

lenhard commented Jan 24, 2017

The JabRef parser was implemented 15 years ago and has evolved since then :) It is built on a PushbackReader, essentially a queue that reads the file character by character and, if needed, pushes characters that have already been read back to the head of the queue.
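
For readers unfamiliar with the pattern, a pushback reader can be sketched in a few lines of Python. This only illustrates the general technique and is not JabRef's code (JabRef is written in Java); the class and the helper below are hypothetical.

```python
import io


class PushbackReader:
    """Read one character at a time, with the option to 'unread' characters."""

    def __init__(self, stream):
        self._stream = stream
        self._pushed = []  # characters pushed back, most recent last

    def read(self):
        """Return the next character, or '' at end of input."""
        if self._pushed:
            return self._pushed.pop()
        return self._stream.read(1)

    def unread(self, ch):
        """Push a character back so the next read() returns it again."""
        self._pushed.append(ch)


def skip_whitespace(reader):
    """Consume whitespace, leaving the first other character unread."""
    while True:
        ch = reader.read()
        if ch == "" or not ch.isspace():
            if ch:
                reader.unread(ch)
            return
```

A parser built this way can look one character ahead, decide that it belongs to the next token, and hand it back, which is why it processes the file as a stream rather than block by block.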

You get to the error console via the menu: Help -> Show error console. To me, the error looks more like a bug in JabRef. Even if the group code is faulty, the GroupsParser should not fail with a NumberFormatException. It does not tell you the exact position that triggers the error, but it might help us in debugging and finding this position. This is the reason why I posted the stack trace.

Regarding line wrapping in the groups format, I would again need the advice of @tobiasdiez

My hope would be that we just find the error in the current parsing that triggers this exception and after that your generated groups work again.

@retorquere
Author

OK so in the error console I now see this same error message, but the status remains set at "Importing in unknown format". I assume that this does not actually mean the import is still running. I'll just wait for now before doing anything in BBT.

Except the groups format perhaps. Should I be generating the new format? At the very least I should be parsing it.

@lenhard
Member

lenhard commented Jan 24, 2017

You can go ahead with implementing the new format. We will keep support for reading the old format for quite some time (and will investigate this error, though I cannot guarantee the time frame). But we will definitely not roll back to the old format.

Newer versions of JabRef exclusively write out the new format, so support for parsing it makes sense. You can also go ahead with generating it, if you do not mind its drawbacks. Despite those, the current format has been stable for quite a while and we will keep it stable for quite a while longer. My hope is that we can just resolve the duplicates bug inside the application code (without touching the format).

@retorquere
Author

So how are multiple groups separated in the current implementation? Comma, semicolon?

@lenhard
Member

lenhard commented Jan 24, 2017

Extending the example above:

@Article{somekey,
  author    = {Author Author},
  title     = {Title Title},
  groups    = {group, anothergroup},
}

@Comment{jabref-meta: groupstree:
0 AllEntriesGroup:;
1 ExplicitGroup:group\;0\;;
1 ExplicitGroup:anothergroup\;0\;;
}

Ergo: Semicolon.

The safest way to get information about the group format is probably to install jabref and see what it serializes ;-)

@retorquere
Author

That's what I did before and now it fails to parse in JabRef 😉. I meant to ask how to separate the groups in the reference, so that's a comma. Parsing should already work (it was a relatively minor change); writing out is en route.
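
A minimal sketch of writing the per-entry groups field, assuming the comma separator shown in the example above. The function name and the input mapping are hypothetical; this is not BBT's or JabRef's actual code.

```python
def groups_fields(members):
    """Invert {group name: [citation keys]} into {citation key: groups field}."""
    by_key = {}
    for group, keys in members.items():
        for key in keys:
            by_key.setdefault(key, []).append(group)
    # Join the group names with ", " as in the groups = {group, anothergroup} example.
    return {key: ", ".join(groups) for key, groups in by_key.items()}
```

Calling `groups_fields({"group": ["somekey"], "anothergroup": ["somekey"]})` yields the field value `group, anothergroup` for `somekey`, matching the entry in the example.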

@tobiasdiez
Member

I think I found the reason why your example bib file does not import correctly. A PR is hopefully coming this evening.

@tobiasdiez self-assigned this Jan 24, 2017
@retorquere
Author

Alright, next release of BBT will have format 4 (I suppose) parsing & writing, and will still parse format 3. I can see why you chose to go this way -- the changes were minimal.

@retorquere
Author

How should the groups field be treated? Literal list, Literal field, Verbatim field, Separated value field? I've just fiddled around a bit and I can create a group called "a,b" alongside separate groups "a" and "b", and the references assigned to "a,b" get groups = {a,b}. BBT would parse this as two separate groups.
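
The ambiguity is easy to reproduce with a naive comma split. The sketch below is hypothetical and only shows the problem; it is not how BBT or JabRef actually parses the field.

```python
def split_groups_field(value):
    """Naively split a groups field on commas and trim whitespace."""
    return [name.strip() for name in value.split(",") if name.strip()]
```

Both `groups = {a,b}` written for the single group "a,b" and the same text written for the two groups "a" and "b" split into `["a", "b"]`, so the reader cannot tell the two cases apart.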

@retorquere
Author

Perhaps a Name list would be better, as its behavior is a little better defined. The group "a,b" would then become groups = {a,b} or groups = {{a,b}} (which would be equivalent from BBT's point of view), but a reference belonging to all three of "a,b", "a" and "and" would become groups = {a,b and a and {and}}.
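
A rough sketch of how such a name-list style field could be split, assuming " and " outside braces is the separator and one pair of enclosing braces protects a name. The function names are hypothetical and this is a simplification of BibTeX name-list rules.

```python
def strip_outer_braces(name):
    """Remove one enclosing pair of braces, e.g. '{and}' becomes 'and'."""
    if len(name) >= 2 and name.startswith("{") and name.endswith("}"):
        return name[1:-1]
    return name


def split_name_list(value):
    """Split on ' and ' at brace depth 0, then unwrap protected names."""
    parts, start, depth, i = [], 0, 0, 0
    while i < len(value):
        ch = value[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif depth == 0 and value.startswith(" and ", i):
            parts.append(value[start:i])
            i += len(" and ")
            start = i
            continue
        i += 1
    parts.append(value[start:])
    return [strip_outer_braces(part.strip()) for part in parts]
```

Under these assumptions the field `a,b and a and {and}` splits into the three names "a,b", "a" and "and", and a comma inside a name no longer acts as a separator.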

@tobiasdiez
Member

So the problem was that some group names in the example bib file contained non-escaped backslashes. For example, ExplicitGroup:slit\. The correct way to escape is with 4 backslashes, i.e., ExplicitGroup:slit\\\\\;0\;mertsch_slit2_2007\;;. Don't ask me why 4 backslashes and not just 2 (@lenhard, do you have an idea why?).
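
To see why four backslashes decode to one, here is a minimal Python sketch that unescapes a groupstree line twice. It is reconstructed only from the examples in this thread, not from JabRef's source; the function names are hypothetical, and the under-escaped line in the usage note is an assumed illustration, not the content of the reported file.

```python
def split_quoted(text, sep=";"):
    """Split on unescaped separators; a backslash escapes the next character."""
    fields, current, escaped = [], [], False
    for ch in text:
        if escaped:
            current.append(ch)
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == sep:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:  # leftover text without a closing separator
        fields.append("".join(current))
    return fields


def parse_group_line(line):
    """Return (level, group type, fields) for one groupstree line."""
    level, _, rest = line.partition(" ")
    kind, _, payload = rest.partition(":")
    outer = split_quoted(payload)                     # first unescape pass
    fields = split_quoted(outer[0]) if outer else []  # second unescape pass
    return int(level), kind, fields
```

The correctly escaped line from this comment decodes to the fields `slit\`, `0` and `mertsch_slit2_2007`. A hypothetical under-escaped variant such as `1 ExplicitGroup:slit\\;0\;mertsch_slit2_2007\;;` decodes to the single field `slit`, because the raw backslash swallows the escape that was meant for the semicolon. That leaves no integer field to read, which would be consistent with the empty-string NumberFormatException in the stack trace, though that link is a guess.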

With #2488 a proper error message should be shown if the "damaged" database is imported.

@retorquere
Author

I think I may know - it looks like the JabRef groups are just lists of strings, which are encoded into a single string by escaping backslashes and semicolons, and then joined by a semicolon. The hierarchy is then treated the same way, which leads to the double escaping. I'll make the change sometime today - should be easy.
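
A minimal sketch of that double encoding in Python. It is a reconstruction from the examples in this thread only (each field escaped and terminated by a semicolon, then the result escaped and terminated again), not JabRef's actual code; the function names are hypothetical.

```python
def quote(text):
    """Escape backslashes first, then semicolons, each with a backslash."""
    return text.replace("\\", "\\\\").replace(";", "\\;")


def join_fields(fields):
    """Escape every field and terminate it with a semicolon."""
    return "".join(quote(field) + ";" for field in fields)


def explicit_group_line(level, name, keys=()):
    """Build one ExplicitGroup line; '0' is the intersection model BBT always uses."""
    inner = join_fields([name, "0", *keys])  # first level of escaping
    return f"{level} ExplicitGroup:{quote(inner)};"  # second level
```

With these assumptions, `explicit_group_line(1, "group")` reproduces `1 ExplicitGroup:group\;0\;;` from the earlier example, and a group named `slit\` with the key `mertsch_slit2_2007` produces the four-backslash form quoted in the previous comment.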

@retorquere
Author

OK, I think I have this fixed now.

@retorquere
Author

Yeah, I have tests confirming the fix. My implementation of the groups format was pretty whack, that should now be fixed.

What is the meaning of the empty field at the end of an ExplicitGroup BTW?

@tobiasdiez
Member

If I understand the code correctly, the last field used to contain the list of referenced entries. It is now empty, since newer versions store this information directly in the entry (as the groups field), but it is kept for semantic reasons.

@retorquere
Author

As far as I can tell the empty last cell was always there, also in groupstree format 3.
