Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to latex2unicode lib instead of own handling #2465

Closed
Siedlerchr opened this issue Jan 13, 2017 · 8 comments
Closed

Switch to latex2unicode lib instead of own handling #2465

Siedlerchr opened this issue Jan 13, 2017 · 8 comments

Comments

@Siedlerchr
Copy link
Member

This would solve some serious issues:

Idea: Use https://github.com/tomtung/latex2unicode
Provides a simple extendable grammar.

We then would only need a conversion to html

Refs #2063 #207 #1252

@tobiasdiez
Copy link
Member

Or: #1252

@lenhard
Copy link
Member

lenhard commented Feb 3, 2017

I set up a working implementation using latex2unicode in the latex2unicode branch. Here is the diff: https://github.com/JabRef/jabref/compare/latex2unicode

I had some intial setup problems, but the author helped me to resolve that in tomtung/latex2unicode#1

The library is really easy to use and would be great in terms of code complexity. It also fixes a few conversion problems we are having. However, here are the results from the benchmark:

On current master:

Benchmark                             Mode  Cnt      Score       Error  Units
Benchmarks.latexToUnicodeConversion  thrpt   20  100036.407 ▒ 1406.243  ops/s
Benchmarks.parallelSearch            thrpt   20     733.635 ▒   74.242  ops/s
Benchmarks.parse                     thrpt   20      26.130 ▒    0.653  ops/s
Benchmarks.search                    thrpt   20     374.335 ▒   17.097  ops/s

Using latex2unicode

Benchmark                             Mode  Cnt      Score       Error  Units
Benchmarks.latexToUnicodeConversion  thrpt   20    1936.683 ▒   75.321  ops/s
Benchmarks.parallelSearch            thrpt   20     744.081 ▒   78.945  ops/s
Benchmarks.parse                     thrpt   20      23.939 ▒    1.574  ops/s
Benchmarks.search                    thrpt   20     402.994 ▒   16.057  ops/s

This is a drop in op/s by two orders of magnitude. It is also noticeable when starting up JabRef. I hate to say this since I really like the library, but it looks as if we cannot use it due to performance.

@tobiasdiez
Copy link
Member

Did you tried the benchmark without the normalizer and the replacement of tildes? Maybe these post-and preconversations are a problem.
But probably the preformance problems comes from the fact that latex2unicode really uses a grammar to parse the latex code. In this case, I would propose to merge both projects:

@lenhard
Copy link
Member

lenhard commented Feb 3, 2017

I did not try the benchmark without normalizer and replacement of tildes. Could you maybe check out the branch and do a benchmark? Maybe this is just a weird configuration issue on my system (although I don't think so).

The single regular expression with the tildes will not make a difference. We have several of those in our current converter. The normalization cannot be avoided, otherwise the output of latex2unicode will not be usable for JabRef. This is discussed in the issue linked above.

I guess (that is really just a guess) that the performance drop comes from the fact that latex2unicode is implemented in Scala and we are using it in Java. No matter how close Scala is integrated with Java, the friction inevitably will cause an overhead.

I aggree that a structure similar to the implementation of latex2unicode would be desirable, but we have to acknowledge that this corresponds to a rewrite. This should not be done purely for the sake of beauty, but when fixing one of the issues linked above.

@lenhard
Copy link
Member

lenhard commented Feb 3, 2017

@tomtung wants to do some optimization in latex2unicode, see: tomtung/latex2unicode#1 (comment)

We will try a new version of the library when it is available.

@lenhard
Copy link
Member

lenhard commented Feb 9, 2017

So @tomtung did some optimization and I did a new performance benchmark, the results are as follows:

Benchmark                             Mode  Cnt      Score      Error  Units
Benchmarks.latexToUnicodeConversion  thrpt   20  72417.919 ▒ 1906.792  ops/s
Benchmarks.parallelSearch            thrpt   20    570.149 ▒   52.298  ops/s
Benchmarks.parse                     thrpt   20     19.891 ▒    2.068  ops/s
Benchmarks.search                    thrpt   20    287.523 ▒   46.959  ops/s

This is a huge boost from the previous version, although it is not up to the level that we have with our own conversion. Nevertheless, I would strongly suggest to go for it. I did not notice a significantly longer delay in opening JabRef, even when opening Aegits gigantic file.

What I see as critical is that the library gives us a much more complete Latex to unicode conversion that fixes a number of bugs which we currently have listed in our issue tracker. And it even has potential for much more than that. For instance, we can now correctly convert italics (which cannot easily be displayed, so there is not much we can do with it right now, but anyway). The code can be viewed as part of #2532.

@JabRef/developers What do you think?

@Siedlerchr
Copy link
Member Author

Siedlerchr commented Feb 9, 2017 via email

@lenhard
Copy link
Member

lenhard commented Feb 10, 2017

#2532 is merged, so this can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants