Switch to latex2unicode lib instead of own handling #2465

Siedlerchr · 2017-01-13T17:02:55Z

This would solve some serious issues:

Idea: Use https://github.com/tomtung/latex2unicode
Provides a simple extendable grammar.

We would then only need a conversion to HTML.

Refs #2063 #207 #1252

tobiasdiez · 2017-01-13T19:56:08Z

lenhard · 2017-02-03T12:04:57Z

I set up a working implementation using latex2unicode in the latex2unicode branch. Here is the diff: https://github.com/JabRef/jabref/compare/latex2unicode

I had some initial setup problems, but the author helped me resolve them in tomtung/latex2unicode#1

The library is really easy to use and would be great in terms of code complexity. It also fixes a few conversion problems we are having. However, here are the results from the benchmark:

On current master:

Benchmark                             Mode  Cnt      Score       Error  Units
Benchmarks.latexToUnicodeConversion  thrpt   20  100036.407 ± 1406.243  ops/s
Benchmarks.parallelSearch            thrpt   20     733.635 ±   74.242  ops/s
Benchmarks.parse                     thrpt   20      26.130 ±    0.653  ops/s
Benchmarks.search                    thrpt   20     374.335 ±   17.097  ops/s

Using latex2unicode

Benchmark                             Mode  Cnt      Score       Error  Units
Benchmarks.latexToUnicodeConversion  thrpt   20    1936.683 ±   75.321  ops/s
Benchmarks.parallelSearch            thrpt   20     744.081 ±   78.945  ops/s
Benchmarks.parse                     thrpt   20      23.939 ±    1.574  ops/s
Benchmarks.search                    thrpt   20     402.994 ±   16.057  ops/s

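For latexToUnicodeConversion, this is a drop in ops/s by a factor of about 50, so almost two orders of magnitude; the other three benchmarks stay within or close to their error margins. The slowdown is also noticeable when starting up JabRef. I hate to say this since I really like the library, but it looks as if we cannot use it due to performance.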

tobiasdiez · 2017-02-03T13:14:38Z

Did you try the benchmark without the normalizer and the replacement of tildes? Maybe these pre- and post-conversions are a problem.
But the performance problems probably come from the fact that latex2unicode really uses a grammar to parse the LaTeX code. In this case, I would propose to merge both projects:

- reuse our parser
- but convert in a way similar to latex2unicode, in particular use something similar to the helper methods https://github.com/tomtung/latex2unicode/tree/master/src/main/scala/com/github/tomtung/latex2unicode/helper to make the code more readable.
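
To make the proposal a bit more concrete, here is a minimal sketch of what "own scanner plus helper table" could look like. It is not JabRef's parser and not latex2unicode's code: the class `AccentHelperSketch`, the `COMBINING` table, and the `convert` method are hypothetical names, and only a handful of accent commands are covered. The idea is that the hand-written loop stays cheap while the mapping itself lives in a readable lookup table.

```java
import java.util.Map;

public class AccentHelperSketch {

    // Hypothetical helper table: accent command character -> Unicode combining mark.
    private static final Map<Character, Character> COMBINING = Map.of(
            '"', '\u0308',   // diaeresis
            '\'', '\u0301',  // acute
            '`', '\u0300',   // grave
            '^', '\u0302',   // circumflex
            '~', '\u0303');  // tilde

    // Handles accent commands written with or without braces around a single letter.
    // Everything else is copied through unchanged.
    static String convert(String latex) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < latex.length()) {
            char c = latex.charAt(i);
            if (c == '\\' && i + 2 < latex.length() && COMBINING.containsKey(latex.charAt(i + 1))) {
                char mark = COMBINING.get(latex.charAt(i + 1));
                int j = i + 2;
                boolean braced = latex.charAt(j) == '{';
                if (braced) {
                    j++;
                }
                if (j < latex.length()) {
                    // Emit the base letter followed by the combining mark.
                    out.append(latex.charAt(j)).append(mark);
                    j++;
                    if (braced && j < latex.length() && latex.charAt(j) == '}') {
                        j++;
                    }
                    i = j;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the name with base letters plus combining marks (not yet normalized).
        System.out.println(convert("M\\\"{u}ller and Caf\\'e"));
    }
}
```

The output here is in decomposed form (base letter followed by a combining mark), so a normalization step would still be needed before the result is used elsewhere.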

lenhard · 2017-02-03T15:13:27Z

I did not try the benchmark without normalizer and replacement of tildes. Could you maybe check out the branch and do a benchmark? Maybe this is just a weird configuration issue on my system (although I don't think so).

The single regular expression with the tildes will not make a difference. We have several of those in our current converter. The normalization cannot be avoided, otherwise the output of latex2unicode will not be usable for JabRef. This is discussed in the issue linked above.
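
For readers who have not looked at the branch, the two steps in question can be sketched roughly as follows. This is an illustration under assumptions, not the code from the branch: the class and method names are made up, mapping `~` to a non-breaking space (U+00A0) is an assumption about what the tilde replacement does, and the sample input stands in for converter output that contains a combining mark. Only `java.text.Normalizer` and `String.replaceAll` are real JDK APIs.

```java
import java.text.Normalizer;

public class PostProcessSketch {

    // Assumed pre-step: replace LaTeX's non-breaking tilde with U+00A0,
    // but leave the accent command (backslash followed by tilde) alone.
    static String replaceTildes(String latex) {
        return latex.replaceAll("(?<!\\\\)~", "\u00A0");
    }

    // Post-step: compose base letter + combining mark into a single code point (NFC).
    static String normalize(String converted) {
        return Normalizer.normalize(converted, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "Mu\u0308ller"; // stand-in for converter output
        String composed = normalize(decomposed);
        // The composed string is one char shorter than the decomposed one.
        System.out.println(decomposed.length() + " -> " + composed.length());
    }
}
```

Without the NFC step, a string such as "Müller" produced by the converter would not compare equal to the same name typed directly, which is why the normalization cannot simply be dropped.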

I guess (and that is really just a guess) that the performance drop comes from the fact that latex2unicode is implemented in Scala and we are using it from Java. No matter how closely Scala is integrated with Java, the friction will inevitably cause an overhead.

I agree that a structure similar to the implementation of latex2unicode would be desirable, but we have to acknowledge that this corresponds to a rewrite. This should not be done purely for the sake of beauty, but when fixing one of the issues linked above.

lenhard · 2017-02-03T16:55:36Z

@tomtung wants to do some optimization in latex2unicode, see: tomtung/latex2unicode#1 (comment)

We will try a new version of the library when it is available.

lenhard · 2017-02-09T09:05:25Z

So @tomtung did some optimization and I did a new performance benchmark, the results are as follows:

Benchmark                             Mode  Cnt      Score      Error  Units
Benchmarks.latexToUnicodeConversion  thrpt   20  72417.919 ± 1906.792  ops/s
Benchmarks.parallelSearch            thrpt   20    570.149 ±   52.298  ops/s
Benchmarks.parse                     thrpt   20     19.891 ±    2.068  ops/s
Benchmarks.search                    thrpt   20    287.523 ±   46.959  ops/s

This is a huge boost from the previous version, although it is not up to the level that we have with our own conversion. Nevertheless, I would strongly suggest going for it. I did not notice a significantly longer delay in opening JabRef, even when opening AEgit's gigantic file.
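
To put the three latexToUnicodeConversion scores side by side, here is a small arithmetic sketch. The class and method names are made up for illustration; the numbers are the Score values from the three benchmark tables in this thread, and the error margins are ignored.

```java
import java.util.Locale;

public class ThroughputComparison {

    // Plain ratio of two throughput scores (ops/s).
    static double factor(double numerator, double denominator) {
        return numerator / denominator;
    }

    public static void main(String[] args) {
        double master = 100036.407;       // own conversion on master
        double firstAttempt = 1936.683;   // latex2unicode before optimization
        double optimized = 72417.919;     // latex2unicode after optimization

        System.out.printf(Locale.ROOT, "master / first attempt: %.1fx%n",
                factor(master, firstAttempt));
        System.out.printf(Locale.ROOT, "optimized / first attempt: %.1fx%n",
                factor(optimized, firstAttempt));
        System.out.printf(Locale.ROOT, "optimized as share of master: %.1f%%%n",
                100.0 * factor(optimized, master));
    }
}
```

In other words, the first attempt was roughly 50 times slower than master, the optimized library is roughly 37 times faster than that first attempt, and it ends up at a bit over 70% of the throughput of the existing converter.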

What I see as critical is that the library gives us a much more complete LaTeX to Unicode conversion that fixes a number of bugs which we currently have listed in our issue tracker. And it even has potential for much more than that. For instance, we can now correctly convert italics (which cannot easily be displayed, so there is not much we can do with it right now, but anyway). The code can be viewed as part of #2532.

@JabRef/developers What do you think?

Siedlerchr · 2017-02-09T09:23:29Z

I would vote for a go. The performance seems to be acceptable.

lenhard · 2017-02-10T09:05:51Z

#2532 is merged, so this can be closed.

Siedlerchr added architecture component: cleanup-ops labels Jan 13, 2017

This was referenced Jan 30, 2017

Entry preview: wrong display of apostrophe (treated as special character) #2500

Closed

Problems in setting up latex2unicode tomtung/latex2unicode#1

Closed

lenhard mentioned this issue Feb 3, 2017

Groups invisible and scrolling does not work (Regression 3.8) #2404

Closed

AEgit mentioned this issue Feb 3, 2017

'APOSTROPHE' (U+0027) not well displayed in the main table #2516

Closed

lenhard mentioned this issue Feb 9, 2017

Switch to Latex2unicode #2532

Merged

3 tasks

stefan-kolb closed this as completed Feb 10, 2017
