Latex to Unicode in author field fails #2063

grimes2 · 2016-09-25T18:38:42Z

JabRef 3.7dev--snapshot--2016-09-25--master--40df60b
windows 10 10.0 amd64
Java 1.8.0_101

author field in entry editor: Sto{\"{\i}}anova, Ivanka
author field in main table: Stoanova, Ivanka
should be: Stoïanova, Ivanka

Siedlerchr · 2016-09-26T11:08:33Z

Confirmed. Will take a look at it.

Siedlerchr · 2016-09-26T11:35:15Z

Ah, I found the problem:

For some reason I don't know, there are two LaTeX variants of this character, latin small letter i with diaeresis (U+00EF):
ftp://ftp.dante.de/tex-archive/macros/xetex/latex/xecjk/xunicode-symbols.pdf

However, JabRef only supports the first one: \"{i}

I will try to add the second variant as well.

stefan-kolb · 2016-09-26T12:08:33Z

@oscargus is the expert here 😄

Siedlerchr · 2016-09-26T16:31:15Z

The problem is theoretically easy to fix, but practically we have the problem that the mappings are stored in a HashMap and we have no chance to add a double entry. Fuu.. I hate HashMaps sometimes.

koppor · 2016-09-26T17:23:29Z

Refs JabRef#145, #1215

Siedlerchr · 2016-09-26T18:33:28Z

@koppor Sure. I found that piece of code. However,

00EF ï has two possible mappings: \"{i} or \"{\i} (notice the backslash before the i)
The latter one is used here, in the latest version (including Unicode 8):
https://www.w3.org/2003/entities/2007xml/unicode.xml (warning, file is around 5MB)

Both versions are valid. The question is: Do we find a solution to support multiple variants or do we stick with just one variant and then do nothing?

The current behaviour is, more or less, to remove the character if no conversion is possible.

tobiasdiez · 2016-09-26T20:29:23Z

I think you can just add a new entry to the CONVERSION_LIST. In the Unicode -> Latex direction only one of these entries is kept, but both should be added to the Latex -> Unicode map.

Siedlerchr · 2016-09-26T20:36:45Z

That was my initial idea, too. However, the point is that we have a HashMap later on:

jabref/src/main/java/net/sf/jabref/model/strings/HTMLUnicodeConversionMaps.java

Line 852 in ee688c4

    
    public static final Map<String, String> LATEX_UNICODE_CONVERSION_MAP = new HashMap<>();

And the Unicode number is the key. Problem: we now have 1 Unicode number -> 2 values.
So one would need a MultiMap or something similar.
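For illustration, the "MultiMap" idea mentioned above could be sketched with plain JDK collections as shown below. This is a minimal sketch, not JabRef code: the class `UnicodeToLatexVariants` and its methods are hypothetical names, and it assumes the Unicode number is kept as a decimal string such as "239", as in the conversion list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of JabRef: one Unicode number -> many LaTeX spellings.
class UnicodeToLatexVariants {

    // Key: Unicode number as decimal string (e.g. "239"); value: all known LaTeX variants.
    private final Map<String, List<String>> variants = new HashMap<>();

    // Appends a LaTeX variant instead of overwriting an earlier one.
    void add(String unicodeNumber, String latex) {
        variants.computeIfAbsent(unicodeNumber, key -> new ArrayList<>()).add(latex);
    }

    // Returns all variants registered for the number, or an empty list if there are none.
    List<String> get(String unicodeNumber) {
        return variants.getOrDefault(unicodeNumber, Collections.emptyList());
    }

    public static void main(String[] args) {
        UnicodeToLatexVariants map = new UnicodeToLatexVariants();
        map.add("239", "{\\\"{i}}");
        map.add("239", "{\\\"{\\i}}");
        System.out.println(map.get("239")); // both variants are retained
    }
}
```

A plain `HashMap<String, String>` would keep only the last value written for "239"; the list-valued map avoids that at the cost of having to pick one variant when converting Unicode back to LaTeX.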

tobiasdiez · 2016-09-26T21:06:42Z

But this map has the latex code as the key and unicode as the value, see

jabref/src/main/java/net/sf/jabref/model/strings/HTMLUnicodeConversionMaps.java

Line 873 in ee688c4

LATEX_UNICODE_CONVERSION_MAP.put(strippedLaTeX, unicodeSymbol);

So there is no problem if you have different LaTeX codes mapped to a single Unicode character.
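A small sketch of that point: when the LaTeX code is the key, two spellings can map to the same Unicode character without conflict, while the reverse map keeps only one of them. The class and method names here are hypothetical, the entries omit the HTML entity column, and the sketch assumes the keys are stored unstripped so that the two spellings stay distinct.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not JabRef code.
class TwoDirectionMaps {

    // {Unicode number, LaTeX code}; both rows describe U+00EF.
    static final String[][] CONVERSIONS = {
        {"239", "{\\\"{i}}"},
        {"239", "{\\\"{\\i}}"},
    };

    private static String symbolOf(String unicodeNumber) {
        return new String(Character.toChars(Integer.parseInt(unicodeNumber)));
    }

    // LaTeX -> Unicode: distinct keys, so both variants are kept.
    static Map<String, String> latexToUnicode() {
        Map<String, String> map = new HashMap<>();
        for (String[] entry : CONVERSIONS) {
            map.put(entry[1], symbolOf(entry[0]));
        }
        return map;
    }

    // Unicode -> LaTeX: same key twice, so only the first listed variant is kept.
    static Map<String, String> unicodeToLatex() {
        Map<String, String> map = new HashMap<>();
        for (String[] entry : CONVERSIONS) {
            map.putIfAbsent(symbolOf(entry[0]), entry[1]);
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(latexToUnicode().size()); // 2
        System.out.println(unicodeToLatex().size()); // 1
    }
}
```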

Siedlerchr · 2016-09-27T19:42:38Z

Okay, I tried @tobiasdiez's approach, and in theory it should work. However, through debugging I noticed that the cleanLaTeX method removes all backslashes and braces, which results in both entries getting the same strippedLaTeX key...

    {"239", "iuml", "{\\\"{i}}"},   // latin small letter i with diaeresis
    {"239", "iuml", "{\\\"{\\i}}"}, // latin small letter i with diaeresis, U+00EF ISOlat1

    private static String cleanLaTeX(String escapedString) {
        // Get rid of \{}$ from the LaTeX-string
        return escapedString.replaceAll("[\\\\\\{\\}\\$]", "");
    }
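The collapse described above can be reproduced in isolation. The sketch below copies the `cleanLaTeX` regex quoted in this comment into a hypothetical standalone class (`CleanLatexDemo` is not a JabRef class) and applies it to the two LaTeX variants from the conversion entries.

```java
// Hypothetical standalone demo; only the regex is taken from the snippet above.
class CleanLatexDemo {

    // Removes every backslash, brace, and dollar sign from the LaTeX string.
    static String cleanLaTeX(String escapedString) {
        return escapedString.replaceAll("[\\\\\\{\\}\\$]", "");
    }

    public static void main(String[] args) {
        String plainI = "{\\\"{i}}";      // LaTeX source: {\"{i}}
        String dotlessI = "{\\\"{\\i}}";  // LaTeX source: {\"{\i}}

        System.out.println(cleanLaTeX(plainI));   // prints "i
        System.out.println(cleanLaTeX(dotlessI)); // prints "i
    }
}
```

Because the backslash in `\i` is stripped along with the braces, both variants reduce to the same two-character key, so adding the second conversion entry does not create a second key in the map.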

oscargus · 2016-09-28T18:35:56Z

Yes, the key here used in the lookup will/should be "i, so both variants should work (the reason being that {\"{i}}, \"{i}, \"{\i}, {\"{{\i}}} etc should work). However, there is also some special handling of a few of the accents to allow writing e.g. \"i. Not sure if that may be the issue here...

lenhard · 2017-01-13T16:46:27Z

Related #2458

I had the very same problem in the above issue: The cleanLaTeX method stripping away important information. Ultimately the problem is not so much the maps, but rather the way in which LaTexToUnicode queries these maps.

lenhard · 2017-02-10T09:09:22Z

We integrated a new library for performing the conversion in #2532: latex2unicode. As far as I have tested in the UI, the problems described here are solved now, so I am closing this issue.

Feel free to reopen in case the problem reappears!

lenhard · 2017-02-10T10:51:44Z

Is now also tested with 9eef09c

Siedlerchr self-assigned this Sep 26, 2016

tobiasdiez added the [outdated] type: bug label Sep 26, 2016

Siedlerchr added the [outdated] type: question label Sep 26, 2016

tobiasdiez added component: cleanup-ops and removed [outdated] type: question labels Nov 11, 2016

Siedlerchr mentioned this issue Jan 13, 2017

Switch to latex2unicode lib instead of own handling #2465

Closed

lenhard mentioned this issue Feb 9, 2017

Switch to Latex2unicode #2532

Merged

3 tasks

lenhard closed this as completed Feb 10, 2017
