Latex to Unicode in author field fails #2063

Closed
grimes2 opened this issue Sep 25, 2016 · 14 comments
Labels: component: cleanup-ops; [outdated] type: bug (confirmed bugs or reports that are very likely to be bugs)

Comments

@grimes2
Contributor

grimes2 commented Sep 25, 2016

JabRef 3.7dev--snapshot--2016-09-25--master--40df60b
windows 10 10.0 amd64
Java 1.8.0_101

  1. author field in entry editor: Sto{\"{\i}}anova, Ivanka
  2. author field in main table: Stoanova, Ivanka
  3. should be: Stoïanova, Ivanka
@Siedlerchr
Member

Confirmed. Will take a look at it.

@Siedlerchr Siedlerchr self-assigned this Sep 26, 2016
@tobiasdiez tobiasdiez added the [outdated] type: bug Confirmed bugs or reports that are very likely to be bugs label Sep 26, 2016
@Siedlerchr
Member

Siedlerchr commented Sep 26, 2016

Ah, I found the problem:

For some reason I don't know, there exist two variants of this character, latin small letter i with diaeresis (U+00EF):
ftp://ftp.dante.de/tex-archive/macros/xetex/latex/xecjk/xunicode-symbols.pdf

However, JabRef only supports the first one: \"{i}

I will try to add the second variant as well.

@stefan-kolb
Member

@oscargus is the expert here 😄

@Siedlerchr
Member

The problem is theoretically easy to fix, but practically we have the problem that the mappings are stored in a HashMap and we have no chance to add a double entry. Fuu.. I hate HashMaps sometimes.

@koppor
Member

koppor commented Sep 26, 2016

Refs JabRef#145, #1215

@Siedlerchr
Member

@koppor Sure. I found that piece of code. However,

00EF ï has two possible mappings: \"{i} or \"{\i} (notice the backslash before the i).
The latter is used here, in the latest version (including Unicode 8):
https://www.w3.org/2003/entities/2007xml/unicode.xml (warning, file is around 5MB)

Both versions are valid. The question is: Do we find a solution to support multiple variants or do we stick with just one variant and then do nothing?

The current behaviour is more or less to remove the character if no conversion is possible.

@tobiasdiez
Member

I think you can just add a new entry to the CONVERSION_LIST. In the Unicode -> Latex direction only one of these entries is kept, but both should be added to the Latex -> Unicode map.

@Siedlerchr
Member

Siedlerchr commented Sep 26, 2016

That was my initial idea, too. However, the point is that we have a HashMap later on:

public static final Map<String, String> LATEX_UNICODE_CONVERSION_MAP = new HashMap<>();

And the Unicode number is the key. Problem: we now have 1 Unicode number -> 2 values.
So one would need a MultiMap or whatever.

@tobiasdiez
Member

But this map has the LaTeX code as the key and the Unicode symbol as the value; see

LATEX_UNICODE_CONVERSION_MAP.put(strippedLaTeX, unicodeSymbol);

So there is no problem if different LaTeX codes are mapped to a single Unicode symbol.
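The two lookup directions described here can be sketched roughly as follows. This is a minimal illustration, not JabRef's actual code: the class name, the simplified two-column conversion list, and the choice to keep the first entry in the Unicode -> LaTeX direction are all assumptions. The sketch also skips the key-stripping step that JabRef applies before inserting into the map.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class ConversionMapsSketch {
    // Hypothetical, simplified conversion list: {LaTeX code, Unicode symbol}.
    public static final String[][] CONVERSIONS = {
        {"\\\"{i}", "\u00EF"},    // \"{i}
        {"\\\"{\\i}", "\u00EF"},  // \"{\i}
    };

    // LaTeX code is the key, so several codes may share one Unicode value.
    public static final Map<String, String> LATEX_TO_UNICODE = new HashMap<>();
    // Unicode symbol is the key, so only one LaTeX code per symbol survives
    // (assumption: the first one listed wins).
    public static final Map<String, String> UNICODE_TO_LATEX = new LinkedHashMap<>();

    static {
        for (String[] pair : CONVERSIONS) {
            LATEX_TO_UNICODE.put(pair[0], pair[1]);
            UNICODE_TO_LATEX.putIfAbsent(pair[1], pair[0]);
        }
    }

    public static void main(String[] args) {
        System.out.println(LATEX_TO_UNICODE.size()); // 2: both variants are kept
        System.out.println(UNICODE_TO_LATEX.size()); // 1: a single LaTeX code per symbol
    }
}
```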

@Siedlerchr
Member

Okay, I tried @tobiasdiez's approach, and in theory it should work. However, while debugging I noticed that the cleanLaTeX method removes all backslashes and braces, which results in the same strippedLaTeX key for both entries...

{"239", "iuml", "{\\\"{i}}"},   // latin small letter i with diaeresis, U+00EF ISOlat1
{"239", "iuml", "{\\\"{\\i}}"}, // latin small letter i with diaeresis, U+00EF ISOlat1

private static String cleanLaTeX(String escapedString) {
    // Get rid of \{}$ from the LaTeX-string
    return escapedString.replaceAll("[\\\\\\{\\}\\$]", "");
}
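To make the collision concrete, here is a small standalone sketch that applies the same regex as the cleanLaTeX method quoted above to both variants. The wrapper class and the main method are only for illustration and are not part of JabRef.

```java
public class CleanLatexSketch {
    // Same regex as the quoted cleanLaTeX method: strips \, {, } and $.
    public static String cleanLaTeX(String escapedString) {
        return escapedString.replaceAll("[\\\\\\{\\}\\$]", "");
    }

    public static void main(String[] args) {
        String plain = cleanLaTeX("{\\\"{i}}");      // input: {\"{i}}
        String dotless = cleanLaTeX("{\\\"{\\i}}");  // input: {\"{\i}}
        System.out.println(plain);                   // prints "i
        System.out.println(dotless);                 // prints "i
        System.out.println(plain.equals(dotless));   // prints true
    }
}
```

Both inputs reduce to the same two-character key, so the second table entry adds nothing new to the map.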

@oscargus
Contributor

Yes, the key used in the lookup will/should be "i, so both variants should work (the reason being that {\"{i}}, \"{i}, \"{\i}, {\"{{\i}}}, etc. should all work). However, there is also some special handling of a few of the accents to allow writing e.g. \"i. Not sure if that may be the issue here...

@lenhard
Member

lenhard commented Jan 13, 2017

Related #2458

I had the very same problem in the above issue: the cleanLaTeX method strips away important information. Ultimately, the problem is not so much the maps but rather the way in which LaTexToUnicode queries these maps.

@lenhard
Member

lenhard commented Feb 10, 2017

We integrated a new library for performing the conversion in #2532: latex2unicode. As far as I have tested in the UI, the problems described here are solved now, so I am closing this issue.

Feel free to reopen in case the problem reappears!

@lenhard lenhard closed this as completed Feb 10, 2017
@lenhard
Member

lenhard commented Feb 10, 2017

Is now also tested with 9eef09c
