Case and diacritic folding for expression operators #4136

Closed
lucaswoj opened this issue Feb 1, 2017 · 17 comments
Labels
cross-platform 📺 Requires coordination with Mapbox GL Native (style specification, rendering tests, etc.) feature 🍏

Comments

@lucaswoj
Contributor

lucaswoj commented Feb 1, 2017

From @1ec5 on October 21, 2016 10:26

As discussed in mapbox/mapbox-gl-native#6781, it can often be necessary for a filter to perform a case-insensitive or diacritic-insensitive comparison. Variations on the existing operators should be added to support these options.

Here’s an example design (but by no means the best):

[
  {
    "operator": "in",
    "case-sensitive": false,
    "diacritic-sensitive": false,
  },
  "key",
  "value 1",
  "value 2"
]

Implementing case-insensitive comparisons should be trivial on all the platforms supported by Mapbox GL. On the other hand, while there are fine options for diacritic folding on the native platforms, JavaScript would have to rely on a library for diacritic-insensitive comparisons.

/cc @incanus @lucaswoj @jfirebaugh

Copied from original issue: mapbox/mapbox-gl-style-spec#548

@1ec5 1ec5 added the cross-platform 📺 Requires coordination with Mapbox GL Native (style specification, rendering tests, etc.) label Feb 3, 2017
@mollymerp
Contributor

TIL what diacritic means

@incanus
Contributor

incanus commented Feb 8, 2017

Everyone's a diacritic.

@1ec5
Contributor

1ec5 commented May 18, 2017

Over in #4715 (comment), we’re discussing a potential syntax for equality comparisons in expressions. Although this issue talks about filters, getting case and diacritic folding into expressions would address some of the main use cases for this functionality (for bilingual labeling in particular, but not for highlighting “fuzzy-matched” search results).

#4715 (comment) asks whether we should incorporate case and diacritic folding into the proposal as some sort of modifier on equality comparisons, or whether we should rely on the style author to run each operand through functions like lowercase or strip-diacritics beforehand.

Is a case- and diacritic-folded equality comparison equivalent to an equality comparison between two normalized strings, or are there aspects to folding that require both original strings as context? To put it another way, are there strings that should be considered equal after case- and diacritic-folding but would differ after each string is run through lowercase and strip-diacritics?

/cc @anandthakker @kkaefer @apendleton @jcsg

@1ec5
Contributor

1ec5 commented May 18, 2017

> are there strings that should be considered equal after case- and diacritic-folding but would differ after each string is run through lowercase and strip-diacritics?

In the absence of a fix for #3999, there are some problem areas caused by a lack of locale information. Should strip-diacritics add a tittle to ı (found in Turkish) in order for strip-diacritics("Kırşehir") == "Kirsehir"? Should lowercase("GROSSER STERN") == "großer stern"? The equality operator itself could be tolerant of differences in case or diacritics in these cases.
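To make the two questions above concrete, here is a minimal JavaScript sketch that uses only built-in String methods. It assumes no locale is supplied, which is the situation this comment describes; it illustrates the problem and is not a proposal for how lowercase or strip-diacritics should behave.

```javascript
// Lowercasing both operands does not make "GROSSER STERN" match "großer stern":
// toLowerCase() maps "SS" to "ss", never to "ß".
const lowered = 'GROSSER STERN'.toLowerCase();
const lowercaseMatches = lowered === 'großer stern';

// Uppercasing goes the other way: "ß" expands to "SS".
const uppercaseMatches = 'großer stern'.toUpperCase() === 'GROSSER STERN';

// The Turkish dotless ı (U+0131) has no Unicode decomposition, so
// normalization leaves it untouched and it never becomes an ASCII "i".
const dotlessIUnchanged = 'ı'.normalize('NFKD') === 'ı';

console.log({ lowercaseMatches, uppercaseMatches, dotlessIUnchanged });
// logs { lowercaseMatches: false, uppercaseMatches: true, dotlessIUnchanged: true }
```

So normalizing each operand separately gives different answers depending on which direction you fold, which is one argument for letting the comparison itself handle the folding.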

@jcsg

jcsg commented May 19, 2017

There is a human- and machine-readable file published by Unicode here that details all the special casing operations that we could catch. They explicitly use two-letter language codes for some of these rules, so locale specific information that uses those codes could transparently connect to the Unicode standard.

The rules listed there cover the cases that you discussed in the previous comment, and list special cases for ligatures (Latin and Armenian), for where there is no uppercase precomposed character (all Greek and Latin), conditional mappings that depend on position within a word (all Greek), and a category of "Language-Sensitive Mappings" which contain rules for characters in Lithuanian, Turkish, and Azeri (Azerbaijani).

The comprehensive list of all other "normal" casing conventions are in:
http://unicode.org/Public/UNIDATA/UnicodeData.txt
with a guide to understanding that file format here:
ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html

The standard UnicodeData.txt file also gives information on the character decomposition mapping. I need to do more research on these decomposition mappings, but a good starting point may be to try to associate any Latin-ish precomposed character that we'd want to de-diacritize with only the strict ASCII letters that we find in the character's decomposition mappings. (As far as I could tell, it wasn't clear that unidecode actually used Unicode standards for reference, although it also does de-diacritization.)

If we implement our own lowercasing and de-diacritization functions, I think we should do so according to the current Unicode standards themselves unless something prevents that.

cc: @boblannon

@boblannon

on geocoding, we've done away with unidecode altogether. we currently have a dedicated script, using manually curated rules written mostly by @apendleton: https://github.com/mapbox/carmen/blob/master/lib/util/remove-diacritics.js

also worth knowing, though: Javascript has character decomposition built in: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize

@boblannon

....so you could have

// Compatibly-decomposed (NFKD)

// U+0073: LATIN SMALL LETTER S
// U+0323: COMBINING DOT BELOW
// U+0307: COMBINING DOT ABOVE
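// Input assumed from the MDN example: U+1E9B (LATIN SMALL LETTER LONG S WITH DOT ABOVE) + U+0323
const str = '\u1E9B\u0323';
str.normalize('NFKD'); // '\u0073\u0323\u0307'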

and then a separate step that stripped out the combining dot characters.
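The two steps described above (decompose, then drop the combining characters) can be sketched as follows. The helper name `stripCombiningMarks` is hypothetical and is not part of GL JS or carmen; the sketch only removes marks in the Combining Diacritical Marks block (U+0300 to U+036F), which is an assumption about what "stripping" should cover.

```javascript
// Hypothetical helper: decompose with NFKD, then drop combining marks.
function stripCombiningMarks(text) {
    // NFKD splits precomposed characters into a base character plus combining marks.
    const decomposed = text.normalize('NFKD');
    // U+0300-U+036F is the Combining Diacritical Marks block; marks outside it are kept.
    return decomposed.replace(/[\u0300-\u036f]/g, '');
}

console.log(stripCombiningMarks('\u1E9B\u0323')); // 's'
console.log(stripCombiningMarks('Phở 54'));       // 'Pho 54'
console.log(stripCombiningMarks('Þingvellir'));   // 'Þingvellir'
```

Characters without a decomposition, such as Þ, Đ, or the Turkish ı, pass through unchanged, so this approach alone cannot map Þingvellir to Thingvellir.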

@kkaefer
Member

kkaefer commented May 19, 2017

Noting here that we're using https://bitbucket.org/alekseyt/nunicode in Mapbox GL for platforms that don't have Unicode APIs. It also supports diacritic folding ("Unaccenting"): https://bitbucket.org/alekseyt/nunicode#markdown-header-unaccenting as well as case folding.

@1ec5
Contributor

1ec5 commented May 19, 2017

The discussion above about the Unicode standard makes it clear that proper diacritic folding requires knowing which language we’re dealing with. Assuming we don’t, would we get more reliable results by making the equality operator optionally aware of diacritic folding than by composing equality with diacritic-stripping? Perhaps that would allow us multiple levels of canonicalization. For example:

["==", "Þingvellir", "Thingvellir", "diacritic-insensitive"]
["==", "Phơ 54", "Phở 54", "diacritic-insensitive"]

> Noting here that we're using https://bitbucket.org/alekseyt/nunicode in Mapbox GL for platforms that don't have Unicode APIs.

Nunicode is similar in concept to unidecode: it attempts to strip diacritics without accounting for language-specific rules.

Do we have a nunicode analogue in JavaScript? Could we use Intl.Collator to implement diacritic-insensitive equality without exposing a general-purpose diacritic-stripping function?

@1ec5 1ec5 changed the title Case and diacritic folding for filter operations Case and diacritic folding for expression operators Oct 26, 2017
@1ec5
Contributor

1ec5 commented Oct 26, 2017

> Could we use Intl.Collator to implement diacritic-insensitive equality without exposing a general-purpose diacritic-stripping function?

Case and diacritic folding can be implemented in JavaScript (for GL JS) like this:

new Intl.Collator(undefined, { sensitivity: 'base', usage: 'search' }).compare('ä', 'a') === 0

The same functionality can be implemented in Objective-C (for iOS and macOS) like this in string_nsstring.mm:

[@"ä" compare:@"a" options:NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch range:NSMakeRange(0, @"ä".length) locale:nil] == NSOrderedSame;

and in Java (for Android) like this:

Collator collator = Collator.getInstance();
collator.setStrength(Collator.PRIMARY);
collator.equals("ä", "a");

@kkaefer
Member

kkaefer commented Nov 9, 2017

> The discussion above about the Unicode standard makes it clear that proper diacritic folding requires knowing which language we’re dealing with.

This is still a valid concern, e.g. in German, diacritic folding for umlauts typically means adding an e after the vowel (München -> Muenchen), while it doesn't in Swedish (Malmö -> Malmo).
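A toy sketch of what language-dependent folding implies for a stripping function. The rule tables and the `foldForLanguage` helper are hypothetical, cover lowercase letters only, and are far from complete; they exist only to show that the same character needs a different replacement per language.

```javascript
// Hypothetical per-language folding tables; lowercase only and incomplete.
const FOLDING_RULES = {
    de: { 'ä': 'ae', 'ö': 'oe', 'ü': 'ue' },
    sv: { 'ä': 'a', 'ö': 'o', 'å': 'a' }
};

function foldForLanguage(text, language) {
    const rules = FOLDING_RULES[language] || {};
    // Compose first so that "u" + combining diaeresis is seen as a single "ü".
    const composed = text.normalize('NFC');
    // Replace each character that has a rule; leave everything else as-is.
    return Array.from(composed, (ch) => rules[ch] || ch).join('');
}

console.log(foldForLanguage('München', 'de')); // 'Muenchen'
console.log(foldForLanguage('Malmö', 'sv'));   // 'Malmo'
```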

@1ec5
Contributor

1ec5 commented Nov 10, 2017

Agreed. The assumption in #4136 (comment) is that the strings in question are in the same language as the browser/system. But that’s a difficult assumption to make across a world map.

Supposing we implement one of the proposals in #3999 (comment) for telling GL a given source property’s language, I think it would still be desirable to focus on enhancing the equality operator rather than composing it with a diacritic-stripping operator. (In the code examples above, all three APIs accept a language identifier.) After all, the vector tiles aren’t the only places that an expression might get its strings from; strings could also be embedded literally or come from elsewhere in runtime styling code.
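For the JavaScript case, passing a language identifier to the comparison could look like the sketch below. The helper name `equalsIgnoringCaseAndDiacritics` is hypothetical; it only wraps the standard `Intl.Collator` constructor, whose first argument is the locale.

```javascript
// Hypothetical helper: equality that ignores case and diacritics for a given locale.
function equalsIgnoringCaseAndDiacritics(a, b, locale) {
    // sensitivity 'base' compares base letters only; the locale selects the collation rules.
    const collator = new Intl.Collator(locale, { sensitivity: 'base' });
    return collator.compare(a, b) === 0;
}

console.log(equalsIgnoringCaseAndDiacritics('ä', 'A', 'en'));           // true
console.log(equalsIgnoringCaseAndDiacritics('Phơ 54', 'Phở 54', 'en')); // true
console.log(equalsIgnoringCaseAndDiacritics('a', 'b', 'en'));           // false

// With Swedish rules, where "ä" is a letter in its own right, this comparison may
// not report equality; the result depends on the locale data available at runtime.
console.log(equalsIgnoringCaseAndDiacritics('ä', 'a', 'sv'));
```

The comparison never exposes a stripped string, so the language-specific decisions stay inside the collator.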

@ChrisLoer
Contributor

ChrisLoer commented Mar 1, 2018

I'm currently evaluating an implementation of diacritic-insensitive equality (but not diacritic stripping, or even diacritic-insensitive contains/begins-with/etc. -- will those be important?) that takes a locale as an argument (could default to current locale).

Before committing to a JS implementation, I did a survey of how we could implement this on all our supported platforms -- we want to avoid as much as possible any subtle differences that could cause map rendering to change from one platform to the next, and we also want to avoid expensive bundling of collation rules:

  • JS: Intl.Collator. We don't control the underlying implementation here, but it looks to me on quick inspection like both Chromium and Firefox are using ICU in their underlying implementations.
  • iOS/macOS: Diacritic-insensitive NSPredicate. I don't know for sure, but I suspect that under the hood the Apple implementation is based on ICU4C. Edit: @1ec5 already pointed out there's a simpler way to do this with the NSDiacriticInsensitiveSearch option of NSStringCompareOptions.
  • Android: java.text.Collator with SECONDARY strength. I believe the java.text implementation is meant to use the same logic as ICU4J, which should stay in sync with ICU4C.
  • Qt: ?? I don't see a way to do this with QCollator, @tmpsantos or @brunoabinader do you have ideas here? I think since we're already linking against a system provided ICU for BiDi, that might be the approach to follow here.
  • Other (linux/glfw/node/etc.): Statically linked ICU::Collator with SECONDARY strength. The reason for not using this as the default across gl-native is that it requires bundling collation data with the app.

@brunoabinader
Member

> Qt: ?? I don't see a way to do this with QCollator, @tmpsantos or @brunoabinader do you have ideas here? I think since we're already linking against a system provided ICU for BiDi, that might be the approach to follow here.

Unfortunately, Qt does not provide a mechanism for ignoring diacritics. Qt uses ICU internally e.g. for QCollator but does not expose it as part of its public APIs. Some workarounds are explored in https://stackoverflow.com/questions/14009522/how-to-remove-accents-diacritic-marks-from-a-string-in-qt but there is no silver bullet. One workaround is to add a custom lookup table for searching/replacing characters with their canonical/compatible versions.

@1ec5
Contributor

1ec5 commented Mar 21, 2018

I keep coming around to the idea that what we need most is locale-aware comparisons, almost orthogonal to a way to explicitly ignore diacritics when comparing. The latter would effectively be a generic diacritic-stripping operation that gives you a Boolean instead of the stripped string, and it would suffer from the same problems:

  • A lack of support on some platforms (Qt)
  • Incorrect behavior for languages with complex rules – for example, in Vietnamese, xoa ≠ xóa = xoá and Đ ≠ D+-
  • Greater verbosity compared to specifying a single collation locale

The upside is that you could use a diacritic-insensitive comparison in a language-agnostic stylesheet, but in some sense a language is always involved when comparing strings, even if it’s just the default C locale.

@ChrisLoer
Contributor

> I keep coming around to the idea that what we need most is locale-aware comparisons

That is the direction we're heading, see #6270 (comment) in the GL JS "Collator" PR.

@ChrisLoer
Contributor

Closing this as fixed with #6270, which adds support for diacritic-insensitive comparisons. We don't have support for a diacritic "stripping" transliterator, which is harder to define and hopefully not necessary for the most important use cases.
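A sketch of how the resulting comparison can be expressed and evaluated. The expression shape assumes the `collator` expression as documented in the Mapbox style specification (options `case-sensitive`, `diacritic-sensitive`, and `locale`). The helpers `toSensitivity` and `makeComparator` are hypothetical illustrations of how the two flags could map onto `Intl.Collator` sensitivity values; they are not code taken from #6270.

```javascript
// Style expression: compare the "name" property to a literal, ignoring case and diacritics.
const filter = [
    '==',
    ['get', 'name'],
    'Pho 54',
    ['collator', { 'case-sensitive': false, 'diacritic-sensitive': false, locale: 'en' }]
];

// Hypothetical mapping from the two flags to an Intl.Collator sensitivity value.
function toSensitivity(caseSensitive, diacriticSensitive) {
    if (caseSensitive) return diacriticSensitive ? 'variant' : 'case';
    return diacriticSensitive ? 'accent' : 'base';
}

// Hypothetical evaluator for the collator options object.
function makeComparator(options) {
    const sensitivity = toSensitivity(options['case-sensitive'], options['diacritic-sensitive']);
    const collator = new Intl.Collator(options.locale, { sensitivity });
    return (a, b) => collator.compare(a, b) === 0;
}

const isEqual = makeComparator(filter[3][1]);
console.log(isEqual('Phở 54', filter[2])); // true
```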
