Case and diacritic folding for expression operators #4136

lucaswoj · 2017-02-01T23:24:58Z

From @1ec5 on October 21, 2016 10:26

As discussed in mapbox/mapbox-gl-native#6781, it can often be necessary for a filter to perform a case-insensitive or diacritic-insensitive comparison. Variations on the existing operators should be added to support these options.

Here’s an example design (but by no means the best):

[
  {
    "operator": "in",
    "case-sensitive": false,
    "diacritic-sensitive": false
  },
  "key",
  "value 1",
  "value 2"
]

Implementing case-insensitive comparisons should be trivial on all the platforms supported by Mapbox GL. On the other hand, while there are fine options for diacritic folding on the native platforms, JavaScript would have to rely on a library for diacritic-insensitive comparisons.

/cc @incanus @lucaswoj @jfirebaugh

Copied from original issue: mapbox/mapbox-gl-style-spec#548

mollymerp · 2017-02-08T01:50:56Z

TIL what diacritic means

incanus · 2017-02-08T01:53:37Z

Everyone's a diacritic.

1ec5 · 2017-05-18T21:46:14Z

Over in #4715 (comment), we’re discussing a potential syntax for equality comparisons in expressions. Although this issue talks about filters, getting case and diacritic folding into expressions would address some of the main use cases for this functionality (for bilingual labeling in particular, but not for highlighting “fuzzy-matched” search results).

#4715 (comment) asks whether we should incorporate case and diacritic folding into the proposal as some sort of modifier on equality comparisons, or whether we should rely on the style author to run each operand through functions like lowercase or strip-diacritics beforehand.

Is a case- and diacritic-folded equality comparison equivalent to an equality comparison between two normalized strings, or are there aspects to folding that require both original strings as context? To put it another way, are there strings that should be considered equal after case- and diacritic-folding but would differ after each string is run through lowercase and strip-diacritics?

/cc @anandthakker @kkaefer @apendleton @jcsg

1ec5 · 2017-05-18T21:54:56Z

are there strings that should be considered equal after case- and diacritic-folding but would differ after each string is run through lowercase and strip-diacritics?

In the absence of a fix for #3999, there are some problem areas caused by a lack of locale information. Should strip-diacritics add a tittle to ı (found in Turkish) in order for strip-diacritics("Kırşehir") == "Kirsehir"? Should lowercase("GROSSER STERN") == "großer stern"? The equality operator itself could be tolerant of differences in case or diacritics in these cases.
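To make the two problem cases concrete, here is a minimal sketch using only JavaScript's built-in case mapping. The helper names lower and upper are made up for illustration, and the Turkish-aware call assumes the runtime ships Turkish locale data.

```javascript
// Default (locale-independent) Unicode case mappings.
function lower(s) { return s.toLowerCase(); }
function upper(s) { return s.toUpperCase(); }

// Uppercasing ß yields SS, so lowercasing the result cannot recover ß.
const roundTrip = lower(upper('großer stern')); // 'grosser stern'

// Without locale information, capital dotted İ (U+0130) lowercases to
// i followed by U+0307 COMBINING DOT ABOVE, which is two code units.
const dotted = lower('\u0130');

// Locale-aware lowercasing; the result depends on the runtime's locale data.
const turkish = 'KIRŞEHİR'.toLocaleLowerCase('tr');
```

Because the mapping is lossy in one direction, comparing lowercase(a) with lowercase(b) is not the same as asking whether a and b are the same word in a given language.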

jcsg · 2017-05-19T13:33:23Z

There is a human- and machine-readable file published by Unicode here that details all the special casing operations that we could catch. They explicitly use two-letter language codes for some of these rules, so locale specific information that uses those codes could transparently connect to the Unicode standard.

The rules listed there cover the cases that you discussed in the previous comment. They list special cases for ligatures (Latin and Armenian), for characters that have no precomposed uppercase form (all Greek and Latin), conditional mappings that depend on position within a word (all Greek), and a category of "Language-Sensitive Mappings" that contains rules for characters in Lithuanian, Turkish, and Azeri (Azerbaijani).

The comprehensive list of all other "normal" casing conventions is in:
http://unicode.org/Public/UNIDATA/UnicodeData.txt
with a guide to understanding that file format here:
ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html

The standard UnicodeData.txt file also gives information on the character decomposition mapping. I need to do more research on these decomposition mappings, but a good starting point may be to try to associate any Latin-ish precomposed character that we'd want to de-diacritize with only the strict ASCII letters that we find in the character's decomposition mappings. (As far as I could tell, it wasn't clear that unidecode actually used Unicode standards for reference, although it also does de-diacritization.)

If we implement our own lowercasing and de-diacritization functions, I think we should do so according to the current Unicode standards themselves unless something prevents that.

cc: @boblannon

boblannon · 2017-05-19T13:57:18Z

on geocoding, we've done away with unidecode altogether. we currently have a dedicated script, using manually curated rules written mostly by @apendleton: https://github.com/mapbox/carmen/blob/master/lib/util/remove-diacritics.js

also worth knowing, though: Javascript has character decomposition built in: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize

boblannon · 2017-05-19T13:58:25Z

....so you could have

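// Compatibility decomposition (NFKD)

// Input: U+1E9B LATIN SMALL LETTER LONG S WITH DOT ABOVE + U+0323 COMBINING DOT BELOW
const str = '\u1E9B\u0323';

// U+0073: LATIN SMALL LETTER S
// U+0323: COMBINING DOT BELOW
// U+0307: COMBINING DOT ABOVE
str.normalize('NFKD'); // '\u0073\u0323\u0307'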

and then a separate step that stripped out the combining dot characters.
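A minimal sketch of that two-step approach, assuming it is enough to drop the Combining Diacritical Marks block (U+0300 to U+036F). The function name stripCombiningMarks is hypothetical, and other combining blocks are deliberately ignored here.

```javascript
// Decompose, then remove combining marks in U+0300..U+036F.
// Note that NFKD also rewrites compatibility characters (for example
// ligatures), which may be more folding than a caller wants.
function stripCombiningMarks(s) {
  return s.normalize('NFKD').replace(/[\u0300-\u036f]/g, '');
}

const a = stripCombiningMarks('\u1E9B\u0323'); // 's'
const b = stripCombiningMarks('Malmö');        // 'Malmo'
// Dotless ı has no decomposition, so it is left alone.
const c = stripCombiningMarks('Kırşehir');     // 'Kırsehir'
```

This is language-agnostic by construction, so it inherits the locale problems discussed earlier in the thread.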

kkaefer · 2017-05-19T14:30:29Z

Noting here that we're using https://bitbucket.org/alekseyt/nunicode in Mapbox GL for platforms that don't have Unicode APIs. It also supports diacritic folding ("Unaccenting"): https://bitbucket.org/alekseyt/nunicode#markdown-header-unaccenting as well as case folding.

1ec5 · 2017-05-19T21:28:24Z

The discussion above about the Unicode standard makes it clear that proper diacritic folding requires knowing which language we’re dealing with. Assuming we don’t, would we get more reliable results by making the equality operator optionally aware of diacritic folding than by composing equality with diacritic-stripping? Perhaps that would allow us multiple levels of canonicalization. For example:

["==", "Þingvellir", "Thingvellir", "diacritic-insensitive"]
["==", "Phơ 54", "Phở 54", "diacritic-insensitive"]

Noting here that we're using https://bitbucket.org/alekseyt/nunicode in Mapbox GL for platforms that don't have Unicode APIs.

Nunicode is similar in concept to unidecode: it attempts to strip diacritics without accounting for language-specific rules.

Do we have a nunicode analogue in JavaScript? Could we use Intl.Collator to implement diacritic-insensitive equality without exposing a general-purpose diacritic-stripping function?

1ec5 · 2017-10-26T01:29:06Z

Could we use Intl.Collator to implement diacritic-insensitive equality without exposing a general-purpose diacritic-stripping function?

Case and diacritic folding can be implemented in JavaScript (for GL JS) like this:

new Intl.Collator(undefined, { sensitivity: 'base', usage: 'search' }).compare('ä', 'a') === 0

The same functionality can be implemented in Objective-C (for iOS and macOS) like this in string_nsstring.mm:

[@"ä" compare:@"a" options:NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch range:NSMakeRange(0, @"ä".length) locale:nil] == NSOrderedSame;

and in Java (for Android) like this:

Collator collator = Collator.getInstance();
collator.setStrength(Collator.PRIMARY);
collator.equals("ä", "a");

kkaefer · 2017-11-09T16:41:28Z

The discussion above about the Unicode standard makes it clear that proper diacritic folding requires knowing which language we’re dealing with.

This is still a valid concern, e.g. in German, diacritic folding for umlauts typically means adding an e after the vowel (München -> Muenchen), while it doesn't in Swedish (Malmö -> Malmo).
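As a rough illustration of why a single stripping function cannot serve both languages, here is a sketch with a per-language override table. The table LANGUAGE_RULES and the function foldDiacritics are hypothetical and cover only the umlaut case mentioned above.

```javascript
// Hypothetical per-language overrides applied before generic stripping.
const LANGUAGE_RULES = {
  de: { 'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue' },
};

function foldDiacritics(text, language) {
  const rules = LANGUAGE_RULES[language] || {};
  let out = '';
  // Compose first so that precomposed keys in the table can match.
  for (const ch of text.normalize('NFC')) {
    out += Object.prototype.hasOwnProperty.call(rules, ch) ? rules[ch] : ch;
  }
  // Fall back to generic stripping for everything else.
  return out.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

const german = foldDiacritics('München', 'de');  // 'Muenchen'
const swedish = foldDiacritics('Malmö', 'sv');   // 'Malmo'
```

The same input character produces different output depending on the language argument, which is exactly the information a language-agnostic style lacks.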

1ec5 · 2017-11-10T00:06:09Z

Agreed. The assumption in #4136 (comment) is that the strings in question are in the same language as the browser/system. But that’s a difficult assumption to make across a world map.

Supposing we implement one of the proposals in #3999 (comment) for telling GL a given source property’s language, I think it would still be desirable to focus on enhancing the equality operator rather than composing it with a diacritic-stripping operator. (In the code examples above, all three APIs accept a language identifier.) After all, the vector tiles aren’t the only places that an expression might get its strings from; strings could also be embedded literally or come from elsewhere in runtime styling code.
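For the JavaScript side, passing a language identifier could look like the sketch below. The function name foldedEquals is made up for illustration, and results for locales other than English depend on the collation data the runtime ships.

```javascript
// Case- and diacritic-insensitive equality for a given locale.
// `locale` is a BCP 47 tag; undefined falls back to the runtime default.
function foldedEquals(a, b, locale) {
  const collator = new Intl.Collator(locale, { sensitivity: 'base' });
  return collator.compare(a, b) === 0;
}

const sameLetter = foldedEquals('ä', 'a', 'en'); // true
const sameCase = foldedEquals('A', 'a', 'en');   // true
const different = foldedEquals('a', 'b', 'en');  // false
```

Calling the same function with a different locale (for example a Swedish tag) may give a different answer for 'ä' versus 'a', since the collator applies that language's own notion of a base letter.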

ChrisLoer · 2018-03-01T23:31:46Z

I'm currently evaluating an implementation of diacritic-insensitive equality (but not diacritic stripping, or even diacritic-insensitive contains/begins-with/etc. -- will those be important?) that takes a locale as an argument (could default to current locale).

Before committing to a JS implementation, I did a survey of how we could implement this on all our supported platforms -- we want to avoid as much as possible any subtle differences that could cause map rendering to change from one platform to the next, and we also want to avoid expensive bundling of collation rules:

JS: Intl.Collator. We don't control the underlying implementation here, but it looks to me on quick inspection like both Chromium and Firefox are using ICU in their underlying implementations.
iOS/macOS: Diacritic-insensitive NSPredicate. I don't know for sure, but I suspect that under the hood the Apple implementation is based on ICU4C. Edit: @1ec5 already pointed out there's a simpler way to do this with the NSDiacriticInsensitiveSearch option of NSStringCompareOptions.
Android: java.text.Collator with PRIMARY strength. I believe the java.text implementation is meant to use the same logic as ICU4J, which should stay in sync with ICU4C.
Qt: ?? I don't see a way to do this with QCollator, @tmpsantos or @brunoabinader do you have ideas here? I think since we're already linking against a system provided ICU for BiDi, that might be the approach to follow here.
Other (linux/glfw/node/etc.): Statically linked ICU::Collator with PRIMARY strength. The reason for not using this as the default across gl-native is that it requires bundling collation data with the app.

brunoabinader · 2018-03-12T12:19:21Z

Qt: ?? I don't see a way to do this with QCollator, @tmpsantos or @brunoabinader do you have ideas here? I think since we're already linking against a system provided ICU for BiDi, that might be the approach to follow here.

Unfortunately, Qt does not provide a mechanism for ignoring diacritics. Qt uses ICU internally e.g. for QCollator but does not expose it as part of its public APIs. Some workarounds are explored in https://stackoverflow.com/questions/14009522/how-to-remove-accents-diacritic-marks-from-a-string-in-qt but there is no silver bullet. One workaround is to add a custom lookup table for searching/replacing characters with their canonical/compatible versions.

1ec5 · 2018-03-21T06:08:50Z

I keep coming around to the idea that what we need most is locale-aware comparisons, almost orthogonal to a way to explicitly ignore diacritics when comparing. The latter would effectively be a generic diacritic-stripping operation that gives you a Boolean instead of the stripped string, and it would suffer from the same problems:

A lack of support on some platforms (Qt)
Incorrect behavior for languages with complex rules – for example, in Vietnamese, xoa ≠ xóa = xoá and Đ ≠ D+-
Greater verbosity compared to specifying a single collation locale
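The Vietnamese examples can be checked with plain Unicode normalization. This is only a sketch; stripMarks is a hypothetical helper that removes marks in U+0300 to U+036F after NFD decomposition.

```javascript
function stripMarks(s) {
  return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

const oldStyle = 'x\u00F3a'; // xóa, tone mark on the o
const newStyle = 'xo\u00E1'; // xoá, tone mark on the a

// Normalization alone does not make the two spellings equal.
const equalAfterNfc = oldStyle.normalize('NFC') === newStyle.normalize('NFC'); // false

// Generic stripping makes them equal, but also equal to the different word "xoa".
const conflated = stripMarks(oldStyle) === 'xoa' && stripMarks(newStyle) === 'xoa'; // true

// Đ (U+0110) has no decomposition, so stripping leaves it unchanged.
const dStrokeKept = stripMarks('\u0110') === '\u0110'; // true
```

So a generic strip-then-compare is both too lenient (xoa matches xóa) and unable to express the one equivalence that is wanted (xóa with xoá) without also accepting the wrong one.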

The upside is that you could use a diacritic-insensitive comparison in a language-agnostic stylesheet, but in some sense a language is always involved when comparing strings, even if it’s just the default C locale.

ChrisLoer · 2018-03-23T22:48:23Z

I keep coming around to the idea that what we need most is locale-aware comparisons

That is the direction we're heading, see #6270 (comment) in the GL JS "Collator" PR.

ChrisLoer · 2018-04-16T18:10:19Z

Closing this as fixed with #6270, which adds support for diacritic-insensitive comparisons. We don't have support for a diacritic "stripping" transliterator, which is harder to define and hopefully not necessary for the most important use cases.
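For readers arriving here later, the sketch below shows one way two boolean options could map onto Intl.Collator's sensitivity setting, followed by what a comparison using a collator might look like in a style. The option names and expression shape reflect my understanding of the collator expression and should be checked against the style specification; sensitivityFor and makeEquals are hypothetical names, not the GL JS implementation.

```javascript
// Map two booleans onto the four Intl.Collator sensitivity levels.
function sensitivityFor(caseSensitive, diacriticSensitive) {
  if (caseSensitive) return diacriticSensitive ? 'variant' : 'case';
  return diacriticSensitive ? 'accent' : 'base';
}

// Build an equality function from collator-style options.
function makeEquals(options) {
  const collator = new Intl.Collator(options.locale, {
    sensitivity: sensitivityFor(options.caseSensitive, options.diacriticSensitive),
  });
  return (a, b) => collator.compare(a, b) === 0;
}

const ignoreAccents = makeEquals({ locale: 'en', caseSensitive: true, diacriticSensitive: false });
const ignoreCase = makeEquals({ locale: 'en', caseSensitive: false, diacriticSensitive: true });

// Assumed expression shape: an equality comparison with a collator argument.
const filter = ['==', ['get', 'name'], 'Malmo',
  ['collator', { 'case-sensitive': false, 'diacritic-sensitive': false }]];
```

Keeping the options on the comparison, rather than on a separate stripping operator, matches the direction the thread settled on.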

lucaswoj mentioned this issue Feb 1, 2017

Case and diacritic folding for filter operations mapbox/mapbox-gl-style-spec#548

Closed

1ec5 added the cross-platform 📺 Requires coordination with Mapbox GL Native (style specification, rendering tests, etc.) label Feb 3, 2017

kkaefer mentioned this issue May 17, 2017

[for discussion] Draft proposal - arbitrary expressions for style functions #4715

Closed

6 tasks

jfirebaugh added the feature 🍏 label Aug 1, 2017

1ec5 changed the title ~~Case and diacritic folding for filter operations~~ Case and diacritic folding for expression operators Oct 26, 2017

1ec5 mentioned this issue Feb 20, 2018

Add expression operators for locale matching (system languages) #6197

Open

ChrisLoer mentioned this issue Mar 2, 2018

'Collator' expression for controlling case and diacritic sensitivity in string comparisons #6270

Merged

anandthakker mentioned this issue Apr 9, 2018

Master ticket tracking expression API completeness #6484

Open

16 tasks

ChrisLoer closed this as completed Apr 16, 2018

ChrisLoer mentioned this issue Apr 16, 2018

Port 'Collator' string comparison support to native mapbox/mapbox-gl-native#11692

Closed

1ec5 mentioned this issue Apr 26, 2018

Many NSPredicate and NSExpression operators have no equivalent in style JSON mapbox/mapbox-gl-native#11786

Open

72 tasks
