Case and diacritic folding for expression operators #4136
Comments
TIL what diacritic means |
Everyone's a diacritic. |
Over in #4715 (comment), we’re discussing a potential syntax for equality comparisons in expressions. Although this issue talks about filters, getting case and diacritic folding into expressions would address some of the main use cases for this functionality (for bilingual labeling in particular, but not for highlighting “fuzzy-matched” search results). #4715 (comment) asks whether we should incorporate case and diacritic folding into the proposal as some sort of modifier on equality comparisons, or whether we should rely on the style author to run each operand through functions like Is a case- and diacritic-folded equality comparison equivalent to an equality comparison between two normalized strings, or are there aspects to folding that require both original strings as context? To put it another way, are there strings that should be considered equal after case- and diacritic-folding but would differ after each string is run through |
In the absence of a fix for #3999, there are some problem areas caused by a lack of locale information. Should |
There is a human- and machine-readable file published by Unicode here that details all the special casing operations that we could catch. They explicitly use two-letter language codes for some of these rules, so locale-specific information that uses those codes could transparently connect to the Unicode standard. The rules listed there cover the cases that you discussed in the previous comment, and list special cases for ligatures (Latin and Armenian), for characters with no precomposed uppercase form (all Greek and Latin), conditional mappings that depend on position within a word (all Greek), and a category of "Language-Sensitive Mappings", which contains rules for characters in Lithuanian, Turkish, and Azeri (Azerbaijani). The comprehensive list of all other "normal" casing conventions is in: The standard If we implement our own lowercasing and de-diacritization functions, I think we should do so according to the current Unicode standards themselves unless something prevents that. cc: @boblannon |
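As a rough illustration of the categories listed in that file, JavaScript's built-in casing methods already apply several of these mappings. This is only a sketch, not a proposal for how GL should implement casing; the `samples` name is made up for the example, and the Turkish result assumes the runtime was built with ICU support.

```javascript
// Illustrative only: `samples` is a hypothetical name, not part of any Mapbox API.
const samples = {
  // U+00DF has no single-character uppercase mapping here, so it expands to two letters.
  sharpS: '\u00DF'.toUpperCase(),
  // The Latin ligature U+FB01 expands to two letters when uppercased.
  ligature: '\uFB01'.toUpperCase(),
  // Position-dependent mapping: the word-final capital sigma in "ΟΔΟΣ" lowercases to U+03C2.
  finalSigma: '\u039F\u0394\u039F\u03A3'.toLowerCase(),
  // Language-sensitive mapping: Turkish uppercases "i" to the dotted capital U+0130.
  turkishUpper: 'i'.toLocaleUpperCase('tr'),
  // Without a locale, the same letter maps to plain "I".
  defaultUpper: 'i'.toUpperCase(),
};

console.log(samples);
```

The locale-free methods cover the unconditional and positional rules, but the language-sensitive ones only apply when a language tag is supplied, which is the gap the lack of locale information leaves open.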
On geocoding, we've done away with unidecode altogether. We currently have a dedicated script, using manually curated rules written mostly by @apendleton: https://github.com/mapbox/carmen/blob/master/lib/util/remove-diacritics.js Also worth knowing, though: JavaScript has character decomposition built in: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize |
....so you could have
and then a separate step that stripped out the combining dot characters. |
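A minimal sketch of the two-step approach described here: decompose with `String.prototype.normalize`, then strip the combining marks. The helper name `stripCombiningMarks` is hypothetical, and limiting the strip to the Combining Diacritical Marks block (U+0300 to U+036F) is an assumption made to keep the example conservative.

```javascript
// Hypothetical helper: decompose to NFD, then drop characters from the
// Combining Diacritical Marks block (U+0300 to U+036F).
function stripCombiningMarks(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// U+0130 decomposes to "I" plus a combining dot above, which is then removed.
console.log(stripCombiningMarks('\u0130stanbul')); // "Istanbul"
console.log(stripCombiningMarks('Z\u00FCrich')); // "Zurich"
// U+00F8 has no canonical decomposition, so it passes through unchanged.
console.log(stripCombiningMarks('Troms\u00F8')); // "Tromsø"
```

This is language-agnostic stripping, so it has the same locale blind spots as the other approaches in this thread, and letters that do not decompose are left untouched.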
Noting here that we're using https://bitbucket.org/alekseyt/nunicode in Mapbox GL for platforms that don't have Unicode APIs. It also supports diacritic folding ("Unaccenting"): https://bitbucket.org/alekseyt/nunicode#markdown-header-unaccenting as well as case folding. |
The discussion above about the Unicode standard makes it clear that proper diacritic folding requires knowing which language we’re dealing with. Assuming we don’t, would we get more reliable results by making the equality operator optionally aware of diacritic folding than by composing equality with diacritic-stripping? Perhaps that would allow us multiple levels of canonicalization. For example:
Nunicode is similar in concept to unidecode: it attempts to strip diacritics without accounting for language-specific rules. Do we have a nunicode analogue in JavaScript? Could we use |
Case and diacritic folding can be implemented in JavaScript (for GL JS) like this:
new Intl.Collator(undefined, { sensitivity: 'base', usage: 'search' }).compare('ä', 'a') === 0
The same functionality can be implemented in Objective-C (for iOS and macOS) like this in string_nsstring.mm:
[@"ä" compare:@"a" options:NSCaseInsensitiveSearch | NSDiacriticInsensitiveSearch range:NSMakeRange(0, @"ä".length) locale:nil] == NSOrderedSame;
and in Java (for Android) like this:
Collator collator = Collator.getInstance();
collator.setStrength(Collator.PRIMARY);
collator.equals("ä", "a"); |
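The `sensitivity` option in the JavaScript snippet above has four values, which is one way to get the graded levels of canonicalization raised earlier in the thread. The sketch below compares them side by side; `equalAt` is a hypothetical helper, and the `'en'` locale is an arbitrary choice to keep the demo deterministic.

```javascript
// Hypothetical helper: true when the two strings compare as equal
// at the given Intl.Collator sensitivity level.
function equalAt(sensitivity, a, b) {
  return new Intl.Collator('en', { sensitivity }).compare(a, b) === 0;
}

// Print, for each level, whether a diacritic difference and a case
// difference are ignored.
for (const level of ['base', 'accent', 'case', 'variant']) {
  console.log(level, {
    ignoresDiacritics: equalAt(level, 'a', '\u00E4'),
    ignoresCase: equalAt(level, 'a', 'A'),
  });
}
```

In this sketch `base` ignores both differences, `accent` ignores only case, `case` ignores only diacritics, and `variant` ignores neither.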
This is still a valid concern, e.g. in German, diacritic folding for umlauts typically means adding an |
Agreed. The assumption in #4136 (comment) is that the strings in question are in the same language as the browser/system. But that’s a difficult assumption to make across a world map. Supposing we implement one of the proposals in #3999 (comment) for telling GL a given source property’s language, I think it would still be desirable to focus on enhancing the equality operator rather than composing it with a diacritic-stripping operator. (In the code examples above, all three APIs accept a language identifier.) After all, the vector tiles aren’t the only places that an expression might get its strings from; strings could also be embedded literally or come from elsewhere in runtime styling code. |
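To make the locale dependence concrete, here is a sketch of an equality check that takes a language identifier. `foldedEquals` is a hypothetical name, not the operator being proposed, and the results depend on the collation data the runtime ships. With full ICU data, German collation treats "ä" as a variant of "a", while Swedish treats it as a separate base letter.

```javascript
// Hypothetical helper: case- and diacritic-insensitive equality under the
// collation rules of the given locale (a BCP 47 language tag).
function foldedEquals(a, b, locale) {
  const collator = new Intl.Collator(locale, { sensitivity: 'base' });
  return collator.compare(a, b) === 0;
}

// The same pair of strings can compare differently depending on the language.
console.log('de:', foldedEquals('\u00E4', 'a', 'de'));
console.log('sv:', foldedEquals('\u00E4', 'a', 'sv'));
```

Because the answer changes with the language tag, the comparison has to know the language of the strings rather than assume the browser or system locale.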
I'm currently evaluating an implementation of diacritic-insensitive equality (but not diacritic stripping, or even diacritic-insensitive contains/begins-with/etc. -- will those be important?) that takes a locale as an argument (could default to current locale). Before committing to a JS implementation, I did a survey of how we could implement this on all our supported platforms -- we want to avoid as much as possible any subtle differences that could cause map rendering to change from one platform to the next, and we also want to avoid expensive bundling of collation rules:
|
Unfortunately, Qt does not provide a mechanism for ignoring diacritics. Qt uses ICU internally e.g. for |
I keep coming around to the idea that what we need most is locale-aware comparisons, almost orthogonal to a way to explicitly ignore diacritics when comparing. The latter would effectively be a generic diacritic-stripping operation that gives you a Boolean instead of the stripped string, and it would suffer from the same problems:
The upside is that you could use a diacritic-insensitive comparison in a language-agnostic stylesheet, but in some sense a language is always involved when comparing strings, even if it’s just the default C locale. |
That is the direction we're heading, see #6270 (comment) in the GL JS "Collator" PR. |
Closing this as fixed with #6270, which adds support for diacritic-insensitive comparisons. We don't have support for a diacritic "stripping" transliterator, which is harder to define and hopefully not necessary for the most important use cases. |
From @1ec5 on October 21, 2016 10:26
As discussed in mapbox/mapbox-gl-native#6781, it can often be necessary for a filter to perform a case-insensitive or diacritic-insensitive comparison. Variations on the existing operators should be added to support these options.
Here’s an example design (but by no means the best):
Implementing case-insensitive comparisons should be trivial on all the platforms supported by Mapbox GL. On the other hand, while there are fine options for diacritic folding on the native platforms, JavaScript would have to rely on a library for diacritic-insensitive comparisons.
/cc @incanus @lucaswoj @jfirebaugh
Copied from original issue: mapbox/mapbox-gl-style-spec#548