-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
'Collator' expression for controlling case and diacritic sensitivity in string comparisons #6270
Conversation
Hmm, the expression test failure points out how easy it will be to shoot yourself in the foot authoring styles that depend on locally installed locales (I wrote the test with the "de" locale present on my machine, but apparently it's not installed on our CI image). Browser support may also be an issue (https://caniuse.com/#search=Intl.Collator), although it looks pretty widespread, and I think it's probably reasonable to just fall back to plain equality when we don't have browser support. |
If we were to provide a more comprehensive set of collation-related expression operators:
|
I think if we had a
We could make that an optional third argument to the equality/comparison operators (I'm not sure how awkward it would be to tack on an optional argument there?). If we had that,
|
df1b1a3
to
d8083ea
Compare
@@ -1,5 +1,5 @@ | |||
{ | |||
"expression": ["diacritic-insensitive-equal", "de", ["get", "lhs"], ["get", "rhs"]], | |||
"expression": ["==", ["string",["get", "lhs"]], ["get", "rhs"], ["collator", true, false, "de"]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should consider accepting an single object literal for collator options, e.g. ["collator", { caseSensitive: false, diacriticSensitive: false, locale: 'de' }]
. It's not how any existing expressions work, but this is the first case where we really need a set of options, and a dictionary is much clearer than a positional arguments for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And while we're considering such a departure, we might as well ask: should we simply accept a collator options object directly as an optional fourth argument to collating expressions?
["==", ["string",["get", "lhs"]], ["get", "rhs"], { caseSensitive: false, diacriticSensitive: false, locale: 'de' }]
@ChrisLoer So excited to see explorations on this issue! Implementing collation parameters together as one set of options seems to make sense, since
From the js intl.collator doc, |
Ignoring punctuation is actually supported publicly by Qt, but not by iOS or macOS. On iOS and macOS, we could possibly roll our own implementation by stripping punctuation before comparing. |
d8083ea
to
5fa63a9
Compare
5fa63a9
to
0aaa4bf
Compare
OK, this branch currently has:
Most of the tests are written to use the default locale. If they're run on a system where the default locale isn't There's one test that runs against the I haven't added collation support to the old filters, but it should always be possible to simply author a filter with a new expression to get the functionality, right? Current style authors use Before merging this, I'd like to be at least relatively confident in the equivalent native behavior. Some issues:
|
62865d3
to
080a8fd
Compare
I think this is ready to go, pending review. The previously "default" tests now specify the "en" locale -- so the tests require "en" to be installed, but don't require it to be the default locale on the system. The "de" locale test is now hardwired to pass (without doing any comparisons) if "de" isn't installed. @1ec5 has convinced me to stick with BCP 47 language tags as the interface for defining a locale, with the idea that we'll implement a BCP 47 parser in native and then use as much as we can based on the platform we're running on (e.g. we'd parse the "script" tag but ignore it on ICU-based systems). @kkaefer thinks that where we're bundling ICU (linux, Qt), install size is not a major concern and we can get away with just including the entire resource bundle (so we can skip worrying about which locales to bundle). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see myself needing to look up docs every time I see ["collator", true, false, ...]
. Can we consider some alternatives with better code readability?
- Enums:
["collator", "case-sensitive" | "case-insensitive", "diacritic-sensitive" | "diacritic-insensitive", ...]"
- Named parameter syntax with objects:
["collator", {"case-sensitive": true, "diacritic-sensitive": false}]
- Named parameter syntax with s-exprs:
["collator", ["case-sensitive", true], ["diacritic-sensitive", false], ...]
The mapping to Intl.Collator
sensitivity
option values implemented here is:
case-sensitive | case-insensitive | |
---|---|---|
diacritic-sensitive | "variant" |
"accent" |
diacritic-insensitive | "case" |
"base" |
For the "variant" option, MDN docs state "other differences may also be taken into consideration." Do we know what those might be and whether we care about them?
At least for the Firefox implementation, "variant" maps to ICU's locale-dependent "tertiary strength": https://dxr.mozilla.org/mozilla-central/source/js/src/builtin/intl/Collator.cpp#342-343 ICU has a further "IDENTICAL" option which is based directly on unicode code point equality -- the impression I get is that the difference is likely to be in whether characters that "look the same" but have different code points get treated as equal. "IDENTICAL" isn't exposed by It is a little weird that a case-and-diacritic-sensitive collator equality will still be subtly different from "==" without a collator, but I think it makes sense to say "even if your collator is case/diacritic-sensitive, the fact that you're using it implies you want some form of locale-aware comparison".
Yeah, of those three suggestions I think my preferences are in the same order you wrote them. @anandthakker any thoughts on what fits in best with the rest of expressions? |
My personal preference is still for the syntax I proposed earlier:
|
@anandthakker and I discussed this in more detail today and decided on the syntax:
We went back and forth on whether we could avoid introducing a collator type/expression entirely and just pass the configuration object as an argument to operators that supported collators. The use-case that pushed us towards keeping an explicit collator type is wanting to support defining collators in "let" expressions. Because we think it's very possible collator expressions could themselves depend on something like There's a little bit of an open question on whether we should make |
080a8fd
to
9a7af92
Compare
@anandthakker I just added serialization support for the collator functionality (haven't yet switched to object notation). One thing that came up from bootstrapping the tests is that we should consider whether |
775093c
to
2e107ba
Compare
Updated to new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how this is looking!
I think the object syntax is the clearest of the alternatives we've considered, but it's true that having an object whose values are parseable expressions is a new precedent, so we should probably check a couple other moral compasses on this.
ignorePunctuation?: boolean, | ||
numeric?: boolean, | ||
caseFirst?: 'upper' | 'lower' | 'false' | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave a comment referencing facebook/flow#1270, which tracks adding these to flow's library definitions
if (!caseSensitive) return null; | ||
|
||
const diacriticSensitive = context.parse( | ||
options['diacriticSensitive'] === undefined ? false : options['diacriticSensitive'], 2, BooleanType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second argument to parse
here (and below for locale) should still be 1 -- it's the index into the array defining the current expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 maybe we should generalize index
argument (and ParsingContext#path
) to allow strings, so that we could do context.concat(1).parse(..., 'diacriticSensitive', BooleanType)
and our future better-formatted error messages would be able to point to the specific field in case of a parsing/typing error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like a good idea, but OK if I just correct the index to "1" for now?
} | ||
|
||
evaluate(ctx: EvaluationContext) { | ||
return new CollatorInstantiation(this.caseSensitive.evaluate(ctx), this.diacriticSensitive.evaluate(ctx), this.locale ? this.locale.evaluate(ctx) : null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether the cost of creating a new Collator
is such that we should cache these. Something we can come back to later, as long as current benchmarks are looking okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean the cost of new Intl.Collator
(which we do on each compare
)? It would be easy enough to move into the CollatorInstantiation
constructor (allowing all of the uses in one expression to share the same constructor). But doesn't seem like something we should bother doing unless we see it showing up as a hot spot...
@@ -58,6 +59,8 @@ class Literal implements Expression { | |||
return ["literal", this.value]; | |||
} else if (this.value instanceof Color) { | |||
return ["rgba"].concat(this.value.toArray()); | |||
} else if (this.value instanceof CollatorInstantiation) { | |||
return this.value.serialize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is here for the same reason as the Color
branch above, right? I.e., because constant folding can cause an expression tree to contain Literal
with value: Color
or value: CollatorInstantiation
?
It's pre-existing, but I think this is a bit of a wart, that Literal
can hold a value for which there's no actual literal, and thus has to know (or, in this case, the value itself has to know) how to serialize an expression that it wouldn't be responsible for parsing. @ChrisLoer would you mind adding a quick comment in each of these two branches just briefly noting that they're here to cover the constant-folding case?
caseFirst?: 'upper' | 'lower' | 'false' | ||
} | ||
|
||
export class CollatorInstantiation { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested renames:
declare class Collator
->declare class Intl$Collator
(this should be okay, since it's only being declared at top-level so that it can be referenced within theIntl
class declaration, right?)CollatorInstantiation
->Collator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a different note: values.js
will need to become aware of this class, e.g., for isValue
, typeOf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated values.js
, but I haven't actually figured out an expression test that would behave differently with the old isValue => false
and typeOf => ObjectType
behavior...
src/style-spec/reference/v8.json
Outdated
@@ -2193,6 +2193,15 @@ | |||
} | |||
} | |||
}, | |||
"collator": { | |||
"doc": "Returns a `collator` for use in locale dependent comparison operations. The `caseSensitive` and `diacriticSensitive` options default to `false`. The `locale` argument specifies the IETF language tag of the locale to use. If none is provided, the default locale is used. If the requested locale is not available, the `collator` will use a system-defined fallback locale. Use `resolved-locale` to test the results of locale fallback behavior.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: locale dependent
-> locale-dependent
?
e3a9583
to
8764adb
Compare
['collator', <case-sensitive>, <diacritic-sensitive>, (optional) <IETF language tag>] Add 'resolved-locale' to test if requested collation locale is available: ['resolved-locale', <collator>] '==','!=','<','<=,'>','>=' now take an optional 'collator' argument when they are used with 'string' arguments.
Support/test default behavior when case/diacriticSensitivity isn't specified.
- CollatorInstantiation -> Collator (and try to keep Intl.Collator in its own scope) - Collator holds onto its Intl.Collator now so it can be used across multiple compares - Add Collator to values.js - Fix parse indexes - Add/tidy comments
8764adb
to
acaf489
Compare
I'm a developer working on Firefox's Gecko engine and on the collator component of ICU4X. I have identified the data for search collations as having a disproportionately large contribution to binary size and, therefore, am exploring ways to make the data size smaller. The vast, vast majority of the usage of What kind of user-visible feature does mapbox-gl-js use the accent-insensitive comparison for? From the source code, it seems that not only are equal/not-equal filtering supported but also less-than/greater-than. The semantics of Also, the comments above talk only about case and diacritic-insensitivity. Does mapbox-gl-js rely on the specifics of |
I don't have anything to do with mapbox, but the same problem likely exists in maplibre (which is an open-source fork of mapbox, that I also use / contribute to). So take everything here with a grain of salt.
Just by number of code appearances, or by how often it is actually executed / ran?
I have a hard time understanding the technical differences between "sort" and "search" (or why (I can imagine problematic cases for
This PR added the "collator"-expression to the mapbox style expressions, which is an interpreted programming language used by developers who are using mapbox to visualize their maps. These expressions are typically evaluated for each map feature (say, a building / a point-of-interest, ..) and control their look / label. So how exactly this feature is used depends on each particular map which is designed/styled/created using mapbox. I could imagine that this collator feature is being used to compare feature names to that in some textbox or to highlight a specific building/country/.. by matching the name. The corresponding docs (for the mapbox style expressions) are at:
You can check the unit tests in this PR to see how it would be used in practice. I'd guess However, I also doubt that collators are ever used for sorting (as each expression runs for a single feature, so there isn't anything useful to sort). |
I helped @hsivonen with collecting the data.
By code appearances. In particular, response bodies for requests in httparchive (11.5m pages) that match the regexp
|
My inference from the evidence that I've seen is that it started with the notion that for various Latin-script languages collations happen to already provide a diacritic-insensitive equality definition such that some things that are technically diacritics but on a per-language basis aren't treated as such are not ignored so that e.g. o and ö in Finnish, Swedish, Icelandic and Turkish can be considered unequal even when comparing diacritic-insensitively. Therefore, if you want diacritic-insensitivity where some diacritics are not treated as diacritics in a language-dependent way, you can reuse collation data. Then it was observed that the concept could be extended to Arabic and Thai scripts in a way that doesn't reuse a sorting tailoring, so search collations become a separate thing with the script-level aspects going into a common search root. (Also, once sort and search got separated, some Latin-script languages got search behavior that diverges from sorting.)
In sorting, there are two layers of data: The root collation and, optionally, a language-specific tailoring overlay. In search, there are logically three layers of data: the root for sorting, a search root overlaid on that, and then, optionally, a language-specific tailoring. However, implementations only admit two layers, so for each language that's supposed to reuse its sort tailoring for searching, you instead carry a search tailoring that contains a merged copy of the search root and a copy of the sort tailoring for the language. The obvious solution of "allow three layers of data, then" would make ctrl/cmd-f perform worse in apps that use search collations for interactive incremental search (which motivates the data but isn't supported by the Web API). I don't think it's realistic for me to pursue that in the context of ICU4C, but depending on what I learn, I might pursue it in the context of ICU4X.
There are actual differences, but I'm trying to figure out if this mapbox-gl-js use case is meant to care about them. Depending on language, the differences for what becomes greater-than and what becomes less-than are so notable that
Thanks. So far, it sounds like it would be tolerable for Arabic-script and Thai matching to become stricter (less insensitive to certain marks) when the resolved locale is a Latin-script language with its language-specific rules for what diacritics are not ignored in diacritic-insensitive search. (Again, English and French are do not have such language-specific rules.) |
Potential implementation of a fix for #4136.
For "locale", we could:
It does feel odd to have this operator implemented separately from
==
,!=
,<
,<=
,>
, and>=
, (and alsofilter-*-in
) but:!
) gives us what we need for the specific use case of changing map behavior when two values are "near duplicates"An alternative approach might be to make
Equals
,lt
,gt
, etc. all take an optional "Collation" as an argument and then add expression support for defining a collation. One risk of this approach is that the more customizability we expose the harder it'll be to maintain consistent behavior across platforms (although we expect the underlying implementations to usually be based on some form of ICU)./cc @1ec5 @anandthakker @kkaefer @nickidlugash