Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'Collator' expression for controlling case and diacritic sensitivity in string comparisons #6270

Merged
merged 3 commits into from
Apr 16, 2018

Conversation

ChrisLoer
Copy link
Contributor

Potential implementation of a fix for #4136.

For "locale", we could:

  • Get rid of the argument, and always use the runtime's default locale. Advantage: easiest from the point of view of the expression author.
  • Allow a single optional string as an argument (currently implemented). Advantage: not too complicated, and allows basic localization (e.g. "let's use German collation rules for this map").
  • Allow array of locale strings as an argument. Advantage: allows definition of fallback rules when the set of locally supported locales isn't known ahead of time. Disadvantage: more cumbersome to specify.

It does feel odd to have this operator implemented separately from ==,!=,<,<=,>, and>=, (and also filter-*-in) but:

  • The equality operator (along with !) gives us what we need for the specific use case of changing map behavior when two values are "near duplicates"
  • Greater/less than operators probably (?) make more sense in the context of being able to specify general purpose collation rules.

An alternative approach might be to make Equals, lt, gt, etc. all take an optional "Collation" as an argument and then add expression support for defining a collation. One risk of this approach is that the more customizability we expose the harder it'll be to maintain consistent behavior across platforms (although we expect the underlying implementations to usually be based on some form of ICU).

/cc @1ec5 @anandthakker @kkaefer @nickidlugash

@ChrisLoer
Copy link
Contributor Author

Hmm, the expression test failure points out how easy it will be to shoot yourself in the foot authoring styles that depend on locally installed locales (I wrote the test with the "de" locale present on my machine, but apparently it's not installed on our CI image).

Browser support may also be an issue (https://caniuse.com/#search=Intl.Collator), although it looks pretty widespread, and I think it's probably reasonable to just fall back to plain equality when we don't have browser support.

@jfirebaugh
Copy link
Contributor

If we were to provide a more comprehensive set of collation-related expression operators:

  • What else would we likely want to provide?
  • Would diacritic-insensitive-equal extend neatly into a design that encompasses these additional features, or would it tend to become an API wart?

@ChrisLoer
Copy link
Contributor Author

I think if we had a Collator type that included:

  • Locale (string)
  • case-sensitivity (bool)
  • diacritic-sensitivity (bool)

We could make that an optional third argument to the equality/comparison operators (I'm not sure how awkward it would be to tack on an optional argument there?). If we had that, diacritic-insensitive-equal would just be a left-over wart, I think.

Intl.Collator exposes some features that could potentially be useful (like "numeric" sorting or controlling whether to sort lower or upper case first), although they wouldn't be relevant for equality checks. We could add them to our Collator in the future, if we knew a consistent way to implement them (I don't think there's a direct analog to those exposed by NSString compare).

@ChrisLoer ChrisLoer force-pushed the diacritic-equality branch 2 times, most recently from df1b1a3 to d8083ea Compare March 9, 2018 22:02
@@ -1,5 +1,5 @@
{
"expression": ["diacritic-insensitive-equal", "de", ["get", "lhs"], ["get", "rhs"]],
"expression": ["==", ["string",["get", "lhs"]], ["get", "rhs"], ["collator", true, false, "de"]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should consider accepting an single object literal for collator options, e.g. ["collator", { caseSensitive: false, diacriticSensitive: false, locale: 'de' }]. It's not how any existing expressions work, but this is the first case where we really need a set of options, and a dictionary is much clearer than a positional arguments for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And while we're considering such a departure, we might as well ask: should we simply accept a collator options object directly as an optional fourth argument to collating expressions?

["==", ["string",["get", "lhs"]], ["get", "rhs"], { caseSensitive: false, diacriticSensitive: false, locale: 'de' }]

@nickidlugash
Copy link

@ChrisLoer So excited to see explorations on this issue!

Implementing collation parameters together as one set of options seems to make sense, since
I think users will likely want to use case-insensitivity and diacritic-insensitivity at the same time (or at least, we'll want to in our core styles).

What else would we likely want to provide?

From the js intl.collator doc, ignorePunctuation also seems like it could be useful. I don't know how support for this varies across platforms.

@1ec5
Copy link
Contributor

1ec5 commented Mar 21, 2018

From the js intl.collator doc, ignorePunctuation also seems like it could be useful. I don't know how support for this varies across platforms.

Ignoring punctuation is actually supported publicly by Qt, but not by iOS or macOS. On iOS and macOS, we could possibly roll our own implementation by stripping punctuation before comparing.

@ChrisLoer ChrisLoer changed the title POC: Diacritic-insensitive equality POC: 'Collator' expression for controlling case and diacritic sensitivity in string comparisons Mar 21, 2018
@ChrisLoer ChrisLoer changed the base branch from renderable-string to master March 22, 2018 23:58
@ChrisLoer ChrisLoer changed the title POC: 'Collator' expression for controlling case and diacritic sensitivity in string comparisons 'Collator' expression for controlling case and diacritic sensitivity in string comparisons Mar 23, 2018
@ChrisLoer
Copy link
Contributor Author

OK, this branch currently has:

  • collator expression that can create a collator based on locale/case-sensitivity/diacritic-sensitivity
  • Support for collator in ==,!=,<,<=,>,>=
  • ["resolved-locale", collator] to inspect if you're getting the locale you're asking for

Most of the tests are written to use the default locale. If they're run on a system where the default locale isn't en they may break. 🤔 I could make all of them specify en and count on that at least being installed everywhere they run? Unfortunately hard to test the "default" behavior without making tests that are bound to behave differently on different platforms.

There's one test that runs against the de locale, and it uses resolved-locale to change what it expects based on whether de is installed. This is weird because it means that it tests something different on CI (where it's not installed) vs. locally on OS X (where it is installed). I think this is OK? Although right now it assumes that if you don't have de you must be using en, probably better to just make it pass no matter what if you don't have de.

I haven't added collation support to the old filters, but it should always be possible to simply author a filter with a new expression to get the functionality, right?

Current style authors use downcase/upcase to do case-insensitive comparisons. This is probably a preferable alternative, but I'm not sure if we need to do anything to push existing users towards this.

Before merging this, I'd like to be at least relatively confident in the equivalent native behavior. Some issues:

  • Uncertain support across platforms for ignorePunctuation seems to me like a good argument for holding off on that, at least for now.
  • Currently "locale" is specified as a "BCP 47 language tag" because that's what Intl.Collator takes, but I'm worried this allows over-specification relative to the platform support we'll end up with. If we can get away with it, it would be nice to restrict the locale to a ISO 639-1 Code, as a kind of "least common denominator". That would however rule out specifying regional differences within languages. I can imagine a future in which we generate "locale" objects and use them for formatters (where region would matter more) as well as collation, but I don't want to get too far ahead of myself.
  • On Qt, as far as I can tell we can do case-insensitive comparison or default-locale comparison but we can't combine them and we can't control the locale. Some (old) discussion of the issue.... We do have a build of Qt that bundles ICU for Arabic support, but bundling collation data as well would significantly increase the build size.
  • Where we end up depending on bundled ICU collation support (thankfully not iOS or Android), we'll have to choose what locales to bundle. Should we try to bundle a minimal set and leave it at that? Or expose it as a build time configuration option?

@ChrisLoer
Copy link
Contributor Author

I think this is ready to go, pending review.

The previously "default" tests now specify the "en" locale -- so the tests require "en" to be installed, but don't require it to be the default locale on the system. The "de" locale test is now hardwired to pass (without doing any comparisons) if "de" isn't installed.

@1ec5 has convinced me to stick with BCP 47 language tags as the interface for defining a locale, with the idea that we'll implement a BCP 47 parser in native and then use as much as we can based on the platform we're running on (e.g. we'd parse the "script" tag but ignore it on ICU-based systems).

@kkaefer thinks that where we're bundling ICU (linux, Qt), install size is not a major concern and we can get away with just including the entire resource bundle (so we can skip worrying about which locales to bundle).

Copy link
Contributor

@jfirebaugh jfirebaugh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see myself needing to look up docs every time I see ["collator", true, false, ...]. Can we consider some alternatives with better code readability?

  • Enums: ["collator", "case-sensitive" | "case-insensitive", "diacritic-sensitive" | "diacritic-insensitive", ...]"
  • Named parameter syntax with objects: ["collator", {"case-sensitive": true, "diacritic-sensitive": false}]
  • Named parameter syntax with s-exprs: ["collator", ["case-sensitive", true], ["diacritic-sensitive", false], ...]

The mapping to Intl.Collator sensitivity option values implemented here is:

case-sensitive case-insensitive
diacritic-sensitive "variant" "accent"
diacritic-insensitive "case" "base"

For the "variant" option, MDN docs state "other differences may also be taken into consideration." Do we know what those might be and whether we care about them?

@ChrisLoer
Copy link
Contributor Author

For the "variant" option, MDN docs state "other differences may also be taken into consideration." Do we know what those might be and whether we care about them?

At least for the Firefox implementation, "variant" maps to ICU's locale-dependent "tertiary strength":

https://dxr.mozilla.org/mozilla-central/source/js/src/builtin/intl/Collator.cpp#342-343

ICU has a further "IDENTICAL" option which is based directly on unicode code point equality -- the impression I get is that the difference is likely to be in whether characters that "look the same" but have different code points get treated as equal. "IDENTICAL" isn't exposed by Intl.Collator but it should be equivalent to plain-ol' javascript string comparison operators.

It is a little weird that a case-and-diacritic-sensitive collator equality will still be subtly different from "==" without a collator, but I think it makes sense to say "even if your collator is case/diacritic-sensitive, the fact that you're using it implies you want some form of locale-aware comparison".

Can we consider some alternatives with better code readability?

Yeah, of those three suggestions I think my preferences are in the same order you wrote them. @anandthakker any thoughts on what fits in best with the rest of expressions?

@anandthakker
Copy link
Contributor

@anandthakker any thoughts on what fits in best with the rest of expressions?

My personal preference is still for the syntax I proposed earlier:

simply accept a collator options object directly as an optional fourth argument to collating expressions:
["==", ["string",["get", "lhs"]], ["get", "rhs"], { caseSensitive: false, diacriticSensitive: false, locale: 'de' }]

@ChrisLoer
Copy link
Contributor Author

@anandthakker and I discussed this in more detail today and decided on the syntax:

['collator', { caseSensitive: <boolean>, diacriticSensitive: <boolean>, locale: <string> }]

We went back and forth on whether we could avoid introducing a collator type/expression entirely and just pass the configuration object as an argument to operators that supported collators. The use-case that pushed us towards keeping an explicit collator type is wanting to support defining collators in "let" expressions. Because we think it's very possible collator expressions could themselves depend on something like match-locales, we want to be able to use "let" to keep expressions that use collators from getting too verbose.

There's a little bit of an open question on whether we should make locale optional or require an explicit "default" parameter. The idea is that if locale is implicit, it's too easy for an expression author to think that case/diacritic-sensitivity doesn't depend on locale, and then be surprised when the behavior changes depending on user locale. On the other hand, it adds more verbosity and requires mixing a mapbox-specific enum with an otherwise BCP-47 argument. Also, Intl.Collator uses the convention of "unspecified locale => user default locale". So I could be convinced either way but right now my inclination is to keep it as an optional parameter.

@ChrisLoer
Copy link
Contributor Author

@anandthakker I just added serialization support for the collator functionality (haven't yet switched to object notation). One thing that came up from bootstrapping the tests is that we should consider whether collator is feature constant. The issue is that technically what the locale resolves to can change at any time. On those grounds I made it non-feature constant. But as I think about it more, that's probably enough of an edge case that we don't want to sacrifice the performance benefits of allowing constant folding?

@ChrisLoer ChrisLoer force-pushed the diacritic-equality branch 2 times, most recently from 775093c to 2e107ba Compare April 7, 2018 00:00
@ChrisLoer
Copy link
Contributor Author

Updated to new { "caseSensitive": true, "diacriticSensitive": false, "locale": "en" } syntax, with default values for all of them (case/diacritic insensitive in system locale). I removed the non-feature-constant flag for collator as well.

Copy link
Contributor

@anandthakker anandthakker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how this is looking!

I think the object syntax is the clearest of the alternatives we've considered, but it's true that having an object whose values are parseable expressions is a new precedent, so we should probably check a couple other moral compasses on this.

ignorePunctuation?: boolean,
numeric?: boolean,
caseFirst?: 'upper' | 'lower' | 'false'
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave a comment referencing facebook/flow#1270, which tracks adding these to flow's library definitions

if (!caseSensitive) return null;

const diacriticSensitive = context.parse(
options['diacriticSensitive'] === undefined ? false : options['diacriticSensitive'], 2, BooleanType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second argument to parse here (and below for locale) should still be 1 -- it's the index into the array defining the current expression.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 maybe we should generalize index argument (and ParsingContext#path) to allow strings, so that we could do context.concat(1).parse(..., 'diacriticSensitive', BooleanType) and our future better-formatted error messages would be able to point to the specific field in case of a parsing/typing error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a good idea, but OK if I just correct the index to "1" for now?

}

evaluate(ctx: EvaluationContext) {
return new CollatorInstantiation(this.caseSensitive.evaluate(ctx), this.diacriticSensitive.evaluate(ctx), this.locale ? this.locale.evaluate(ctx) : null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether the cost of creating a new Collator is such that we should cache these. Something we can come back to later, as long as current benchmarks are looking okay.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean the cost of new Intl.Collator (which we do on each compare)? It would be easy enough to move into the CollatorInstantiation constructor (allowing all of the uses in one expression to share the same constructor). But doesn't seem like something we should bother doing unless we see it showing up as a hot spot...

@@ -58,6 +59,8 @@ class Literal implements Expression {
return ["literal", this.value];
} else if (this.value instanceof Color) {
return ["rgba"].concat(this.value.toArray());
} else if (this.value instanceof CollatorInstantiation) {
return this.value.serialize();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is here for the same reason as the Color branch above, right? I.e., because constant folding can cause an expression tree to contain Literal with value: Color or value: CollatorInstantiation?

It's pre-existing, but I think this is a bit of a wart, that Literal can hold a value for which there's no actual literal, and thus has to know (or, in this case, the value itself has to know) how to serialize an expression that it wouldn't be responsible for parsing. @ChrisLoer would you mind adding a quick comment in each of these two branches just briefly noting that they're here to cover the constant-folding case?

caseFirst?: 'upper' | 'lower' | 'false'
}

export class CollatorInstantiation {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested renames:

  • declare class Collator -> declare class Intl$Collator (this should be okay, since it's only being declared at top-level so that it can be referenced within the Intl class declaration, right?)
  • CollatorInstantiation -> Collator

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a different note: values.js will need to become aware of this class, e.g., for isValue, typeOf

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated values.js, but I haven't actually figured out an expression test that would behave differently with the old isValue => false and typeOf => ObjectType behavior...

@@ -2193,6 +2193,15 @@
}
}
},
"collator": {
"doc": "Returns a `collator` for use in locale dependent comparison operations. The `caseSensitive` and `diacriticSensitive` options default to `false`. The `locale` argument specifies the IETF language tag of the locale to use. If none is provided, the default locale is used. If the requested locale is not available, the `collator` will use a system-defined fallback locale. Use `resolved-locale` to test the results of locale fallback behavior.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: locale dependent -> locale-dependent?

['collator',
  <case-sensitive>,
  <diacritic-sensitive>,
  (optional) <IETF language tag>]

Add 'resolved-locale' to test if requested collation locale is available:
['resolved-locale', <collator>]

'==','!=','<','<=,'>','>=' now take an optional 'collator' argument when they are used with 'string' arguments.
Support/test default behavior when case/diacriticSensitivity isn't specified.
- CollatorInstantiation -> Collator (and try to keep Intl.Collator in its own scope)
- Collator holds onto its Intl.Collator now so it can be used across multiple compares
- Add Collator to values.js
- Fix parse indexes
- Add/tidy comments
@hsivonen
Copy link

hsivonen commented Feb 17, 2023

I'm a developer working on Firefox's Gecko engine and on the collator component of ICU4X. I have identified the data for search collations as having a disproportionately large contribution to binary size and, therefore, am exploring ways to make the data size smaller.

The vast, vast majority of the usage of Intl.Collator with usage: "search" on the Web can be traced back to this PR, so understanding the use case here is relevant to understanding the impact of tweaks to the underlying data.

What kind of user-visible feature does mapbox-gl-js use the accent-insensitive comparison for? From the source code, it seems that not only are equal/not-equal filtering supported but also less-than/greater-than. The semantics of usage: "search" are such that the application is only supposed to pay attention to equal/not-equal. Less-than/greater-than is OK as an implementation detail e.g. in a data structure, but not OK as a user-visible ordering (usage: "sort" is appropriate for that). How does mapbox-gl-js use greater-than/less-than with usage: "search"?

Also, the comments above talk only about case and diacritic-insensitivity. Does mapbox-gl-js rely on the specifics of usage: "search" for scripts that don't have diacritics or case?

@ghost
Copy link

ghost commented Feb 20, 2023

I don't have anything to do with mapbox, but the same problem likely exists in maplibre (which is an open-source fork of mapbox, that I also use / contribute to).
I've also never used this feature, but I'm aware of its existence.

So take everything here with a grain of salt.

The vast, vast majority of the usage of Intl.Collator with usage: "search" on the Web can be traced back to this PR,

Just by number of code appearances, or by how often it is actually executed / ran?

I have identified the data for search collations as having a disproportionately large contribution to binary size

I have a hard time understanding the technical differences between "sort" and "search" (or why usage is even necessary). Why would "search" contribute more to binary size than "sort"?
Is this just a hint for performance reasons, or are there actual differences for the result?

(I can imagine problematic cases for >= and <=: shall these prefer sort-order or matching? But for < and > I don't see any ambiguity so I'm not sure what problem you perceive or why the usage matters)

What kind of user-visible feature does mapbox-gl-js use the accent-insensitive comparison for?
[...]
Does mapbox-gl-js rely on the specifics of usage: "search" for scripts that don't have diacritics or case?

This PR added the "collator"-expression to the mapbox style expressions, which is an interpreted programming language used by developers who are using mapbox to visualize their maps. These expressions are typically evaluated for each map feature (say, a building / a point-of-interest, ..) and control their look / label.

So how exactly this feature is used depends on each particular map which is designed/styled/created using mapbox.
The data available to the expressions (such as names of buildings) are also mostly controlled by the user through map data.

I could imagine that this collator feature is being used to compare feature names to that in some textbox or to highlight a specific building/country/.. by matching the name.

The corresponding docs (for the mapbox style expressions) are at:

You can check the unit tests in this PR to see how it would be used in practice.

I'd guess collator for comparisons other than "==" and "!=" were added for completeness, as it's hard to tell what might be of use for users (or how the language developers in the future).
Given the ambiguity I mentioned above, I assume it's an oversight that search collators are allowed for <= and >=.

However, I also doubt that collators are ever used for sorting (as each expression runs for a single feature, so there isn't anything useful to sort).
Currently there aren't loops or functions in mapbox style expressions, so the risk that someone is using these expressions for sorting items is slim (at best they'd probably sort a handful of strings).
But with the mapbox style expressions evolving over time, there might be future-cases where objects will be sorted using collators.

@zcorpan
Copy link

zcorpan commented Feb 20, 2023

I helped @hsivonen with collecting the data.

Just by number of code appearances, or by how often it is actually executed / ran?

By code appearances. In particular, response bodies for requests in httparchive (11.5m pages) that match the regexp \b(Intl\.Collator\([^\(\)]+\)) and where the matched group also matches

\busage[\'"]?\s*:\s*[\'"`]search\b

@hsivonen
Copy link

I have a hard time understanding the technical differences between "sort" and "search" (or why usage is even necessary).

My inference from the evidence that I've seen is that it started with the notion that for various Latin-script languages collations happen to already provide a diacritic-insensitive equality definition such that some things that are technically diacritics but on a per-language basis aren't treated as such are not ignored so that e.g. o and ö in Finnish, Swedish, Icelandic and Turkish can be considered unequal even when comparing diacritic-insensitively. Therefore, if you want diacritic-insensitivity where some diacritics are not treated as diacritics in a language-dependent way, you can reuse collation data.

Then it was observed that the concept could be extended to Arabic and Thai scripts in a way that doesn't reuse a sorting tailoring, so search collations become a separate thing with the script-level aspects going into a common search root. (Also, once sort and search got separated, some Latin-script languages got search behavior that diverges from sorting.)

Why would "search" contribute more to binary size than "sort"?

In sorting, there are two layers of data: The root collation and, optionally, a language-specific tailoring overlay. In search, there are logically three layers of data: the root for sorting, a search root overlaid on that, and then, optionally, a language-specific tailoring. However, implementations only admit two layers, so for each language that's supposed to reuse its sort tailoring for searching, you instead carry a search tailoring that contains a merged copy of the search root and a copy of the sort tailoring for the language.

The obvious solution of "allow three layers of data, then" would make ctrl/cmd-f perform worse in apps that use search collations for interactive incremental search (which motivates the data but isn't supported by the Web API). I don't think it's realistic for me to pursue that in the context of ICU4C, but depending on what I learn, I might pursue it in the context of ICU4X.

Is this just a hint for performance reasons, or are there actual differences for the result?

There are actual differences, but I'm trying to figure out if this mapbox-gl-js use case is meant to care about them. Depending on language, the differences for what becomes greater-than and what becomes less-than are so notable that usage: "search" is very inappropriate to use for user-visible sorting.

I could imagine that this collator feature is being used to compare feature names to that in some textbox or to highlight a specific building/country/.. by matching the name.
...
However, I also doubt that collators are ever used for sorting

Thanks. So far, it sounds like it would be tolerable for Arabic-script and Thai matching to become stricter (less insensitive to certain marks) when the resolved locale is a Latin-script language with its language-specific rules for what diacritics are not ignored in diacritic-insensitive search. (Again, English and French are do not have such language-specific rules.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cross-platform 📺 Requires coordination with Mapbox GL Native (style specification, rendering tests, etc.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants