Count surrogate pair as single character #779
String expression operators now count UTF-16 surrogate pairs as single characters instead of splitting them up into individual surrogates.
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files: coverage diff

| Metric | main | #779 | +/- |
| --- | --- | --- | --- |
| Coverage | 92.60% | 92.70% | +0.09% |
| Files | 105 | 105 | |
| Lines | 4638 | 4646 | +8 |
| Branches | 1306 | 1312 | +6 |
| Hits | 4295 | 4307 | +12 |
| Misses | 343 | 339 | -4 |
Kudos on all the tests you wrote!
@@ -2826,7 +2826,7 @@
 }
 },
 "index-of": {
- "doc": "Returns the first position at which an item can be found in an array or a substring can be found in a string, or `-1` if the input cannot be found. Accepts an optional index from where to begin the search.",
+ "doc": "Returns the first position at which an item can be found in an array or a substring can be found in a string, or `-1` if the input cannot be found. Accepts an optional index from where to begin the search. In a string, a UTF-16 surrogate pair counts as a single position.",
I worded this a bit vaguely to leave open the possibility of adding support for grapheme clusters in the future. There are several potential real-world use cases for supporting grapheme clusters on maps, for example:

- The Esperanto letter ĝ has no precomposed character, so it must be represented as a base letter and a combining diacritic (U+0067 U+0302).
- The Hangul syllable 각 may be decomposed as U+1100 U+1161 U+11A8. The best practice is to normalize it into a single precomposed character, U+AC01, but neither vector-tile-js nor maplibre-gl-js normalizes strings, and I’m unsure if any vector tile generator does either.
- Some emoji sequences like 🇺🇳 appear as a single character but are composed of multiple underlying characters (U+1F1FA U+1F1F3).

In TypeScript, we could support grapheme clusters using the `Intl.Segmenter` API, but Firefox only added support for it a few months ago, and I don’t know if it performs well enough for more common cases. On the native platforms, ICU has a similar API that might end up being the easiest solution for maplibre/maplibre-native#2730. I didn’t investigate it further, because we don’t support rendering grapheme clusters directly yet. However, @wipfli’s work on Indic text may create a need for it in the future.
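To make the grapheme cluster idea concrete, here is a minimal sketch of counting user-perceived characters with `Intl.Segmenter`. It assumes a runtime and TypeScript `lib` setting that include `Intl.Segmenter`; `graphemeCount` is a hypothetical helper name used only for illustration and is not part of this PR or the style spec.

```typescript
// Sketch only: `graphemeCount` is a hypothetical helper, not an API in this PR.
// Assumes Intl.Segmenter is available (e.g. the "es2022.intl" lib typings).
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

/** Counts grapheme clusters (user-perceived characters) in a string. */
function graphemeCount(s: string): number {
    return Array.from(graphemeSegmenter.segment(s)).length;
}

// Base letter plus combining circumflex: two code units, one cluster.
const combining = "g\u0302";
// Decomposed Hangul syllable (U+1100 U+1161 U+11A8): three code units, one cluster.
const decomposedHangul = "\u1100\u1161\u11A8";
// Flag sequence U+1F1FA U+1F1F3: four code units, two code points, one cluster.
const flag = "\uD83C\uDDFA\uD83C\uDDF3";

console.log(combining.length, graphemeCount(combining));
console.log(decomposedHangul.length, graphemeCount(decomposedHangul));
console.log(flag.length, graphemeCount(flag));

// Normalizing to NFC composes the Hangul jamo into the single character U+AC01.
console.log(decomposedHangul.normalize("NFC") === "\uAC01");
```

Note that counting surrogate pairs as single characters, as this PR does, would still report the flag as two positions; only grapheme segmentation reports it as one.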
CC: @louwers - I don't think this is a dramatic change, more in the realm of a bug fix, but we need to make sure this is OK with native.
The string expression operators `index-of`, `length`, and `slice` now count UTF-16 surrogate pairs as single characters instead of splitting them up into individual surrogates. Also added unit tests of these expression operators.

Fixes #778.
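As an illustration of the behavior described above, the following is a rough sketch of surrogate-pair-aware counterparts to `length`, `index-of`, and `slice`. It is not the implementation in this PR: the helper names (`codePointLength`, `codePointIndexOf`, `codePointSlice`, and so on) are hypothetical, and negative indices are deliberately left out for brevity.

```typescript
// Hypothetical helpers for illustration; not the code added by this PR.

/** True if a high surrogate at code unit `i` is followed by a low surrogate. */
function isSurrogatePairAt(s: string, i: number): boolean {
    const high = s.charCodeAt(i);
    if (high < 0xD800 || high > 0xDBFF || i + 1 >= s.length) return false;
    const low = s.charCodeAt(i + 1);
    return low >= 0xDC00 && low <= 0xDFFF;
}

/** Converts a position counted in code points to a UTF-16 code unit offset. */
function codeUnitOffset(s: string, codePointIndex: number): number {
    let units = 0;
    for (let points = 0; points < codePointIndex && units < s.length; points++) {
        units += isSurrogatePairAt(s, units) ? 2 : 1;
    }
    return units;
}

/** Length in code points: a surrogate pair counts once. */
function codePointLength(s: string): number {
    let points = 0;
    for (let units = 0; units < s.length; points++) {
        units += isSurrogatePairAt(s, units) ? 2 : 1;
    }
    return points;
}

/** Position of `needle` in code points, or -1; `fromIndex` is also in code points. */
function codePointIndexOf(haystack: string, needle: string, fromIndex = 0): number {
    const found = haystack.indexOf(needle, codeUnitOffset(haystack, fromIndex));
    return found < 0 ? -1 : codePointLength(haystack.slice(0, found));
}

/** Substring between two positions counted in code points (non-negative only). */
function codePointSlice(s: string, start: number, end?: number): string {
    const from = codeUnitOffset(s, start);
    const to = end === undefined ? s.length : codeUnitOffset(s, end);
    return s.slice(from, to);
}

const sample = "a\uD83D\uDE00b"; // "a", U+1F600 as a surrogate pair, "b"
console.log(sample.length);                  // 4 (UTF-16 code units)
console.log(codePointLength(sample));        // 3
console.log(sample.indexOf("b"));            // 3
console.log(codePointIndexOf(sample, "b"));  // 2
console.log(codePointSlice(sample, 1, 2) === "\uD83D\uDE00"); // true
```

Walking the string by code unit keeps the sketch independent of iterator support; an equivalent version could iterate the string with `for...of`, which also steps by code point.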
Launch Checklist
- `CHANGELOG.md` under the `## main` section.