Index by grapheme #2458

wipfli · 2023-04-29T21:15:32Z

Sharing this code here because I really like the idea of having better text support in MapLibre GL JS. It is a draft and for inspiration only at this point...

Demo: https://github.com/wipfli/index-by-grapheme

How does it work?

- Always use the TinySDF codepath in MapLibre GL JS (rasterize glyphs in the client).
- Assume that the text in the tiles has been marked up so that characters that should be treated as one cluster are joined with an @ marker:
  - "Hallo" -> ["H", "a", "l", "l", "o"]
  - "H@all@o" -> ["Ha", "l", "lo"]
  - "H@a@llo" -> ["Hal", "l", "o"]
- Use clusters rather than Unicode codepoints to index the glyph atlas.
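
A minimal sketch of how such marked strings could be split into clusters. The function name `splitMarkedText` and its `joiner` parameter are hypothetical and not taken from the PR; the sketch assumes that `@` glues the characters on either side of it into one cluster, as in the examples above.

```typescript
// Hypothetical helper, not part of MapLibre GL JS.
// Splits a string into clusters, merging characters that are linked by `joiner`.
function splitMarkedText(text: string, joiner: string = "@"): string[] {
  const clusters: string[] = [];
  let joinNext = false;
  for (const char of Array.from(text)) { // iterate by code point
    if (char === joiner) {
      // A leading joiner has nothing to attach to, so it is ignored.
      joinNext = clusters.length > 0;
      continue;
    }
    if (joinNext) {
      clusters[clusters.length - 1] += char;
      joinNext = false;
    } else {
      clusters.push(char);
    }
  }
  return clusters;
}

console.log(splitMarkedText("H@all@o")); // [ 'Ha', 'l', 'lo' ]
```

The resulting strings, rather than single codepoints, would then serve as keys into the glyph atlas.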

What can it do?

It can render complex text on point labels and along lines.

Here are some languages:

[screenshot]

Here are some cool cities:

[screenshot]

And more:

- Curved Telugu
- Curved Marathi
- Curved Burmese
- Curved Dzongkha

What can it not do?

I don't know. Feel free to give some feedback if stuff does not work in your language...

Right-to-left languages like Hebrew and Arabic are not handled correctly.


wipfli · 2023-04-29T21:28:14Z

The label in Bengali should be ক্রিসেন্ট লেক (the diacritics have broken continuity; thanks @thehoneymad for the pointer).

wipfli · 2023-04-29T22:52:09Z

The label in Khmer is wrong. It should be ក្រុងសៀមរាប.

ramSeraph · 2023-04-30T04:47:29Z

For Telugu I am seeing clipping. It should be చుండూరు.

maxammann · 2023-04-30T14:32:21Z

Looks very promising!

So has TinySDF already been used in maplibre-gl-js for a long time to go from fonts to SDFs? Where does the font data come from in this case?

It is not yet working in Firefox, as it is a Chrome-only API right now. Though it can probably be polyfilled.

Object { message: "Intl.Segmenter is not a constructor", stack: "" }
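
One way to avoid this error is to feature-detect `Intl.Segmenter` before constructing it. The sketch below is an assumption about how that could look, not code from the PR: `segmentGraphemes` is a hypothetical helper, and the fallback simply splits by code point, which loses cluster information but does not throw.

```typescript
// Hypothetical helper, not part of MapLibre GL JS.
function segmentGraphemes(text: string, locale?: string): string[] {
  // Accessed via `any` so the code also compiles without the ES2022 Intl typings.
  const SegmenterCtor = (Intl as any).Segmenter;
  if (typeof SegmenterCtor !== "function") {
    // Fallback for engines without Intl.Segmenter: one entry per code point.
    return Array.from(text);
  }
  const segmenter = new SegmenterCtor(locale, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (part: any) => part.segment as string);
}

console.log(segmentGraphemes("Hallo")); // [ 'H', 'a', 'l', 'l', 'o' ]
```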

wipfli · 2023-04-30T14:38:02Z

TinySDF is used to render CJK in the client. It uses system fonts and the canvas element. @bdon told me that there are 10k CJK characters and that the ones in use are not close together in Unicode codepoint order.

1ec5 · 2023-04-30T15:40:15Z

See also the discussions in maplibre/maplibre-style-spec#145 (reply in thread) and maplibre/maplibre-native#778 (reply in thread) that led to this PR.

1ec5 · 2023-04-30T16:02:57Z

> It uses system fonts and the canvas element.

System fonts have been a long-requested feature: mapbox/mapbox-gl-native#7862. This PR overcomes the two hurdles blocking that feature: the overreliance on Unicode codepoints and the inability to include combining characters in the same glyph. Besides complex text support, leveraging the browser’s text layout engine also allows font fallback per glyph for free without forcing the style to specify a pan-Unicode fallback font. With some minor tweaks to make TinySDF’s font usage more flexible, this library could even support Web fonts for a more modern alternative to the fontstack mechanism.

> It is not yet working in Firefox, as it is a Chrome-only API right now. Though it can probably be polyfilled.

There has been a patch for Gecko for a few years but it got stalled due to the size it adds to the browser. There are several JavaScript libraries that claim to implement the same Unicode algorithm as Intl.Segmenter, but I guess they would add a similar amount to this library’s size.

> in Bengali should be ক্রিসেন্ট লেক (diacritics have broken continuity

I think it’s important to set expectations for now: this approach still relies on slicing and dicing the string, just not as granularly as before. But grapheme clusters can still foil important complex text features such as initial/medial/final character forms, since these features don’t affect collation or text selection: maplibre/maplibre-native#778 (reply in thread). Maybe we can hack around it using joiner characters, but I don’t know.

Assuming the result looks less broken than before, it would be wonderful to land this improvement, perhaps behind a runtime option like the existing option for CJK in TinySDF. But improving upon this might require either word-based segmentation (which looks rough on curved lines) or ditching the custom text shaper in favor of Harfbuzz – back to square one essentially.

wipfli · 2023-04-30T16:28:27Z

Intl.Segmenter gets some graphemes wrong, which leads to the bugs reported above. What I do now is check whether the concatenation of two graphemes, ${grapheme1}${grapheme2}, renders the same way as consecutive calls to fillText with grapheme1 and grapheme2; see the CanvasComparer class (thanks ChatGPT for writing it!).
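
The CanvasComparer source is not shown in this thread, so the following is only a sketch of the merging idea under stated assumptions: neighbours are merged whenever drawing them together differs from drawing them one after the other. The comparison itself is injected as a predicate (`rendersSame`, a hypothetical name); in the browser it would presumably draw with fillText and compare pixels, while here it is left abstract so the merging logic can be read and tested on its own.

```typescript
// Hypothetical types and names; not the PR's CanvasComparer.
// Returns true if drawing `joined` gives the same pixels as drawing `left`, then `right`.
type RendersSame = (joined: string, left: string, right: string) => boolean;

// Single pass: grow the current cluster while the next grapheme interacts with it.
function mergeGraphemes(graphemes: string[], rendersSame: RendersSame): string[] {
  const clusters: string[] = [];
  for (const grapheme of graphemes) {
    const last = clusters[clusters.length - 1];
    if (last !== undefined && !rendersSame(last + grapheme, last, grapheme)) {
      // The pair renders differently when drawn together, so keep it as one unit.
      clusters[clusters.length - 1] = last + grapheme;
    } else {
      clusters.push(grapheme);
    }
  }
  return clusters;
}

// Stand-in predicate for illustration: pretend only a combining acute accent interacts.
const fakeComparer: RendersSame = (_joined, _left, right) => right !== "\u0301";
console.log(mergeGraphemes(["e", "\u0301", "x"], fakeComparer).length); // 2
```

Each input element triggers at most one comparison, so the cost is dominated by how expensive the predicate is.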

So that CanvasComparer class is super slow and I am positive that one can make it faster. So now the demo is a bit slow (a bit a lot) but it fixes the problems above:

Also the clipping is fixed although that is probably just because I use TinySDF with a larger 200 px canvas (I changed this in the node_modules tinysdf index.js file...):

wipfli · 2023-04-30T16:29:12Z

Let me know if there are still some bugs in the demo now!

wipfli · 2023-04-30T16:31:23Z

If we can make the CanvasComparer class more efficient, we can also drop the Intl.Segmenter and just give the canvas comparer the individual codepoints. For the labels I looked at it worked, but it was even slower than the current version because the input of the canvas comparer was larger...

wipfli · 2023-04-30T16:33:07Z

@bdon we will need your higher-resolution TinySDF version if this should ever be used for real. At the moment, the Latin letters, for example, look very pixelated, and I am sure it is the same for the other languages...

wipfli · 2023-04-30T16:35:44Z

You can open the browser console to see what the graphemes are:

Interestingly, this approach will also do kerning for us. "To" is not the same as "T" and "o"...

wipfli · 2023-04-30T16:39:33Z

If someone could have a look at the CanvasComparer class and make it more efficient, it would be great. Help with this would be super welcome!

1ec5 · 2023-04-30T16:56:35Z

> Interestingly, this approach will also do kerning for us. "To" is not the same as "T" and "o"...

That is very nice, but on a line-placed label, wouldn’t that make the curvature less smooth, kind of choppy? Maybe in a line-placed label, when you detect a difference but it isn’t just one grapheme cluster, then you can break it apart just in case.

wipfli · 2023-04-30T17:04:59Z

For Latin text we can use the browser API to segment.

By the way, here is a cool location to see some Burmese text along a line:

https://wipfli.github.io/index-by-grapheme/#map=15.01/16.80186/96.17123

1ec5 · 2023-04-30T17:07:32Z

The memory usage is so intense that it sends MobileSafari into a crash loop. 🙈

wipfli · 2023-04-30T17:25:45Z

Oops

wipfli · 2023-04-30T18:08:04Z

@1ec5 does it work now? I made it a bit faster by comparing only the parts of the canvases that actually have text. Also, I use a smaller font.

1ec5 · 2023-05-07T23:03:02Z

> In Chrome on Ubuntu, it unfortunately did not work

Wow, that looks pretty ugly. I don’t know why break-all would realistically break on anything more granular than a grapheme cluster, since this property value is intended for display of human-readable text. It looks like possibly a bug in Blink or Harfbuzz.

This is what I see in various browsers I have readily available:

Safari 13.1 on macOS 10.13

Safari and all other browsers on iOS 16.4

Firefox 112.0 on macOS 10.13

SeaMonkey 2.53 (like Firefox 91.0) on macOS 10.13

Chrome 115.0 on macOS 10.13

Since Firefox seems to be handling break-all the best, maybe we can use it as a workaround for Firefox while other (modern versions of) browsers use Intl.Segmenter?

1ec5 · 2023-05-07T23:19:15Z

> I am happy with the assumption that we start in MapLibre GL JS with text that contains explicit cluster information.

That sounds OK, but if you generally expect tilesets to sprinkle zero-width spaces in text regardless of writing system, then you’re effectively forcing word-break: break-all behavior in any point-placed label with text-max-width, in any style, because ZWSP is also a line-breaking opportunity:

maplibre-gl-js/src/symbol/shaping.ts

Line 352 in 972d4e6

[0x200b]: true, // zero-width space

I guess there is some precedent in that Mapbox Streets v8 now inserts zero-width spaces in names – but only in “text that is meant to be rendered on multiple lines”. What the documentation doesn’t say is that it’s also limited to certain writing systems, such as CJK, that don’t use spaces to segment words or break lines. Using it liberally on all writing systems would be a misuse of the character, as far as I can tell.

There are also some unfortunate side-effects to expecting the server side to munge what would normally be human-readable text for presentational purposes. For example, it would interfere with any data-driven styling based on the same feature properties, and some feature querying code could also be affected. For example, the VoiceOver screenreader integration built into the iOS map SDK, which is based on feature querying, would begin spelling out the name of every POI unless the ZWSPs are stripped out.

wipfli · 2023-05-08T14:56:53Z

The line labels should work again now. I updated the tiles.

On Thursday, May 11th, 2023 at 8 AM CEST we will have our next MapLibre Eastern Call and discuss text rendering there. Feel free to join. The Zoom link is in the Slack.

wipfli · 2023-05-08T14:59:16Z

Intl.Segmenter gives you graphemes, but not clusters, so using this browser API does not solve our problem.

The HarfBuzz docs (https://harfbuzz.github.io/clusters.html) say this about clusters and graphemes:

In text shaping, a cluster is a sequence of characters that needs to be treated as a single, indivisible unit. A single letter or symbol can be a cluster of its own. Other clusters correspond to longer subsequences of the input code points — such as a ligature or conjunct form — and require the shaper to ensure that the cluster is not broken during the shaping process.

A cluster is distinct from a grapheme, which is the smallest unit of meaning in a writing system or script.

I should actually rename this pull request to "index by cluster"...

wipfli · 2023-05-08T15:01:50Z

The side-effects of having explicit joining characters can be mitigated by removing them before using text in expressions and voice-over.

Ideally, we could have a default customJoiningCharacter but also offer a style-spec property to let the user specify it.
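
As a rough sketch of that mitigation: before a label string is handed to expressions or a screen reader, the joining character could be stripped out. The helper name `stripJoiningCharacter` and the default of `@` are assumptions for illustration; no such option exists in the style specification.

```typescript
// Hypothetical helper; the joiner default mirrors the "@" marker used in this PR's examples.
function stripJoiningCharacter(text: string, joiner: string = "@"): string {
  // An empty joiner would split between every code unit, so treat it as "nothing to strip".
  return joiner === "" ? text : text.split(joiner).join("");
}

console.log(stripJoiningCharacter("H@all@o")); // Hallo
```

Note that this only works cleanly if the joiner never occurs as a meaningful character in the text itself; any legitimate occurrence would be removed as well.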

wipfli · 2023-05-08T15:03:17Z

@1ec5 the demo was using OffscreenCanvas, which is probably why it did not work on your iPhone 8. Now I removed the OffscreenCanvas and use the normal DOM canvas again. Does it work for you now?

brawer · 2023-05-08T15:24:21Z

@wipfli Try FontView to see HarfBuzz (+Raqm+FriBiDi) acting on a single font file. There’s no need for grapheme clustering in this code; before calling into HarfBuzz, Raqm asks FriBiDi for bidi runs, and Raqm has its own (small) logic for splitting script runs. There’s also a demo of HarfBuzz in a browser which might perhaps be more relevant for MapLibre GL JS, but it’s the same HarfBuzz library called underneath.

wipfli · 2023-05-09T21:11:55Z

A fun side-effect of generating the SDFs in the client is that we can use web fonts:

We already have the map.localIdeographFontFamily option. I've used this one to let the user configure the font family in the demo via the fontFamily part in the URL:

Serif

https://wipfli.github.io/index-by-grapheme/#map=4.82/47.76/12.2&fontFamily=serif

Monospace

https://wipfli.github.io/index-by-grapheme/#map=4.82/47.76/12.2&fontFamily=monospace
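
How the demo reads the URL is not shown in this thread; the sketch below is one assumed way to pull `fontFamily` out of a hash like the ones above. `parseHashOptions` is a hypothetical helper. The parsed value could then be passed to the existing `localIdeographFontFamily` map option.

```typescript
// Hypothetical helper: parses "#key=value&key2=value2" into a plain object.
function parseHashOptions(hash: string): Record<string, string> {
  const options: Record<string, string> = {};
  for (const pair of hash.replace(/^#/, "").split("&")) {
    const separator = pair.indexOf("=");
    if (separator <= 0) continue; // skip empty or malformed entries
    const key = decodeURIComponent(pair.slice(0, separator));
    options[key] = decodeURIComponent(pair.slice(separator + 1));
  }
  return options;
}

const demoOptions = parseHashOptions("#map=4.82/47.76/12.2&fontFamily=serif");
console.log(demoOptions.fontFamily); // serif
// In the browser one might then write something like:
//   new maplibregl.Map({ /* ... */ localIdeographFontFamily: demoOptions.fontFamily });
```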

1ec5 · 2023-05-09T21:59:06Z

> Intl.Segmenter gives you graphemes, but not clusters, so using this browser API does not solve our problem.

Intl.Segmenter gives you “grapheme clusters”, which Harfbuzz calls “clusters” for short. If it gave you just graphemes, it would be equivalent to passing the empty string into String.prototype.split.

From the same documentation:

For example, two individual letters are often two separate graphemes. When two letters form a ligature, however, they combine into a single glyph. They are then part of the same cluster and are treated as a unit by the shaping engine — even though the two original, underlying letters remain separate graphemes.

Intl.Segmenter won’t always give you perfect results. It has no context about the font, and there are different interpretations of what should constitute a (grapheme) cluster, for example based on whether the font happens to create a ligature at a given font size. I view the whole grapheme cluster idea as a stopgap, but one that’s less onerous on both tileset generators and application developers than littering the text with ZWSPs.
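
To make the granularity levels concrete, here is a small sketch comparing `String.prototype.split("")`, iteration by code point, and `Intl.Segmenter` with grapheme granularity. The function name `countSegments` is made up for this example, and the last count falls back to code points on engines without `Intl.Segmenter`.

```typescript
// Hypothetical helper for illustration only.
function countSegments(text: string): { codeUnits: number; codePoints: number; clusters: number } {
  const codeUnits = text.split("");    // UTF-16 code units
  const codePoints = Array.from(text); // Unicode code points
  const SegmenterCtor = (Intl as any).Segmenter;
  const clusters: string[] = typeof SegmenterCtor === "function"
    ? Array.from(
        new SegmenterCtor(undefined, { granularity: "grapheme" }).segment(text),
        (part: any) => part.segment as string
      )
    : codePoints; // fallback where Intl.Segmenter is missing
  return { codeUnits: codeUnits.length, codePoints: codePoints.length, clusters: clusters.length };
}

// "e" + combining acute accent, followed by a thumbs-up emoji (a surrogate pair in UTF-16)
console.log(countSegments("e\u0301\uD83D\uDC4D"));
```

With `Intl.Segmenter` available, the accent stays attached to its base letter and the emoji stays whole, so the cluster count is lower than both other counts.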

> The side-effects of having explicit joining characters can be mitigated by removing them before using text in expressions and voice-over.

There is still data loss. In some languages like Thai, and to a lesser extent Chinese,¹ ZWSPs or soft hyphens are typically used as word boundaries, analogous to the spaces in Latin. Overloading ZWSP to also represent a grapheme cluster boundary prevents GL JS from word-wrapping at ZWSPs as users expect. Stripping ZWSP from feature properties doesn’t solve this problem, but it does expand the problem, preventing natively rendered text from behaving correctly too.

¹ Well-typeset maps in Chinese avoid breaking up character compounds (which are mostly two or three Chinese characters long). Otherwise, it’s very easy for a label to accidentally say something very naughty or even illegal if the reader doesn’t realize the lexeme has been broken apart.

1ec5 · 2023-05-09T22:05:26Z

> the demo was using OffscreenCanvas, which is probably why it did not work on your iPhone 8. Now I removed the OffscreenCanvas and use the normal DOM canvas again. Does it work for you now?

Yes.

wipfli · 2023-05-10T07:20:34Z

Thanks for the insight @1ec5. I think we can use a custom character to describe where joining should happen. That way, we can avoid conflicts in Thai and other languages.

wipfli · 2023-05-17T07:29:58Z

If we somehow could encode the font used when doing the server-side text segmentation, then we could do really cool stuff like using Noto Nastaliq Urdu for Persian labels.

Here is an example where I use a nastaliq font by default in tinysdf. As a result, all Arabic labels show up in nastaliq:

// in tinysdf
-  ctx.font = `${fontStyle} ${fontWeight} ${fontSize}px ${fontFamily}`;
+  ctx.font = `${fontStyle} ${fontWeight} ${fontSize}px 'Noto Nastaliq Urdu',Verdana,sans`;

https://wipfli.github.io/index-by-grapheme/nastaliq/#map=5.09/31.62/68.39

1ec5 · 2023-05-19T14:55:43Z

> If we somehow could encode the font used when doing the server-side text segmentation, then we could do really cool stuff like using Noto Nastaliq Urdu for Persian labels.

Yes, that would be wonderful, also for distinguishing between Chinese and Japanese variants of the same Unicode codepoints. This presupposes that the tiles or TileJSON somehow indicate the language of the field(s) being inserted into text-field – or maybe that the style indicate the language, in the case of an Americana-like code generation mechanism.

wipfli · 2023-05-20T08:45:40Z

Following the HarfBuzz simple shaping example (https://harfbuzz.github.io/a-simple-shaping-example.html), one needs the following ingredients for correct text shaping:

- the text itself
- the direction, language, and script
- the font

I think we can encode all of the above information in the tiles. A trivial way of doing it would be for example to use JSON strings like this one:

{
  "text": "Oliver",
  "direction": "ltr",
  "language": "en",
  "script": "latin",
  "font": "Noto Sans Regular"
}
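
A sketch of how a client could read such a JSON string from a tile property. The `LabelShapingInfo` type and `parseLabelShapingInfo` function are hypothetical names introduced here; the field names follow the JSON above. Returning `null` lets the caller fall back to treating the property as plain text.

```typescript
// Hypothetical shape of the JSON label proposed above.
interface LabelShapingInfo {
  text: string;
  direction: "ltr" | "rtl";
  language: string;
  script: string;
  font: string;
}

// Returns the parsed label, or null if the property is not a well-formed JSON label.
function parseLabelShapingInfo(raw: string): LabelShapingInfo | null {
  let value: any;
  try {
    value = JSON.parse(raw);
  } catch (error) {
    return null; // not JSON, so the caller can treat it as plain text
  }
  if (typeof value !== "object" || value === null) return null;
  for (const field of ["text", "direction", "language", "script", "font"]) {
    if (typeof value[field] !== "string") return null;
  }
  if (value.direction !== "ltr" && value.direction !== "rtl") return null;
  return value as LabelShapingInfo;
}

const label = parseLabelShapingInfo(
  '{"text":"Oliver","direction":"ltr","language":"en","script":"latin","font":"Noto Sans Regular"}'
);
console.log(label ? label.text : "plain text"); // Oliver
```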

Note that in the canvas you cannot set the language, but one could use for example html-to-image https://www.npmjs.com/package/html-to-image instead of the canvas. Like that, we could do stuff like showing CJK in different languages:

<span lang="zh-Hant">令</span> -> 令
<span lang="ja">令</span> -> 令

I am still a bit unsure why HarfBuzz needs to know the script. Does it maybe have something to do with Arabic/Urdu/Nastaliq?

1ec5 · 2023-05-20T09:30:47Z

> I think we can encode all of the above information in the tiles.

In principle, yes, although GL JS has never made such detailed assumptions about the tiles’ contents up to now. Instead, it has relied on TileJSON (or the inline TileJSON inside the style JSON) to describe the tiles. I think it would be prudent to extend that approach rather than make implicit assumptions. For one thing, the most popular OSM-based tilesets contain multiple name fields in various languages, not to mention a generic name field whose language is undetermined. The TileJSON could include an object that maps properties to their languages.

There is a separation of concerns between TileJSON and the style JSON. Fonts are typically defined in the latter, and I mostly don’t see a reason to depart from that approach for this feature. The iOS SDK already interprets the text-font property as either a fontstack (for server-side rendering) or a list of local font names (for client-side rendering). By analogy, you’d just set ctx.font to the evaluated text-font value.

> Note that in the canvas you cannot set the language, but one could use for example html-to-image https://www.npmjs.com/package/html-to-image instead of the canvas. Like that, we could do stuff like showing CJK in different languages:

Clever library – it works by embedding the HTML element in an SVG document, creating an HTML image out of the SVG, and rendering the image into a canvas.

But assuming that the whole glyph belongs to a single language, there’s a much simpler solution: just set the <canvas> element’s lang attribute to the text’s language, and the browser will select the fonts accordingly. For example, try changing zh to ja in this demo; you should see the inner strokes of 海 change just as in https://github.com/ZeLonewolf/openstreetmap-americana/issues/613#issuecomment-1378543457. (In your TinySDF-based demo, it would probably involve setting ctx.canvas.lang.)
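
Putting the two suggestions together (setting ctx.font from the evaluated text-font value and setting the canvas element's lang), a sketch might look like the following. Everything here is an assumption for illustration: `toCanvasFont`, `configureGlyphContext`, and the `GlyphContext` type are hypothetical, and the entries of text-font are treated as plain CSS family names, which will not hold for fontstack names that include a weight such as "Noto Sans Regular".

```typescript
// CSS generic families must not be quoted; this list is a small illustrative subset.
const GENERIC_FAMILIES = ["serif", "sans-serif", "monospace", "cursive", "fantasy", "system-ui"];

// Builds a CSS font shorthand in the same order TinySDF uses: style, weight, size, family list.
function toCanvasFont(textFont: string[], sizePx: number, weight: string = "normal", style: string = "normal"): string {
  const families = textFont.map((name) =>
    GENERIC_FAMILIES.indexOf(name) >= 0 ? name : '"' + name.replace(/"/g, '\\"') + '"'
  );
  return `${style} ${weight} ${sizePx}px ${families.join(", ")}`;
}

// Minimal structural type; a real CanvasRenderingContext2D has these members.
interface GlyphContext {
  font: string;
  canvas: { lang: string };
}

function configureGlyphContext(ctx: GlyphContext, textFont: string[], language: string, sizePx: number): void {
  ctx.font = toCanvasFont(textFont, sizePx);
  ctx.canvas.lang = language; // lets the browser pick language-specific fonts
}

console.log(toCanvasFont(["Noto Nastaliq Urdu", "serif"], 24));
// normal normal 24px "Noto Nastaliq Urdu", serif
```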

> I am still a bit unsure why HarfBuzz needs to know the script. Does it maybe have something to do with Arabic/Urdu/Nastaliq?

I’m not entirely sure, but maybe HarfBuzz doesn’t maintain a mapping from language codes to default scripts? There are also plenty of edge cases, such as punctuation characters that don’t inherently belong to one script or another, but that different fonts might treat differently depending on the language.

brawer · 2023-05-20T20:36:49Z

> I am still a bit unsure why HarfBuzz needs to know the script. Does it maybe have something to do with Arabic/Urdu/Nastaliq?

For better or worse, this is due to how OpenType works internally. No, it’s unrelated to Nastaliq. Rather, the script is a property of the Unicode sequence being rendered, as defined by Unicode Standard Annex #24. Before calling HarfBuzz, you need to split the string into “script runs”, which are sequences of characters that have the same script, and call HarfBuzz separately for each run. For example, if a label ローソンATM is tagged with language ja, you’ll have to call HarfBuzz twice: once for ローソン with script Kana and language ja, and once for ATM with script Latn and (again) language ja.

The process of splitting text into script runs is called “script itemization”. There are some subtleties around punctuation and emoji, and unfortunately, the algorithm has never been formally defined. My personal recommendation would be to leave this all up to a higher-level library like Raqm or Minikin. If you really have to implement it yourself, check out what Raqm does in _raqm_itemize() and _raqm_resolve_scripts(). Should you really want to go down this rabbit hole, w3c/font-text-cg#37 might be a good starting point, along with this comparison of existing implementations.

For a general introduction, see Text layout is a loose hierarchy of segmentation.
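
To illustrate what script itemization produces, here is a deliberately simplified sketch. It is not what Raqm or HarfBuzz do: it knows only a handful of scripts, ignores paired punctuation and emoji handling, and simply attaches Common and Inherited characters to the neighbouring run. All names are hypothetical, and the script labels are the long Unicode names (Katakana, Latin) rather than the four-letter codes (Kana, Latn).

```typescript
interface ScriptRun {
  text: string;
  script: string;
}

// Small illustrative subset; a real itemizer would cover every script.
const KNOWN_SCRIPTS = ["Latin", "Katakana", "Hiragana", "Han", "Arabic", "Hebrew", "Cyrillic"];

// Built with the RegExp constructor so the Unicode property escapes are resolved at runtime.
const SCRIPT_MATCHERS = KNOWN_SCRIPTS.map((name) => ({
  name,
  pattern: new RegExp("^\\p{Script=" + name + "}$", "u"),
}));
const NEUTRAL_PATTERN = new RegExp("^[\\p{Script=Common}\\p{Script=Inherited}]$", "u");

function scriptOf(char: string): string {
  if (NEUTRAL_PATTERN.test(char)) return "Neutral";
  for (const matcher of SCRIPT_MATCHERS) {
    if (matcher.pattern.test(char)) return matcher.name;
  }
  return "Other";
}

function itemizeByScript(text: string): ScriptRun[] {
  const runs: ScriptRun[] = [];
  for (const char of Array.from(text)) {
    const script = scriptOf(char);
    const last = runs[runs.length - 1];
    if (last && (script === "Neutral" || script === last.script)) {
      last.text += char; // neutral characters join the current run
    } else if (last && last.script === "Neutral") {
      last.text += char; // a leading neutral run adopts the first real script
      last.script = script;
    } else {
      runs.push({ text: char, script });
    }
  }
  return runs;
}

console.log(itemizeByScript("ローソンATM"));
```

For the label from the comment above, this yields one Katakana run followed by one Latin run; the prolonged sound mark ー has the script property Common and is therefore absorbed into the Katakana run.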

wipfli · 2023-05-22T06:24:28Z

Fascinating stuff, thanks @brawer! I think I will stick to Raqm, because when we render text with the canvas object from JavaScript, we have basically the same API as Raqm, which takes:

- text
- language
- direction
- font

Since we cannot set the script in an HTML canvas, we probably do not need to ship it to the client.

brawer · 2023-05-22T07:23:38Z

> I think I will stick to Raqm [instead of building a custom text rendering stack]

Sounds wise. From what I can see, the main differences to Minikin are:

- line breaking;
- font fallback.

Regarding line breaking, @khaledhosny once wrote a branch for Raqm but according to HOST-Oman/libraqm#50 it won’t get merged because libunibreak is better at finding line breaking opportunities. However, I’m not sure if Raqm can already call libunibreak; Khaled would know best. Note that Minikin also does hyphenation (using LaTeX hyphenation dictionaries), whereas libunibreak just implements the Unicode line breaking algorithm. But the latter is probably good enough for rendering map labels.

Regarding font fallback, it would be good to know how big an issue it really is for MapLibre. Since you’re already running Raqm on OpenStreetMap, can you count how many missing characters (glyph index zero) you see in the output glyph vectors?

1ec5 · 2023-05-22T09:43:59Z

> Note that Minikin also does hyphenation (using LaTeX hyphenation dictionaries), whereas libunibreak just implements the Unicode line breaking algorithm. But the latter is probably good enough for rendering map labels.

The homegrown line breaking code in GL JS implements LaTeX-style line balancing (not hyphenation), which as far as I know isn’t part of the Unicode line breaking algorithm. Line balancing keeps point-placed labels looking tidy. Without it, text-max-width effectively determines the length of the first line of the label, which mostly defeats the purpose of line wrapping.
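
A toy illustration of line balancing, to contrast it with greedy wrapping. This is not the GL JS implementation: it measures width in characters rather than glyph advances, breaks only at spaces, and uses a plain squared-deviation cost. `balanceLines` is a hypothetical name. The idea is to derive a target width from the total length and the number of lines needed, then choose breaks that keep every line close to that target.

```typescript
// Hypothetical sketch of balanced line breaking via dynamic programming.
function balanceLines(words: string[], maxWidth: number): string[] {
  if (words.length === 0) return [];
  const total = words.join(" ").length;
  const lineCount = Math.max(1, Math.ceil(total / maxWidth));
  const target = total / lineCount; // ideal width of each line

  const n = words.length;
  // best[i]: minimal cost of laying out words[i..]; next[i]: index where the following line starts.
  const best: number[] = [];
  const next: number[] = [];
  for (let i = 0; i <= n; i++) {
    best.push(Infinity);
    next.push(n);
  }
  best[n] = 0;
  for (let i = n - 1; i >= 0; i--) {
    let width = -1; // cancels the space counted before the first word
    for (let j = i; j < n; j++) {
      width += words[j].length + 1;
      const cost = Math.pow(width - target, 2) + best[j + 1];
      if (cost < best[i]) {
        best[i] = cost;
        next[i] = j + 1;
      }
    }
  }

  const lines: string[] = [];
  for (let i = 0; i < n; i = next[i]) {
    lines.push(words.slice(i, next[i]).join(" "));
  }
  return lines;
}

console.log(balanceLines(["aaa", "bb", "cc", "ddddd"], 10)); // [ 'aaa bb', 'cc ddddd' ]
```

A greedy wrap at width 10 would give "aaa bb cc" and "ddddd" (9 and 5 characters), whereas the balanced result has lines of 6 and 8 characters.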

1ec5 · 2023-05-22T09:53:14Z

> Regarding font fallback, it would be good to know how big an issue it really is for MapLibre. Since you’re already running Raqm on OpenStreetMap, can you count how many missing characters (glyph index zero) you see in the output glyph vectors?

This would primarily be a consideration when using OSM’s local-language name key throughout the world, as in OSM Americana or any style that aims to reproduce openstreetmap-carto.

The status quo of server-side glyph rasterization all but forces the style designer to specify a pan-Unicode font as the last font in the font fallback list. (The iOS SDK removes this font from the list, in favor of the system font fallbacks, when rendering glyphs locally.)

Even then, GL JS occasionally runs into its lack of support for non-BMP characters: mapbox/mapbox-gl-js#4001 (comment).

HarelM · 2023-06-28T18:04:12Z

@wipfli what's the status of this PR?

wipfli · 2023-06-29T07:17:26Z

I think this was a successful proof of concept. The next step would be to write a design proposal for the style specification.

Do you think this is the right direction?

HarelM · 2023-06-29T20:15:29Z

Sure, my main question was about the reasons to keep this PR open...

wipfli · 2023-06-30T07:01:40Z

We can close it. The branch will continue to exist in my repo.

…ecially for latin scripts. Noto Sans Living with unicode ranges in CSS can be used for dynamic loading of script range from canvas/TinySDF

wipfli added 4 commits April 29, 2023 12:22

Use strings for unicode numbers

ee1d53e

Add comments

a5d4fcc

Step 1: Use strings as indices

fc05d73

Add example

94a0597

wipfli marked this pull request as draft April 29, 2023 21:16

Index by word

fa63d46

Add canvas comparer

04b6751

Optimize

c68cde0

wipfli added 4 commits April 30, 2023 20:32

Optimize more by single pass joining

aff6cf4

Optimize with static canvas

cf1606b

Optimize with 3 px font size

af96623

Only merge graphemes on non-latin text

47e1a19

Use roadnames data source with marked strings

f3fe44c

HarelM mentioned this pull request May 11, 2023

Character alignment issue when CJK glyphs are mixed with other characters. #1051

Open

wipfli closed this Jun 30, 2023

1ec5 mentioned this pull request Aug 19, 2024

Render complex text, variant forms, emoji, etc. 1ec5/maplibre-gl-js#1

Draft

HarelM mentioned this pull request Dec 23, 2024

Support for Displaying Japanese Long Vowel Mark (ー) in Vertical Layout #5259

Open
