Index by grapheme #2458
in Bengali should be ক্রিসেন্ট লেক (diacritics have broken continuity, thanks @thehoneymad for the pointer) |
in Khmer is wrong. Should be ក្រុងសៀមរាប |
For Telugu seeing clipping. Should be చుండూరు |
Looks very promising! So has tinysdf been used in maplibre-gl-js for a long time already to go from fonts to SDFs? Where does the font data come from in this case? It is not yet working in Firefox, as it is a Chrome-only API right now, though it can probably be polyfilled.
|
TinySDF is used to render CJK in the client. It uses system fonts and the canvas element. @bdon told me that there are 10k CJK characters and the used ones are not close together in unicode numbers |
See also the discussion in maplibre/maplibre-style-spec#145 (reply in thread) maplibre/maplibre-native#778 (reply in thread) that led to this PR. |
System fonts have been a long-requested feature: mapbox/mapbox-gl-native#7862. This PR overcomes the two hurdles blocking that feature: the overreliance on Unicode codepoints and the inability to include combining characters in the same glyph. Besides complex text support, leveraging the browser’s text layout engine also allows font fallback per glyph for free without forcing the style to specify a pan-Unicode fallback font. With some minor tweaks to make TinySDF’s font usage more flexible, this library could even support Web fonts for a more modern alternative to the fontstack mechanism.
There has been a patch for Gecko for a few years but it got stalled due to the size it adds to the browser. There are several JavaScript libraries that claim to implement the same Unicode algorithm as
I think it’s important to set expectations for now: this approach still relies on slicing and dicing the string, just not as granularly as before. But grapheme clusters can still foil important complex text features such as initial/medial/final character forms, since these features don’t affect collation or text selection: maplibre/maplibre-native#778 (reply in thread). Maybe we can hack around it using joiner characters, but I don’t know. Assuming the result looks less broken than before, it would be wonderful to land this improvement, perhaps behind a runtime option like the existing option for CJK in TinySDF. But improving upon this might require either word-based segmentation (which looks rough on curved lines) or ditching the custom text shaper in favor of Harfbuzz – back to square one essentially. |
So that CanvasComparer class is super slow and I am positive that one can make it faster. So now the demo is a bit slow (a bit a lot) but it fixes the problems above: Also the clipping is fixed although that is probably just because I use TinySDF with a larger 200 px canvas (I changed this in the |
Let me know if there are still some bugs in the demo now! |
If we can make the CanvasComparer class more efficient, we can also drop the |
@bdon we will need your higher resolution TinySDF version if this should ever be used for real. At the moment, the Latin letters for example look very pixelated and I am sure it is the same for the other languages... |
If someone could have a look at the |
That is very nice, but on a line-placed label, wouldn’t that make the curvature less smooth, kind of choppy? Maybe in a line-placed label, when you detect a difference but it isn’t just one grapheme cluster, then you can break it apart just in case. |
For Latin text we can use the browser API to segment. By the way, here is a cool location to see some Burmese text along a line: https://wipfli.github.io/index-by-grapheme/#map=15.01/16.80186/96.17123 |
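For reference, a minimal sketch of grapheme segmentation in the browser. It assumes the "browser API" meant here is `Intl.Segmenter` (the thread does not name it), and the helper name `graphemes` is made up for illustration. Where the API is missing, the sketch falls back to plain code points, which is not a real substitute for complex scripts.

```typescript
// Hypothetical helper, not part of the PR.
// Assumption: the browser API in question is Intl.Segmenter.
function graphemes(text: string, locale?: string): string[] {
  // Cast through `any` so the sketch also compiles with older TypeScript lib settings.
  const SegmenterCtor = (Intl as any).Segmenter;
  if (typeof SegmenterCtor !== "function") {
    // No Intl.Segmenter in this runtime: fall back to one entry per code point.
    return Array.from(text);
  }
  const segmenter = new SegmenterCtor(locale, { granularity: "grapheme" });
  const out: string[] = [];
  for (const part of segmenter.segment(text)) {
    out.push(part.segment);
  }
  return out;
}
```

With the API available, a base letter followed by a combining mark comes back as a single entry, so it could be drawn as one glyph.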
The memory usage is so intense that it sends MobileSafari into a crash loop. 🙈 |
Houps |
@1ec5 does it work now? I made it a bit faster by comparing only the parts of the canvases that actually have text. Also, I use a smaller font. |
I updated the code and removed the segmentation in the client. With the new version, I assume that the tiles contain strings which have marks between characters which should be treated as a cluster. I used the If the user inputs
The point labels in the demo have now such marked strings with the Here is the script I used to generate marked strings: https://github.com/wipfli/swiss-map/blob/main/planetiler/cluster/index.js It is a rudimentary script and some things would need to be improved, in particular it did not seem to get the Khmer labels right. But I am happy with the assumption that we start in MapLibre GL JS with text that contains explicit cluster information. |
That sounds OK, but if you generally expect tilesets to sprinkle zero-width spaces in text regardless of writing system, then you’re effectively forcing maplibre-gl-js/src/symbol/shaping.ts Line 352 in 972d4e6
I guess there is some precedent in that Mapbox Streets v8 now inserts zero-width spaces in names – but only in “text that is meant to be rendered on multiple lines”. What the documentation doesn’t say is that it’s also limited to certain writing systems, such as CJK, that don’t use spaces to segment words or break lines. Using it liberally on all writing systems would be a misuse of the character, as far as I can tell. There are also some unfortunate side-effects to expecting the server side to munge what would normally be human-readable text for presentational purposes. For example, it would interfere with any data-driven styling based on the same feature properties, and some feature querying code could also be affected. For example, the VoiceOver screenreader integration built into the iOS map SDK, which is based on feature querying, would begin spelling out the name of every POI unless the ZWSPs are stripped out. |
The line labels should work again now. I updated the tiles. On Thursday, May 11th, 2023 at 8 AM CEST we will have our next MapLibre Eastern Call and discuss text rendering there. Feel free to join. The Zoom link is in the Slack. |
Harfbuzz docs: https://harfbuzz.github.io/clusters.html says this about clusters and graphemes:
I should actually rename this pull request to "index by cluster"... |
The side-effects of having explicit joining characters can be mitigated by removing them before using text in expressions and voice-over. Ideally, we could have a default |
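A rough sketch of that mitigation, assuming the mark is a single known character and that feature properties arrive as a plain object. The names `stripClusterMarks` and `cleanProperties` are hypothetical and not part of MapLibre GL JS.

```typescript
// Hypothetical helpers, not an existing MapLibre GL JS API.
// Remove every occurrence of the cluster mark from one string.
function stripClusterMarks(text: string, mark: string): string {
  return text.split(mark).join("");
}

// Return a copy of the feature properties with the mark removed from all string values.
function cleanProperties(
  properties: Record<string, unknown>,
  mark: string
): Record<string, unknown> {
  const cleaned: Record<string, unknown> = {};
  for (const key of Object.keys(properties)) {
    const value = properties[key];
    cleaned[key] = typeof value === "string" ? stripClusterMarks(value, mark) : value;
  }
  return cleaned;
}
```

If the mark is a character that also occurs legitimately in names, stripping removes those occurrences as well.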
@1ec5 the demo was using |
@wipfli Try FontView to see HarfBuzz (+Raqm+FriBiDi) acting on a single font file. There’s no need for grapheme clustering in this code; before calling into HarfBuzz, Raqm asks FriBiDi for bidi runs, and Raqm has its own (small) logic for splitting script runs. There’s also a demo of HarfBuzz in a browser which might perhaps be more relevant for MapLibre GL JS, but it’s the same HarfBuzz library called underneath. |
A fun side-effect of generating the SDFs in the client is that we can use web fonts: We already have the
Serif: https://wipfli.github.io/index-by-grapheme/#map=4.82/47.76/12.2&fontFamily=serif
Monospace: https://wipfli.github.io/index-by-grapheme/#map=4.82/47.76/12.2&fontFamily=monospace |
From the same documentation:
There is still data loss. In some languages like Thai, and to a lesser extent Chinese, ZWSPs or soft hyphens are typically used as word boundaries, analogous to the spaces in Latin. Overloading ZWSP to also represent a grapheme cluster boundary prevents GL JS from word-wrapping at ZWSPs as users expect. Stripping ZWSP from feature properties doesn’t solve this problem, but it does expand the problem, preventing natively rendered text from behaving correctly too.
|
Yes. |
Thanks for the insight @1ec5. I think we can use a custom character to describe where joining should happen. That way, we can avoid conflicts in Thai and other languages. |
If we somehow could encode the font used when doing the server-side text segmentation, then we could do really cool stuff like using Noto Nastaliq Urdu for Persian labels. Here is an example where I use a nastaliq font by default in tinysdf. As a result, all Arabic labels show up in nastaliq:
// in tinysdf
- ctx.font = `${fontStyle} ${fontWeight} ${fontSize}px ${fontFamily}`;
+ ctx.font = `${fontStyle} ${fontWeight} ${fontSize}px 'Noto Nastaliq Urdu',Verdana,sans`;
https://wipfli.github.io/index-by-grapheme/nastaliq/#map=5.09/31.62/68.39 |
Yes, that would be wonderful, also for distinguishing between Chinese and Japanese variants of the same Unicode codepoints. This presupposes that the tiles or TileJSON somehow indicate the language of the field(s) being inserted into |
Following the HarfBuzz simple shaping example (https://harfbuzz.github.io/a-simple-shaping-example.html), one needs the following ingredients for correct text shaping:
I think we can encode all of the above information in the tiles. A trivial way of doing it would be for example to use JSON strings like this one:
{
  "text": "Oliver",
  "direction": "ltr",
  "language": "en",
  "script": "latin",
  "font": "Noto Sans Regular"
}
Note that in the canvas you cannot set the language, but one could use for example html-to-image (https://www.npmjs.com/package/html-to-image) instead of the canvas. Like that, we could do stuff like showing CJK in different languages:
I am still a bit unsure why HarfBuzz needs to know the script. Does it maybe have something to do with Arabic/Urdu/Nastaliq? |
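To make the JSON idea above a bit more concrete, here is a sketch of how a client could type and validate such a label string. The field names come from the example in this comment; the `ShapedLabel` interface and the `parseLabel` function are hypothetical and not part of any tile or style specification.

```typescript
// Hypothetical shape of a label carrying its own shaping hints.
interface ShapedLabel {
  text: string;
  direction: "ltr" | "rtl";
  language: string;
  script: string;
  font: string;
}

// Parse a tile string; return null when it is not a well-formed label object,
// so the caller can fall back to treating the string as plain text.
function parseLabel(raw: string): ShapedLabel | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof value !== "object" || value === null) return null;
  const fields = value as Record<string, unknown>;
  const text = fields["text"];
  const dir = fields["direction"];
  const language = fields["language"];
  const script = fields["script"];
  const font = fields["font"];
  const direction = dir === "rtl" ? "rtl" : dir === "ltr" ? "ltr" : null;
  if (direction === null) return null;
  if (typeof text !== "string" || typeof language !== "string") return null;
  if (typeof script !== "string" || typeof font !== "string") return null;
  return { text, direction, language, script, font };
}
```

Returning null for anything that does not parse keeps existing tiles with plain-text names working unchanged.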
In principle, yes, although GL JS has never made such detailed assumptions about the tiles’ contents up to now. Instead, it has relied on TileJSON (or the inline TileJSON inside the style JSON) to describe the tiles. I think it would be prudent to extend that approach rather than make implicit assumptions. For one thing, the most popular OSM-based tilesets contain multiple name fields in various languages, not to mention a generic There is a separation of concerns between TileJSON and the style JSON. Fonts are typically defined in the latter, and I mostly don’t see a reason to depart from that approach for this feature. The iOS SDK already interprets the
Clever library – it works by embedding the HTML element in an SVG document, creating an HTML image out of the SVG, and rendering the image into a canvas. But assuming that the whole glyph belongs to a single language, there’s a much simpler solution: just set the
I’m not entirely sure, but maybe HarfBuzz doesn’t maintain a mapping from language codes to default scripts? There are also plenty of edge cases, such as punctuation characters that don’t inherently belong to one script or another, but that different fonts might treat differently depending on the language. |
For better or worse, this is due to how OpenType works internally. No, it’s unrelated to Nastaliq. Rather, the script is a property of the Unicode sequence being rendered, as defined by Unicode Annex 24. Before calling HarfBuzz, you need to split the string into “script runs”, which are sequences of characters that have the same script, and call HarfBuzz separately for each run. For example, if a label For a general introduction, see Text layout is a loose hierarchy of segmentation. |
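A simplified sketch of splitting a string into script runs, to illustrate the step described above. It is not how Raqm or HarfBuzz do it: it only knows a short, hand-picked list of scripts, ignores Script_Extensions, and lets every other character (spaces, punctuation, unlisted scripts) join the run in progress. The names `scriptOf` and `scriptRuns` are made up for the example.

```typescript
interface ScriptRun {
  script: string;
  text: string;
}

// Small, incomplete list; a real implementation would cover all of Unicode Annex 24.
const SCRIPT_NAMES = ["Latin", "Arabic", "Hebrew", "Cyrillic", "Han", "Thai", "Khmer", "Bengali", "Telugu", "Myanmar"];
const SCRIPT_MATCHERS = SCRIPT_NAMES.map((name) => ({
  name,
  re: new RegExp("\\p{Script=" + name + "}", "u"),
}));

// Return the script of one code point, or null for Common, Inherited, or unlisted scripts.
function scriptOf(ch: string): string | null {
  for (const matcher of SCRIPT_MATCHERS) {
    if (matcher.re.test(ch)) return matcher.name;
  }
  return null;
}

function scriptRuns(text: string): ScriptRun[] {
  const runs: ScriptRun[] = [];
  let current: ScriptRun | null = null;
  for (const ch of text) {
    const script = scriptOf(ch);
    if (current === null) {
      current = { script: script ?? "Unknown", text: ch };
    } else if (script === null || script === current.script) {
      current.text += ch; // neutral characters stay with the run in progress
    } else if (current.script === "Unknown") {
      current.script = script; // leading neutral characters adopt the first real script
      current.text += ch;
    } else {
      runs.push(current);
      current = { script, text: ch };
    }
  }
  if (current !== null) runs.push(current);
  return runs;
}
```

Each run would then be handed to the shaper separately, together with its script.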
Fascinating stuff, thanks @brawer! I think I will stick to Raqm because when we render text with the canvas object from JavaScript, we have basically the same API as Raqm, which is
So since we cannot set the script in an HTML canvas, we probably do not need to ship it to the client. |
Sounds wise. From what I can see, the main differences to Minikin are:
Regarding line breaking, @khaledhosny once wrote a branch for Raqm but according to HOST-Oman/libraqm#50 it won’t get merged because libunibreak is better at finding line breaking opportunities. However, I’m not sure if Raqm can already call libunibreak; Khaled would know best. Note that Minikin also does hyphenation (using LaTeX hyphenation dictionaries), whereas libunibreak just implements the Unicode line breaking algorithm. But the latter is probably good enough for rendering map labels. Regarding font fallback, it would be good to know how big an issue it really is for MapLibre. Since you’re already running Raqm on OpenStreetMap, can you count how many missing characters (glyph index zero) you see in the output glyph vectors? |
The homegrown line breaking code in GL JS implements LaTeX-style line balancing (not hyphenation), which as far as I know isn’t part of the Unicode line breaking algorithm. Line balancing keeps point-placed labels looking tidy. Without it, |
This would primarily be a consideration when using OSM’s local-language The status quo of server-side glyph rasterization all but forces the style designer to specify a pan-Unicode font as the last font in the font fallback list. (The iOS SDK removes this font from the list, in favor of the system font fallbacks, when rendering glyphs locally.) Even then, GL JS occasionally runs into its lack of support for non-BMP characters: mapbox/mapbox-gl-js#4001 (comment). |
@wipfli what's the status of this PR? |
I think this was a successful proof of concept. The next step would be to write a design proposal for the style specification. Do you think this is the right direction? |
Sure, my main question was about the reasons to keep this PR open... |
We can close it. The branch will continue to exist in my repo |
…ecially for latin scripts. Noto Sans Living with unicode ranges in CSS can be used for dynamic loading of script range from canvas/TinySDF
Sharing this code here because I like the idea so much of having better text support in MapLibre GL JS. It is a draft and for inspiration only at this point...
Demo: https://github.com/wipfli/index-by-grapheme
How does it work?
@ separator
"Hallo" -> ["H", "a", "l", "l", "o"]
"H@all@o" -> ["Ha", "l", "lo"]
"H@a@llo" -> ["Hal", "l", "o"]
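The three examples suggest a simple rule: an `@` glues the characters on either side of it into one cluster, and everything else stands alone. A minimal sketch of that rule follows; the function name `splitClusters` is hypothetical and the code is not copied from the branch.

```typescript
// Hypothetical helper illustrating the "@" separator examples.
// A joiner glues the next character onto the previous cluster.
function splitClusters(marked: string, joiner: string = "@"): string[] {
  const clusters: string[] = [];
  let joinNext = false;
  for (const ch of marked) { // iterates by code point, not UTF-16 unit
    if (ch === joiner) {
      joinNext = true;
      continue;
    }
    if (joinNext && clusters.length > 0) {
      const last = clusters.pop() ?? "";
      clusters.push(last + ch);
    } else {
      clusters.push(ch);
    }
    joinNext = false;
  }
  return clusters;
}

console.log(splitClusters("H@all@o")); // ["Ha", "l", "lo"]
```

A leading joiner has nothing to attach to and is ignored in this sketch.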
What can it do?
It can render complex text on point labels and along lines.
Here are some languages:
Here are some cool cities:
And more:
What can it not do?
I don't know. Feel free to give some feedback if stuff does not work in your language...
Right-to-left languages like Hebrew and Arabic are not handled correctly.
CHANGELOG.md under the ## main section.