Implement Unicode combining characters #2388

ZeLonewolf · 2023-04-13T17:30:02Z

This PR implements unicode combining characters so that diacritic characters in the Unicode combiner block (0x300-0x36F) are rendered with preceding characters. This code tests to see if a character is combining, and if so, it inspects the dimensions of the prior character and renders it in position accordingly.

Posting as draft for collaboration. I'm not sure if this introduces side effects and I haven't look into unit testing options.

Note that this does not address #50 which requires a more comprehensive solution.

Samples:

name: sməqʷəʔelə həw̓aləm̓ew̓txʷ (OSM location)

Before:

After:

Location: https://zelonewolf.github.io/openstreetmap-americana/#map=18/49.02475/-123.10267&language=mul

HarelM · 2023-04-13T18:16:46Z

THANKS! looks pretty straight forward.
Are the combining characters value is a convention (saw ilthat you added it as hard coded 300 to 36f)?
Other than that, a unit test and a render test would be needed here to make sure we keep this behavior correct in the future.
Would also be interested to see what happens for non straight text and RTL text such as:
שָׁלוֹם.
Regardless, I think this is a good step forward even if it doesn't cover all cases.

ZeLonewolf · 2023-04-13T18:28:01Z

Are the combining characters value is a convention (saw ilthat you added it as hard coded 300 to 36f)?

Yes, it's a specific Unicode blocks for combining diacritics. I initially tried to infer this from negative font metrics, which resulted in some comical results such as:

That's because the cursive "f" has a tail that hangs into negative space.

Perhaps what I should have done is check for metrics.advance==0 but I'm increasingly convinced that the safest thing for this narrow issue to explicitly test for characters that we know for absolute certain are supposed to be handled in this way.

ZeLonewolf · 2023-04-13T18:32:37Z

שָׁלוֹם

Based on unicode inspector, that text doesn't have any combining diacriticals. I'll run an overpass query and see if we can find some other examples.

ZeLonewolf · 2023-04-13T18:51:03Z

Overpass query for unicode combining characters:

[out:csv(::id, name)][timeout:2500];
node[name~"[\u0300-\u036E]"];
out;

I'm having trouble locating RTL examples (at least on objects that Americana renders).

HarelM · 2023-04-13T18:51:50Z

Interesting! Thanks for the info.
The "nikud" in Hebrew definitely falls into that category, but it's not common as it is related to Hebrew, which is not a commonly used language, I guess.
Would be interesting to see if this range can be configured so that it can support other characters and have a default range maybe, IDK, just a thought...

HarelM · 2023-04-13T18:55:51Z

As far as I know in Hebrew these common characters are not used, and I also think Arabic has different characters as well (tick above, below, a 9 on the side), so it would make sense you won't find them, also they are usually omitted when writing in both Arabic and Hebrew (they are not "mandatory", so they kinda get neglected, I think).
For example, the above word is usually written as שלום,
Without the extra nikud characters.

ZeLonewolf · 2023-04-13T19:56:46Z

Yep, I found the list of Hebrew and Arabic combiners. They're called something different in unicode (accents, points, etc) but the concept seems the same. This approach doesn't work out of the box with the RTL code points on initial testing, so I think I need to add some RTL detection and flip the direction.

HarelM · 2023-04-13T20:20:07Z

Feel free to split this into two if it makes sense from your end - the split can be without RTL support and with RTL support (assuming not supporting RTL doesn't make things worse).
Thanks!!

1ec5 · 2023-04-13T20:52:56Z

src/symbol/shaping.ts

+function isCombiningCharacter(codePoint: number): boolean {
+    return codePoint >= 0x0300 && codePoint <= 0x036F;
+}


You can avoid these magic numbers by checking whether the codepoint is in one of these ranges:

maplibre-gl-js/src/util/is_char_in_unicode_block.ts

Line 6 in e50a264

const unicodeBlockLookup: UnicodeBlockLookup = {

As discussed above, this is only one of the Unicode blocks dedicated to combining characters, but other combining characters are scattered elsewhere in other blocks. To detect a combining character more reliably, check whether the character’s general property in the Unicode character database is Mark or a subcategory thereof. Note that some marks are spacing, unlike the characters we’ve considered so far.

If this library’s minimum supported browser configuration supports the Unicode flag in regular expressions, you can simply match on a property escape:

Suggested change

function isCombiningCharacter(codePoint: number): boolean {

return codePoint >= 0x0300 && codePoint <= 0x036F;

}

function isCombiningCharacter(codePoint: number): boolean {

return !/\P{M}/u.test(String.fromCodePoint(codePoint));

}

Also, move this utility function to src/util/script_detection.ts, which has lots of similar character range tests. I’m pretty sure that file can be simplified considerably if we’re able to use the Unicode flag in regular expressions. When Mapbox first implemented some of this functionality, we still had to support some older browsers that lacked support for this flag.

This captures both LTR and RTL combing characters, but this PR only works for LTR so far.

You could combine this check with an RTL check if necessary:

maplibre-gl-js/src/util/script_detection.ts

Lines 278 to 279 in e50a264

export function charInRTLScript(char: number) {

// Main blocks for Hebrew, Arabic, Thaana and other RTL scripts

With property escapes, this method can probably also be simplified to test for \p{Bidi_Class=Right_To_Left}. But either way, you’d have to test the surrounding text, not just the character in question, because some characters are used in both LTR and RTL text.

With property escapes, this method can probably also be simplified to test for \p{Bidi_Class=Right_To_Left}.

Apparently Bidi_Class is not exposed in ECMAScript yet, even though there are some other simplification opportunities along these lines: #4541.

src/symbol/shaping.ts

1ec5 · 2023-04-13T20:59:22Z

src/symbol/shaping.ts

+                let retreat = 0;
+
+                if (isCombiningCharacter(codePoint)) {
+                    retreat = (prevMetrics?.width ?? 0) * metrics.left / metrics.width;


The line should retreat if left is negative, even if the character is non-combining. This is useful for spacing characters that overshoot the standard bounds. For example, in many fonts, the letter “f” has a descender that extends to the left of the logical origin. Without an affordance for this case, the kerning gets wonky in italic text.

This sounds logical but it doesn't produce the expected result.
Before/After:

left should also affect the current glyph’s position, not just the line’s advance. I haven’t tried it myself, but I’m guessing the before screenshot above looks OK because there’s enough letter spacing to obscure the poor kerning.

In the after screen shot above, left only affects the glyph's position, not the line's advance. The advance is taken directly from font metrics.

The after screenshot above doesn’t make sense to me. Why would adding a negative left to the x position shift the glyph to the right? I agree that the rest of this conditional block would be irrelevant to a non-combining character; prevMetrics wouldn’t matter in that case.

Here's my current (uncommitted) code that's causing this effect:

if (!vertical) { let yPos = y + baselineOffset; let xPos = x; if (isCombiningCharacter(codePoint)) { const retreat = (prevMetrics?.width ?? 0) * metrics.left / metrics.width; xPos += retreat * section.scale - spacing; yPos += section.scale * (metrics.height - prevMetrics?.height ?? metrics.height); } else { xPos += metrics.left * section.scale; } positionedGlyphs.push({glyph: codePoint, imageName, x: xPos, y: yPos, vertical, scale: section.scale, fontStack: section.fontStack, sectionIndex, metrics, rect}); if (metrics.advance > 0) { x += metrics.advance * section.scale + spacing; } } else {

1ec5 · 2023-04-13T21:00:56Z

src/symbol/shaping.ts

+                }
+
+                const xPos = x + retreat * section.scale;
+                positionedGlyphs.push({glyph: codePoint, imageName, x: xPos, y: yPos, vertical, scale: section.scale, fontStack: section.fontStack, sectionIndex, metrics, rect});
                x += metrics.advance * section.scale + spacing;
            } else {
                shaping.verticalizable = true;
                positionedGlyphs.push({glyph: codePoint, imageName, x, y: y + baselineOffset, vertical, scale: section.scale, fontStack: section.fontStack, sectionIndex, metrics, rect});


A negative offset matters in vertical text too. Depending whether the previous character has been rotated, this character needs to be offset either up or to the left.

1ec5 · 2023-04-14T15:32:06Z

src/symbol/shaping.ts

+
+                if (isCombiningCharacter(codePoint)) {
+                    retreat = (prevMetrics?.width ?? 0) * metrics.left / metrics.width - spacing;
+                    yPos += section.scale * (metrics.height - prevMetrics?.height ?? metrics.height);


I don’t think we can assume that a combining character is always on top of the base letter by the height of the diacritic. Many diacritics go under the base letter or overlap it, and this font (Noto) places its superscript diacritics closer to the base letter.

Fonts usually come with information about where to place diacritics relative to a given base letter glyph. If we need to keep superscript diacritics from touching the base letter, we might have to hard-code the specific ones to shift upward and by how much…

Here's what I'm seeing for glyph information.

Preceeding character:

{"rect":{"x":144,"y":165,"w":29,"h":22},"metrics":{"width":21,"height":14,"left":0,"top":-6,"advance":20}}

Combiner character:

{"rect":{"x":195,"y":165,"w":14,"h":15},"metrics":{"width":6,"height":7,"left":-3,"top":-2,"advance":0}}

Is there enough information here to position the combiner? If so, what should the logic be?

TrueType and OpenType fonts come with anchors on both the base letters and the combining diacritics, which presumably are used to join up the glyphs intelligently. I don’t know much more offhand about how that works. Here’s FontForge showing U+006D and U+0301 in Noto Sans Regular:

The anchor positions vary by glyph and by font. Unfortunately, it doesn’t look like the glyph PBF format preserves any of this information. It would be essential for #50 but possibly out of scope for a quick fix for combining diacritics. Even so, in my opinion, allowing the diacritic to run into the base letter is still better than spacing it out, which looks a lot rawer.

Here's a JSON of codepoint 0x0313 "combining comma above". Any idea where the anchor data is?
font0313.txt

This JSON file contains TrueType and OpenType tables. Most of the tables are the usual things like glyph data and horizontal metrics that we’re already making use of, but GPOS contains advanced glyph positioning metrics. It looks like the key to accessing the attachment points, not only for combining diacritics but also for some complex text forms, such as in Urdu. Also, OS/2 contains various character metrics that Windows uses but other platforms do not, which might be useful for reconstructing the metrics we need, but aren’t guaranteed to be present.

Finally, something with some &^%$#@! documentation.

1ec5 · 2023-04-14T15:37:55Z

src/symbol/shaping.ts

+                const xPos = x + retreat * section.scale;
+                positionedGlyphs.push({glyph: codePoint, imageName, x: xPos, y: yPos, vertical, scale: section.scale, fontStack: section.fontStack, sectionIndex, metrics, rect});
+                x += metrics.advance * section.scale;
+                if (!isCombiningCharacter(codePoint)) {


This code might need to be updated to reflect the changes you’re making to the advance. Since it doesn’t seem to be affecting the glyph layout in your examples, I’m guessing it’s only used for line justification/line wrapping, but that does play into symbol collision:

maplibre-gl-js/src/symbol/shaping.ts

Lines 374 to 377 in 6d55d5b

const positions = glyphMap[section.fontStack];

const glyph = positions && positions[codePoint];

if (!glyph) return 0;

return glyph.metrics.advance * section.scale + spacing;

HarelM · 2023-04-14T18:21:41Z

Don't forget to add render tests with all the great examples here - these are great for ATDD (Acceptance test driven development tests)! :-)

ZeLonewolf · 2023-04-14T18:44:48Z

Don't forget to add render tests with all the great examples here - these are great for ATDD (Acceptance test driven development tests)! :-)

Don't get too excited until we actually get this thing working 😅

HarelM · 2023-04-14T19:01:58Z

Fair enough :-)

ZeLonewolf · 2023-05-02T20:06:38Z

Just an update that I'm still working on this. I've got some test code that is successfully parsing a font file using the OpenType spec in order to extract just the shaping information (using Google Noto regular). I would then produce a JSON file that we could parse in order to do shaping. In theory this approach would work "perfectly" if you happen to use Google Noto regular but be slightly off for any other font series (but should still be better than present). So far this approach seems feasible (but still not a replacement for proper font support). I would like to ask if there is any issue from a licensing perspective from extracting font metric information from these font stacks and incorporating them into this repository.

1ec5 · 2023-05-02T20:22:39Z

I would like to ask if there is any issue from a licensing perspective from extracting font metric information from these font stacks and incorporating them into this repository.

That’s a good question. The architecture around fontstacks is overkill compared to leveraging Web fonts on each platform, but I assume the design limits the impact of licensing issues. In the U.S., typefaces and their characters themselves are ineligible for copyright protection, but a scalable font is copyrightable. So if you have enough rights to use the font yourself but not redistribute it, then you could apply the font on the server side and distribute a screenshot of each glyph that the client would only be responsible for piecing together. On the other hand, if the PBFs were to contain the vector data of each glyph’s paths, then the font’s license would be enforceable. Your question might boil down to whether distributing the font metrics similarly ventures into scalable font territory. It’s probably a good idea to limit the metrics to just what’s necessary, if not for licensing reasons then for performance.

I don’t know if anyone is taking advantage of this workaround in the fontstack architecture, given the popularity of fonts like Noto that allow redistribution under the SIL Open Font License.

HarelM · 2023-06-28T18:03:39Z

Is there a plan to push this PR forward?
I would like to clean up the PR list so I can concentrate on those the require my attention.
You can always open a new PR if needed.

ZeLonewolf · 2023-06-28T19:39:08Z

Unfortunately this ended up being more complicated than I expected and it's going to take more work to investigate and do it properly. I don't think it makes sense to implement what's essentially a hack that will work for some combiners but be wrong for others. So I'll close this for now.

ZeLonewolf added 3 commits April 13, 2023 13:15

Implement combining characters

6d30a26

Merge branch 'main' into zlw-font-combiner

c5341ec

Fix null pointer if first character is a combiner

7abd57b

1ec5 reviewed Apr 13, 2023

View reviewed changes

Fix spacing application

7c1f5d1

1ec5 reviewed Apr 14, 2023

View reviewed changes

1ec5 mentioned this pull request Apr 29, 2023

Combining characters rendering in wrong place #2384

Open

ZeLonewolf closed this Jun 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Unicode combining characters #2388

Implement Unicode combining characters #2388

ZeLonewolf commented Apr 13, 2023

HarelM commented Apr 13, 2023

ZeLonewolf commented Apr 13, 2023

ZeLonewolf commented Apr 13, 2023 •

edited

Loading

ZeLonewolf commented Apr 13, 2023

HarelM commented Apr 13, 2023

HarelM commented Apr 13, 2023 •

edited

Loading

ZeLonewolf commented Apr 13, 2023

HarelM commented Apr 13, 2023

1ec5 Apr 13, 2023 •

edited

Loading

1ec5 Apr 13, 2023 •

edited

Loading

ZeLonewolf Apr 13, 2023

1ec5 Apr 13, 2023

1ec5 Aug 12, 2024

1ec5 Apr 13, 2023

ZeLonewolf Apr 13, 2023

1ec5 Apr 13, 2023 •

edited

Loading

ZeLonewolf Apr 13, 2023

1ec5 Apr 14, 2023

ZeLonewolf Apr 14, 2023

1ec5 Apr 13, 2023

1ec5 Apr 14, 2023

ZeLonewolf Apr 15, 2023

1ec5 Apr 15, 2023

ZeLonewolf Apr 15, 2023

1ec5 Apr 27, 2023 •

edited

Loading

ZeLonewolf Apr 30, 2023

1ec5 Apr 14, 2023

HarelM commented Apr 14, 2023

ZeLonewolf commented Apr 14, 2023

HarelM commented Apr 14, 2023

ZeLonewolf commented May 2, 2023 •

edited

Loading

1ec5 commented May 2, 2023 •

edited

Loading

HarelM commented Jun 28, 2023

ZeLonewolf commented Jun 28, 2023

	export function charInRTLScript(char: number) {
	// Main blocks for Hebrew, Arabic, Thaana and other RTL scripts

	const positions = glyphMap[section.fontStack];
	const glyph = positions && positions[codePoint];
	if (!glyph) return 0;
	return glyph.metrics.advance * section.scale + spacing;

Implement Unicode combining characters #2388

Implement Unicode combining characters #2388

Conversation

ZeLonewolf commented Apr 13, 2023

HarelM commented Apr 13, 2023

ZeLonewolf commented Apr 13, 2023

ZeLonewolf commented Apr 13, 2023 • edited Loading

ZeLonewolf commented Apr 13, 2023

HarelM commented Apr 13, 2023

HarelM commented Apr 13, 2023 • edited Loading

ZeLonewolf commented Apr 13, 2023

HarelM commented Apr 13, 2023

1ec5 Apr 13, 2023 • edited Loading

Choose a reason for hiding this comment

1ec5 Apr 13, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

1ec5 Apr 13, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

1ec5 Apr 27, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

HarelM commented Apr 14, 2023

ZeLonewolf commented Apr 14, 2023

HarelM commented Apr 14, 2023

ZeLonewolf commented May 2, 2023 • edited Loading

1ec5 commented May 2, 2023 • edited Loading

HarelM commented Jun 28, 2023

ZeLonewolf commented Jun 28, 2023

ZeLonewolf commented Apr 13, 2023 •

edited

Loading

HarelM commented Apr 13, 2023 •

edited

Loading

1ec5 Apr 13, 2023 •

edited

Loading

1ec5 Apr 13, 2023 •

edited

Loading

1ec5 Apr 13, 2023 •

edited

Loading

1ec5 Apr 27, 2023 •

edited

Loading

ZeLonewolf commented May 2, 2023 •

edited

Loading

1ec5 commented May 2, 2023 •

edited

Loading